Open bminixhofer opened 2 years ago
Hey @bminixhofer,
I don't remember exactly, but it was something like this:
from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]
    text = " ".join(sentences)
    total = len(sentences)

    segmenter = pysbd.Segmenter(language="de")
    # Strip whitespace so segments can match the stripped gold sentences.
    segments = [seg.strip() for seg in segmenter.segment(text)]

    # Count gold sentences that reappear verbatim among the predicted segments.
    correct = len(set(segments).intersection(set(sentences)))
    print(f"{correct}/{total} = {correct / total}")
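On toy inputs, the set-based counting in the last step works like this (the stand-in segments below are illustrative, not actual pysbd output):

```python
# Toy illustration of the set-intersection accuracy: count how many
# gold sentences reappear verbatim among the predicted segments.
gold = ["Ein Satz.", "Noch einer.", "Der dritte."]
text = " ".join(gold)

# Stand-in for segmenter.segment(text); assume the segmenter
# wrongly merged the last two sentences into one segment.
segments = ["Ein Satz.", "Noch einer. Der dritte."]

correct = len(set(segments).intersection(set(gold)))
total = len(gold)
print(f"{correct}/{total} = {correct / total:.2%}")  # 1/3 = 33.33%
```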
Also, note that I didn't use datasets but the OPUS-100 dataset in raw format, downloaded from the official source: https://opus.nlpl.eu/opus-100.php
Hi! Thanks for this library.
Since there is no notion of documents in the OPUS-100 dataset, it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:
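A minimal sketch of that pairwise approach, written against a generic segment callable; a trivial period-based splitter stands in for pysbd.Segmenter(language="de").segment (the names here are my own, for illustration):

```python
import re

def pairwise_accuracy(sentences, segment):
    """Join each pair of adjacent gold sentences, segment the joined
    text, and count the pair as correct only if segmentation recovers
    exactly the two original sentences."""
    correct = 0
    pairs = list(zip(sentences, sentences[1:]))
    for a, b in pairs:
        segments = [s.strip() for s in segment(a + " " + b)]
        if segments == [a, b]:
            correct += 1
    return correct, len(pairs)

# Trivial stand-in segmenter: split after any period followed by space.
# In the real evaluation this would be pysbd.Segmenter(language="de").segment.
def naive_segment(text):
    return re.split(r"(?<=\.)\s+", text)

sentences = ["Ein Satz.", "Noch einer.", "Der Preis liegt bei z.B. zehn Euro."]
correct, total = pairwise_accuracy(sentences, naive_segment)
# The naive splitter breaks on the abbreviation "z.B.", so only the
# first pair is counted as correct.
print(f"{correct}/{total} = {correct / total:.2%}")  # 1/2 = 50.00%
```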
But I get 1011/1999 = 50.6% accuracy, which is not close to the 80.95% accuracy reported in the paper. Thanks for any help!