nipunsadvilkar / pySBD

🐍💯pySBD (Python Sentence Boundary Disambiguation) is a rule-based sentence boundary detection library that works out-of-the-box.
MIT License

How is accuracy on OPUS-100 computed? #117

Open bminixhofer opened 2 years ago

bminixhofer commented 2 years ago

Hi! Thanks for this library.

Since there is no notion of documents in the OPUS-100 dataset, it is not clear to me how accuracy is computed. I tried a naive approach, joining consecutive sentences pairwise:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]

    correct = 0
    total = 0

    segmenter = pysbd.Segmenter(language="de")

    for sent1, sent2 in zip(sentences, sentences[1:]):
        out = tuple(
            s.strip() for s in segmenter.segment(sent1 + " " + sent2)
        )

        total += 1

        if out == (sent1, sent2):
            correct += 1

    print(f"{correct}/{total} = {correct / total}")

But I get 1011/1999 ≈ 50.6% accuracy, which is not close to the 80.95% accuracy reported in the paper.

Thanks for any help!

nipunsadvilkar commented 2 years ago

Hey @bminixhofer,

I don't remember exactly, but it was something like this:

from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]
    text = " ".join(sentences)
    total = len(sentences)

    segmenter = pysbd.Segmenter(language="de")
    segments = segmenter.segment(text)
    # count gold sentences that reappear verbatim among the predicted segments
    correct = len(set(sentences).intersection(s.strip() for s in segments))
    print(f"{correct}/{total} = {correct / total}")

Also, note that I didn't use the datasets library but the OPUS-100 dataset in raw format, downloaded from the official source: https://opus.nlpl.eu/opus-100.php
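
For anyone reproducing this against the raw release, here is a minimal sketch. The file path below is an assumption about the v1.0 archive layout, not something stated in this thread; adjust it to wherever the download was extracted.

import pysbd

# Assumed location of the German side of the test split inside the raw
# OPUS-100 v1.0 archive; adjust to your local extraction path.
TEST_FILE = "opus-100-corpus/v1.0/supervised/de-en/opus.de-en-test.de"

if __name__ == "__main__":
    with open(TEST_FILE, encoding="utf-8") as f:
        # the raw release stores one sentence per line
        sentences = [line.strip() for line in f if line.strip()]

    segmenter = pysbd.Segmenter(language="de")
    segments = segmenter.segment(" ".join(sentences))

    # same set-intersection criterion as above
    correct = len(set(sentences).intersection(s.strip() for s in segments))
    total = len(sentences)
    print(f"{correct}/{total} = {correct / total}")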