Open bminixhofer opened 2 years ago
Hey @bminixhofer,
I don't remember exactly, but it was something like this:
from datasets import load_dataset
import pysbd

if __name__ == "__main__":
    sentences = [
        sample["de"].strip()
        for sample in load_dataset("opus100", "de-en", split="test")["translation"]
    ]
    text = " ".join(sentences)
    total = len(sentences)

    segmenter = pysbd.Segmenter(language="de")
    # Strip whitespace so segments can match the stripped gold sentences.
    segments = [seg.strip() for seg in segmenter.segment(text)]

    # Count gold sentences that reappear verbatim among the predicted segments.
    correct = len(set(segments).intersection(set(sentences)))
    print(f"{correct}/{total} = {correct / total}")
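On toy inputs, the set-based counting in the last step works like this (the stand-in segments below are illustrative, not actual pysbd output):

```python
# Toy illustration of the set-intersection accuracy: count how many
# gold sentences reappear verbatim among the predicted segments.
gold = ["Ein Satz.", "Noch einer.", "Der dritte."]
text = " ".join(gold)

# Stand-in for segmenter.segment(text); assume the segmenter
# wrongly merged the last two sentences into one segment.
segments = ["Ein Satz.", "Noch einer. Der dritte."]

correct = len(set(segments).intersection(set(gold)))
total = len(gold)
print(f"{correct}/{total} = {correct / total:.2%}")  # 1/3 = 33.33%
```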
Also, note that I didn't use datasets but the OPUS-100 dataset in raw format, downloaded from the official source: https://opus.nlpl.eu/opus-100.php
Hi! Thanks for this library.
Since there is no notion of documents in the OPUS-100 dataset, it is not clear to me how accuracy is computed. I tried a naive approach using pairwise joining of sentences:
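A minimal sketch of that pairwise approach, written against a generic segment callable; a trivial period-based splitter stands in for pysbd.Segmenter(language="de").segment (the names here are my own, for illustration):

```python
import re

def pairwise_accuracy(sentences, segment):
    """Join each pair of adjacent gold sentences, segment the joined
    text, and count the pair as correct only if segmentation recovers
    exactly the two original sentences."""
    correct = 0
    pairs = list(zip(sentences, sentences[1:]))
    for a, b in pairs:
        segments = [s.strip() for s in segment(a + " " + b)]
        if segments == [a, b]:
            correct += 1
    return correct, len(pairs)

# Trivial stand-in segmenter: split after any period followed by space.
# In the real evaluation this would be pysbd.Segmenter(language="de").segment.
def naive_segment(text):
    return re.split(r"(?<=\.)\s+", text)

sentences = ["Ein Satz.", "Noch einer.", "Der Preis liegt bei z.B. zehn Euro."]
correct, total = pairwise_accuracy(sentences, naive_segment)
# The naive splitter breaks on the abbreviation "z.B.", so only the
# first pair is counted as correct.
print(f"{correct}/{total} = {correct / total:.2%}")  # 1/2 = 50.00%
```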
But I get 1011/1999 = 50.6% accuracy, which is not close to the 80.95% accuracy reported in the paper. Thanks for any help!