Open trina731 opened 3 years ago
Hi! Thanks for reporting.
By looking at the raw news-commentary-v14.en-kk.tsv file, it looks like there are at least 17 lines with this issue.
Moreover, these issues are not always the same:
- a line contains only kk text, which must be appended at the end of the kk text of the next line
- a line contains only kk text, which must be appended at the end of the kk text of the previous line
- consecutive lines contain only kk texts, which must be inserted at the beginning of the kk text of the next line

It would be nice to have a corrected version of this file! The file is available in the wmt/news-commentary repository on the Datasets Hub here:
https://huggingface.co/datasets/wmt/news-commentary/tree/main/v14/training
Then maybe we can notify the WMT authors and host the corrected version somewhere.
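As a starting point for building a corrected file, here is a minimal sketch for locating the problematic lines. It assumes each well-formed line has exactly two non-empty tab-separated fields (English, then Kazakh); that layout, the function name `find_suspicious_lines`, and the sample rows are illustrative assumptions, not taken from the actual file. Since the fix differs from case to case (merge with the next line or with the previous one), the sketch only reports candidates and leaves the repair to manual review.

```python
def find_suspicious_lines(lines, expected_fields=2):
    """Return (line_number, fields) for every line that does not have
    exactly `expected_fields` non-empty tab-separated fields.

    Assumption: a well-formed line looks like "<en text>\t<kk text>".
    """
    suspicious = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        non_empty = [field for field in fields if field.strip()]
        if len(non_empty) != expected_fields:
            suspicious.append((line_number, fields))
    return suspicious


if __name__ == "__main__":
    # Placeholder rows (hypothetical, not real data from the corpus).
    sample = [
        "EN 1\tKK 1\n",
        "KK 2 (first half)\n",          # only one field: no English side
        "EN 2\tKK 2 (second half)\n",
        "\tKK 3 (tail)\n",              # empty English field
    ]
    for line_number, fields in find_suspicious_lines(sample):
        print(line_number, fields)
```

To run it on the real file, pass an open file handle instead of the sample list, for example `find_suspicious_lines(open("news-commentary-v14.en-kk.tsv", encoding="utf-8"))`. The reported line numbers can then be checked by hand to decide which neighbour each fragment belongs to.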
In addition to the bug of languages being switched from Issue @415, there are incorrect translations in the dataset because the English-Kazakh translations have an off-by-one formatting error.
The News Commentary v14 parallel data set for kk-en from http://www.statmt.org/wmt19/translation-task.html has a bug here:
As you can see, line 95 contains only the Kazakh translation, which should be part of line 96. This shifts all of the following English-Kazakh translation pairs by one line, rendering ALL of those translations incorrect. This issue was not fixed when the dataset was imported to Huggingface. By running this code
we get:
which shows that the issue still persists in the Huggingface dataset. The Kazakh sentence matches up to the next English sentence in the dataset instead of the current one.
Please let me know if you have any ideas for fixing this off-by-one error in the dataset, or if this can be fixed by Huggingface.