huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

WMT19 Dataset for Kazakh-English is not formatted correctly #2106

Open trina731 opened 3 years ago

trina731 commented 3 years ago

In addition to the bug of the languages being swapped (issue #415), the dataset contains incorrect translations because the English-Kazakh pairs have an off-by-one formatting error.

The News Commentary v14 parallel data set for kk-en from http://www.statmt.org/wmt19/translation-task.html has a bug here:

Line 94. The Swiss National Bank, for its part, has been battling with the deflationary effects of the franc’s dramatic appreciation over the past few years. Швейцарияның Ұлттық банкі өз тарапынан, соңғы бірнеше жыл ішінде франк құнының қатты өсуінің дефляциялық әсерімен күресіп келеді.

Line 95. Дефляциялық күштер 2008 жылы терең және ұзаққа созылған жаһандық дағдарысқа байланысты орын алған ірі экономикалық және қаржылық орын алмасулардың арқасында босатылды. Жеке қарыз қаражаты үлесінің қысқаруы орталық банктің рефляцияға жұмсалған күш-жігеріне тұрақты соққан қарсы желдей болды.

Line 96. The deflationary forces were unleashed by the major economic and financial dislocations associated with the deep and protracted global crisis that erupted in 2008. Private deleveraging became a steady headwind to central bank efforts to reflate. 2009 жылы, алдыңғы қатарлы экономикалардың шамамен үштен бірі бағаның төмендеуін көрсетті, бұл соғыстан кейінгі жоғары деңгей болды.

As you can see, line 95 contains only the Kazakh translation, which should be part of line 96. As a result, every following English-Kazakh pair is off by one, which makes ALL of those translations incorrect. This issue was not fixed when the dataset was imported to Hugging Face. Running this code

from datasets import load_dataset

dataset = load_dataset('wmt19', 'kk-en')
# The English sentence is searched for under the 'kk' key because the language keys are swapped.
target = 'The deflationary forces were unleashed by the major economic and financial dislocations associated with the deep and protracted global crisis that erupted in 2008.'
for pair in dataset['train']['translation']:
    if target in pair['kk']:
        print(pair['en'])
        print(pair['kk'])
        break

we get:

2009 жылы, алдыңғы қатарлы экономикалардың шамамен үштен бірі бағаның төмендеуін көрсетті, бұл соғыстан кейінгі жоғары деңгей болды.
The deflationary forces were unleashed by the major economic and financial dislocations associated with the deep and protracted global crisis that erupted in 2008. Private deleveraging became a steady headwind to central bank efforts to reflate.

which shows that the issue still persists in the Huggingface dataset. The Kazakh sentence matches up to the next English sentence in the dataset instead of the current one.

Please let me know if you have any ideas for fixing this off-by-one error in the dataset, or if this can be fixed by Hugging Face.

lhoestq commented 3 years ago

Hi! Thanks for reporting.

Looking at the raw news-commentary-v14.en-kk.tsv file, there appear to be at least 17 lines with this issue. Moreover, these issues are not always the same.

It would be nice to have a corrected version of this file! The file is available in the wmt/news-commentary repository on the Datasets Hub here: https://huggingface.co/datasets/wmt/news-commentary/tree/main/v14/training

Then maybe we can notify the WMT authors and host the corrected version somewhere.
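
As a possible starting point for producing a corrected file, the rows where a shift begins could be located with a simple heuristic scan. The sketch below is only an illustration: the helper names `has_cyrillic` and `find_suspicious_rows` are hypothetical, and it assumes each row of the TSV is `English<TAB>Kazakh`, which has not been verified against the real file.

```python
def has_cyrillic(text):
    """Return True if the text contains a character from the Unicode Cyrillic block."""
    return any("\u0400" <= ch <= "\u04FF" for ch in text)


def find_suspicious_rows(lines):
    """Yield (line_number, reason) for rows that look malformed.

    Assumption: each row is 'English<TAB>Kazakh'. The column order is a guess
    based on the file name, not something confirmed in this thread.
    """
    for number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            yield number, f"expected 2 columns, found {len(fields)}"
            continue
        english, kazakh = fields
        if not english.strip() or not kazakh.strip():
            yield number, "empty column"
        elif has_cyrillic(english):
            yield number, "Cyrillic text in the English column"
        elif not has_cyrillic(kazakh):
            yield number, "no Cyrillic text in the Kazakh column"
```

Passing an open file handle for a local copy of news-commentary-v14.en-kk.tsv (opened with `encoding="utf-8"`) as `lines` would list candidate rows for manual review. Note that this only flags rows that are malformed themselves, such as a row holding Kazakh text alone. The well-formed but shifted pairs that follow such a row would still need to be realigned by hand or with a separate step.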