lena-voita / the-story-of-heads

This is a repository with the code for the ACL 2019 paper "Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned" and the ACL 2021 paper "Analyzing Source and Target Contributions to NMT Predictions".

Dataset #6

Open NilesJiang opened 3 years ago

NilesJiang commented 3 years ago

Hi lena-voita and RachitBansal,

I am trying to reproduce the experiment using the WMT2018 data (the Yandex corpus, EN-RU). However, the results I got were not satisfying.

I suspect I may have chosen the wrong dataset, since the Yandex corpus (~120 MB) is much smaller than OpenSubtitles2018 (EN-RU).

Would you mind specifying which WMT dataset you were using? Thank you for reading this far.

Faibk commented 2 years ago

I am working on a student project at my university and would like to follow up on this. WMT2018 (https://www.statmt.org/wmt18/translation-task.html#download) officially does not include the EN-FR pair, but the corpora it does include contain these languages after download. So I am assuming the set of datasets used is:

EN-RU:

EN-DE:

EN-FR:

Can you confirm this is the correct set of corpora?

Also, the paper states that only 2.5M samples were used, which matches the size of the accumulated EN-RU corpora (the smallest of the three language pairs, with exactly 2,558,077 samples). How were the other two language pairs cut down to 2.5M? By sampling randomly from the source datasets I listed above, by taking the first N lines of each, or by concatenating them and taking the first 2.5M lines?
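To make the question concrete, the random-sampling variant I have in mind would look roughly like the sketch below. This is only my own minimal sketch with placeholder file names, not anything from the repo; it loads the whole corpus into memory, which should be fine at this scale:

```python
import random

# Placeholder names: the concatenated corpora for one language pair.
SRC_IN, DST_IN = "corpus.en", "corpus.ru"
SRC_OUT, DST_OUT = "corpus.2.5m.en", "corpus.2.5m.ru"
N_SAMPLES = 2_500_000

# Read both sides together so source/target lines stay aligned.
with open(SRC_IN, encoding="utf-8") as f_src, open(DST_IN, encoding="utf-8") as f_dst:
    pairs = list(zip(f_src, f_dst))

random.seed(42)  # fixed seed so the subsample is reproducible
sample = random.sample(pairs, N_SAMPLES)

with open(SRC_OUT, "w", encoding="utf-8") as f_src, open(DST_OUT, "w", encoding="utf-8") as f_dst:
    for src_line, dst_line in sample:
        f_src.write(src_line)
        f_dst.write(dst_line)
```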

Could you kindly provide this information, or the exact dataset used, for reproduction?

Thanks and best regards!

geekayy05 commented 7 months ago

Hey! I am having trouble figuring out what the VOCAB file is here and where to get it. Can someone please help me with that?
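In case it helps to show what I mean: my current guess is that it is a token-frequency list built from the BPE-segmented training data, roughly like the sketch below. The file names and the exact output format are my assumptions, so please check them against the repo's data-loading code:

```python
from collections import Counter

# Assumed paths: BPE-segmented training data and the vocab file to produce.
TRAIN_FILES = ["train.bpe.en", "train.bpe.ru"]
VOCAB_OUT = "vocab.txt"

# Count every BPE token across both sides of the training data.
counts = Counter()
for path in TRAIN_FILES:
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())

# Write one "token<TAB>count" line per token, most frequent first.
with open(VOCAB_OUT, "w", encoding="utf-8") as f:
    for token, count in counts.most_common():
        f.write(f"{token}\t{count}\n")
```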