thammegowda / mtdata

A tool that locates, downloads, and extracts machine translation corpora
https://pypi.org/project/mtdata/
Apache License 2.0

Add datasets listed by Stanford NMT #84

Closed: thammegowda closed this issue 2 years ago

thammegowda commented 2 years ago

https://nlp.stanford.edu/projects/nmt/

I see these datasets used in some papers; let's add them in. Note: they are preprocessed.

Preprocessed Data

    WMT'15 English-Czech data [Large]

        Train (15.8M sentence pairs): [train.en] [train.cs]
        Test: [newstest2013.en] [newstest2013.cs] [newstest2014.en] [newstest2014.cs] [newstest2015.en] [newstest2015.cs]
        Word Vocabularies (top frequent words): [vocab.1K.en] [vocab.1K.cs] [vocab.10K.en] [vocab.10K.cs] [vocab.20K.en] [vocab.20K.cs] [vocab.50K.en] [vocab.50K.cs]
        Dictionary (extracted from alignment data): [dict.en-cs]
        Character Vocabularies: [vocab.char.200.en] [vocab.char.200.cs]
        Note: we used this dataset in our ACL'16 paper [bib]. 

    WMT'14 English-German data [Medium]

        Train (4.5M sentence pairs): [train.en] [train.de]
        Test: [newstest2012.en] [newstest2012.de] [newstest2013.en] [newstest2013.de] [newstest2014.en] [newstest2014.de] [newstest2015.en] [newstest2015.de]
        Vocabularies (top 50K frequent words): [vocab.50K.en] [vocab.50K.de]
        Dictionary (extracted from alignment data): [dict.en-de]
        Note: we used this dataset in our EMNLP'15 paper [bib].
        Also, for historical reasons, we split compound words, e.g., "rich-text format" --> rich ##AT##-##AT## text format (see the sketch after this list). 

    IWSLT'15 English-Vietnamese data [Small]

        Train (133K sentence pairs): [train.en] [train.vi]
        Test: [tst2012.en] [tst2012.vi] [tst2013.en] [tst2013.vi]
        Vocabularies (top 50K frequent words): [vocab.en] [vocab.vi]
        Dictionary (extracted from alignment data): [dict.en-vi]
        Note: we used this dataset in our IWSLT'15 paper [bib]. 
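
To make the ##AT## convention above concrete, here is a minimal sketch of the splitting and its inverse. This is illustrative only: it is not the original Stanford preprocessing script, and the `split_compounds`/`join_compounds` helper names are hypothetical.

```python
import re

# Illustrative only, not the original Stanford script: an intra-word
# hyphen becomes the separate token "##AT##-##AT##", and the inverse
# mapping restores the hyphen.

def split_compounds(text: str) -> str:
    # "rich-text format" -> "rich ##AT##-##AT## text format"
    return re.sub(r'(?<=\w)-(?=\w)', ' ##AT##-##AT## ', text)

def join_compounds(text: str) -> str:
    # "rich ##AT##-##AT## text format" -> "rich-text format"
    return text.replace(' ##AT##-##AT## ', '-')

assert split_compounds("rich-text format") == "rich ##AT##-##AT## text format"
assert join_compounds("rich ##AT##-##AT## text format") == "rich-text format"
```
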
kpu commented 2 years ago

Please don't. The preprocessing script is lost. Use of WMT14 en-de is discouraged. https://bricksdont.github.io/posts/2020/12/using-old-data/

I'm trying to help by way of rejecting papers with poor experimental practices.

thammegowda commented 2 years ago

Sorry, I read your comment after making a pull request.

I was going through a paper published at IWSLT 2021 that used the IWSLT'15 en-vi preprocessed dataset from this link, and I wanted a fair comparison with it.
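
For context, the pull request registers entries in mtdata's index, roughly along these lines. A minimal sketch, assuming mtdata's `Entry`/`DatasetId` conventions; the group/name/version strings and the two-URL form are my assumptions here, not the actual PR contents.

```python
# A sketch of registering the Stanford IWSLT'15 en-vi training files in
# mtdata's index. Field names follow my reading of mtdata's index
# conventions; the identifier strings are illustrative.
from mtdata.index import Index, Entry, DatasetId

BASE = 'https://nlp.stanford.edu/projects/nmt/data/iwslt15.en-vi'

def load_all(index: Index):
    index.add_entry(Entry(
        did=DatasetId(group='StanfordNLP', name='iwslt15_preproc_train',
                      version='1', langs=('eng', 'vie')),
        # assumption: a (source_url, target_url) pair, one plain-text file per side
        url=(f'{BASE}/train.en', f'{BASE}/train.vi'),
        ext='txt'))
```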

kpu commented 2 years ago

There is no such thing as a fair comparison with an existing paper that edits the reference by tokenizing and compound splitting in ways that are usually not documented in the paper. https://github.com/pytorch/fairseq/issues/346

thammegowda commented 2 years ago

Oh yes, I also spent a few weeks trying to reproduce the results of the 'Attention Is All You Need' paper using my own code, and learned this lesson the hard way.

This time I tried to compare some experiments with https://aclanthology.org/2021.iwslt-1.33/ ; now I see they used multi-bleu.perl to report tokenized BLEU! So there is no need for a "fair comparison" in this scenario.
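
To make the gap concrete, here is a toy sketch with sacreBLEU (the sentences are made up): scoring the same outputs with sacreBLEU's internal tokenizer versus no tokenization at all, which is effectively what multi-bleu.perl-style tokenized scoring reports, gives different numbers, so the two are not comparable.

```python
# Toy illustration (made-up sentences) of why tokenized BLEU from
# multi-bleu.perl is not comparable to sacreBLEU's default scores:
# the tokenization applied before counting n-grams changes the number.
import sacrebleu

hyps = ['The cat sat on the mat.', 'It uses a rich-text format.']
refs = [['The cat sat on a mat.', 'It used a rich-text format.']]

# Default: sacreBLEU applies its own '13a' tokenizer internally,
# so scores are comparable across papers.
print(sacrebleu.corpus_bleu(hyps, refs, tokenize='13a').score)

# 'none' counts n-grams over the raw strings, the way multi-bleu.perl
# scores whatever tokenization it was fed: a different, incomparable number.
print(sacrebleu.corpus_bleu(hyps, refs, tokenize='none').score)
```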

Thanks for this discussion; very helpful.