thammegowda closed this issue 2 years ago
Please don't. The preprocessing script is lost. Use of WMT14 en-de is discouraged. https://bricksdont.github.io/posts/2020/12/using-old-data/
I'm trying to help by rejecting papers with poor experimental practices when I review them.
Sorry, I read your comment after making a pull request.
I was going through a paper published at IWSLT 2021 that used the preprocessed IWSLT15 en-vi dataset from this link, and I wanted a fair comparison with it.
There is no such thing as a fair comparison with an existing paper that edits the reference by tokenizing and compound splitting in ways that are usually not documented in the paper. https://github.com/pytorch/fairseq/issues/346
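To make that concrete, here is a minimal sketch (invented sentences, not taken from any of the papers above) using the sacrebleu Python package. It shows how a compound/hyphen-splitting step applied to both sides inflates the score relative to scoring raw, detokenized text:

```python
import sacrebleu

# Invented example pair, scored on raw, detokenized text. sacrebleu's
# default '13a' tokenizer keeps "state-of-art" / "state-of-the-art" as
# single tokens, so they simply fail to match.
hyp = ["a state-of-art translation system"]
ref = [["a state-of-the-art translation system"]]
print(sacrebleu.corpus_bleu(hyp, ref).score)

# The same pair after a hyphen/compound-splitting step was applied to
# BOTH hypothesis and reference (as preprocessed datasets often do):
# the fragments now partially match and the score jumps.
hyp_split = ["a state - of - art translation system"]
ref_split = [["a state - of - the - art translation system"]]
print(sacrebleu.corpus_bleu(hyp_split, ref_split, tokenize="none").score)
```

The second call reports a much higher score for what is arguably the same translation quality, which is exactly why numbers computed on differently edited references cannot be compared.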
Oh yes, I also spent a few weeks trying to reproduce the results of the 'Attention Is All You Need' paper with my own code, and learned this the hard way.
This time I tried to compare some experiments with https://aclanthology.org/2021.iwslt-1.33/; now I see they used multi-bleu.perl to get tokenized BLEU! So there is no hope of a "fair comparison" in this scenario.
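For anyone landing here later, the safer practice is to score raw, detokenized output with sacrebleu and report the signature it produces. A minimal sketch (the data is a placeholder; assumes sacrebleu >= 2.0):

```python
from sacrebleu.metrics import BLEU

# Placeholder data: detokenized system outputs and references.
hypotheses = ["Đây là một ví dụ."]    # system output, detokenized
references = [["Đây là một ví dụ."]]  # one reference stream

bleu = BLEU()  # defaults: '13a' tokenization applied internally
result = bleu.corpus_score(hypotheses, references)
print(result)                # e.g. BLEU = 100.00 ...
print(bleu.get_signature())  # report this string alongside the score
```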
Thanks for this discussion; very helpful.
https://nlp.stanford.edu/projects/nmt/
I see these datasets used in some papers. Let's add them in. Note: they are preprocessed.