tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

WMT14 En-de Dataset and decoding result #939

Open zhangjcqq opened 6 years ago

zhangjcqq commented 6 years ago

Description

I'm trying to reproduce the En-De experiment in the paper "Attention is all you need", but I'm confused about the training data. The paper uses the WMT14 training data, while the following URL seems to link to the WMT16 training data: https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8 My questions are:

  1. Is the data at the above download link identical to the data described in the paper?
  2. Can I directly use the training data and the BPE vocabulary from the downloaded file to reproduce the paper's experiment?
  3. Can I find decoding results somewhere with a BLEU score equal or close to the one reported in the paper (27.3 or 28)?
  4. Which specific scripts were used to compute the BLEU score in the paper?

Thanks!

martinpopel commented 6 years ago
  1. I have not inspected the Google Drive data, but if it really is the En-De training data from WMT16, then the only difference from WMT14 is the NewsCommentary corpus (v11 vs. v9); Europarl and CommonCrawl (which are much bigger) are the same. Although it is common to evaluate new MT systems (trained on newer WMT training data) on old test sets, I agree it is not fully comparable: NewsCommentary v11 may contain documents on the same topics as newstest2014.

  2. You should distinguish between TranslateEndeWmtBpe32k (a legacy setup using Rico's original BPE) and TranslateEndeWmt32k (using T2T-internal subwords, aka word-pieces). The "Attention is all you need" paper used the former for En-De, but I am not sure anyone uses it anymore. Word-pieces are reportedly better than BPE.

  3. See https://github.com/tensorflow/tensor2tensor/issues/317#issuecomment-400610990 (note that those results use T2T word-pieces, not BPE).

  4. See https://github.com/tensorflow/tensor2tensor/issues/317#issuecomment-380970191 (but I suggest using sacreBLEU instead of trying to reproduce the hacky evaluation; see the sketch after this list).
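
For point 4, here is a minimal sacreBLEU sketch in Python (an illustration only, not the exact scoring procedure used for the paper; the file names are placeholders):

```python
# Score detokenized system output against the plain-text reference with sacreBLEU.
# File names are placeholders for this example.
import sacrebleu

with open("newstest2014.de") as f:                 # detokenized reference
    refs = [line.strip() for line in f]
with open("transformer_output.detok.de") as f:     # detokenized system output
    hyps = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hyps, [refs])         # default '13a' tokenization
print(bleu.score)
```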

zhangjcqq commented 6 years ago

@martinpopel Thanks for your nearly instant reply. I still have some questions to discuss.

  1. I agree with you that it is not rigorous to use that 'wmt16' dataset to reproduce the paper's experiment. I noticed that the Stanford NLP Group published their preprocessed WMT14 dataset without a BPE vocabulary, but I think using that dataset to reproduce the experiment is also problematic because of tokenization and cleaning differences. So, have the authors of "Attention is all you need" released their WMT14 En-De training data and vocabulary somewhere? I would really like to use that.
  2. I know that "TranslateEndeWmtBpe32k" just loads an externally extracted BPE vocabulary; that's why I want to get the original BPE vocabulary used for the paper. However, I don't think the internal subword extraction algorithm in "TranslateEndeWmt32k" is the word-piece approach. As far as I can see, it is the BPE algorithm with a binary search to determine the approximate BPE vocabulary size. Maybe @lukaszkaiser can clear this up. Here is the wordpiece model: https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/37842.pdf
  3. I want to reproduce the experiment result with BPE subwords. :-)
  4. I often use "multi-bleu.perl" to compute the BLEU score, but it depends on the tokenization. I want a definitive answer about the scorer used in the paper. Are you sure it was the tweaked reference + get_ende_bleu.sh?
martinpopel commented 6 years ago
  1. I am not sure what exact training data and BPE were used for "Attention is all you need". My impression was that the google drive file you linked (and which is in the code) is exactly what was used in the paper. I would guess that the effect of NewsCommentary v9 vs v11 is not big in the end (as all the training data was shuffled uniformly), but without an experiment we cannot be sure.

  2. In "Attention is all you need", they call it "word-piece dictionary" and cite the GNMT paper where they write "we adopt the wordpiece model (WPM) implementation initially developed to solve a Japanese/Korean" and cite the paper you linked. I think there are small differences in these three implementations, but they are still more similar to each other than to BPE. See also https://github.com/tensorflow/tensor2tensor/issues/906#issuecomment-401739029 and Macháček et al. (2018).

zhangjcqq commented 6 years ago
  1. Yes, I will run experiments on the "wmt16" dataset. Still, I would like an official answer about the dataset issue. :-)
  2. The En-Fr experiment used the word-piece approach, but is it identical to the subword extraction code in this project? Also waiting for an official answer. :-) Many thanks to @martinpopel, you really are an expert on the Tensor2Tensor project. :-)
zhangjcqq commented 6 years ago

@martinpopel I have run into a new issue with GPU OOM (out of memory). I trained the problem "translate_ende_wmt_bpe32k" with hparams_set "transformer_base", in which the default batch_size is 4096. When I used 1 GPU, training went fine. However, after I set worker_gpu=8, the OOM issue occurred. With worker_gpu=8, training only works with batch_size=1024. Have you run into this issue before? Could you suggest some possible reasons?
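
(For reference, the configuration described above corresponds roughly to the following sketch; the values are taken from this comment and are illustrative only. As far as I understand, T2T's batch_size is counted in tokens per GPU.)

```python
from tensor2tensor.models import transformer

# transformer_base defaults to batch_size = 4096 (tokens per GPU).
hparams = transformer.transformer_base()
print(hparams.batch_size)

# Reduced value that reportedly fits in memory with worker_gpu=8 in the setup above.
hparams.batch_size = 1024
```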

martinpopel commented 6 years ago

In my experiments, the maximum batch size without OOM was almost the same for 1 GPU and 8 GPUs.

BTW: GitHub issues should stay focused on a single topic in order to be helpful for other users. I think you can close this issue and possibly open a new one (after checking it was not already reported).

zhangjcqq commented 6 years ago

OK, got it! Thanks very much!