zhangjcqq opened this issue 6 years ago
I have not inspected the Google Drive data, but if it really is the En-De training data from WMT16, then the only difference from WMT14 is the NewsCommentary corpus (v11 vs. v9). However, Europarl and CommonCrawl (which are much bigger) are the same. Although it is common to evaluate new MT systems (trained on new WMT training data) on old test sets, I agree it is not fully comparable: NewsCommentary may contain documents on the same topics as newstest2014.
You should distinguish between TranslateEndeWmtBpe32k (a legacy setup using Rico Sennrich's original BPE) and TranslateEndeWmt32k (using T2T's internal subwords, a.k.a. word-pieces). The "Attention is all you need" paper used the former for En-De, but I am not sure anyone uses it anymore. Word-pieces are reportedly better than BPE.
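For concreteness, here is a minimal sketch (assuming a working tensor2tensor install) that just resolves the two registered problem names, so you can check which setup you are actually training:

```python
# Minimal sketch, assuming tensor2tensor is installed and importable.
from tensor2tensor import problems  # importing this registers the built-in problems

bpe_problem = problems.problem("translate_ende_wmt_bpe32k")    # legacy setup with external BPE
wordpiece_problem = problems.problem("translate_ende_wmt32k")  # T2T internal word-pieces
print(type(bpe_problem).__name__, type(wordpiece_problem).__name__)
```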
See https://github.com/tensorflow/tensor2tensor/issues/317#issuecomment-400610990 (note that these results use T2T word-pieces, not BPE).
See https://github.com/tensorflow/tensor2tensor/issues/317#issuecomment-380970191 (but I suggest using sacreBLEU instead of trying to reproduce the hacky evaluation).
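If you go the sacreBLEU route, here is a minimal sketch with its Python API (pip install sacrebleu); the file names below are just placeholders for your detokenized hypotheses and the newstest2014 reference:

```python
# Minimal sketch using sacreBLEU's Python API; file names are placeholders.
import sacrebleu

with open("newstest2014.hyp.detok.de", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("newstest2014.ref.de", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # sacreBLEU applies its own tokenization
print(bleu.score)
```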
@martinpopel Thanks for your nearly instant reply. I still have some questions to discuss.
I am not sure what exact training data and BPE were used for "Attention is all you need". My impression was that the Google Drive file you linked (and which is referenced in the code) is exactly what was used in the paper. I would guess that the effect of NewsCommentary v9 vs. v11 is not big in the end (as all the training data is shuffled uniformly), but without an experiment we cannot be sure.
In "Attention is all you need", they call it "word-piece dictionary" and cite the GNMT paper where they write "we adopt the wordpiece model (WPM) implementation initially developed to solve a Japanese/Korean" and cite the paper you linked. I think there are small differences in these three implementations, but they are still more similar to each other than to BPE. See also https://github.com/tensorflow/tensor2tensor/issues/906#issuecomment-401739029 and Macháček et al. (2018).
@martinpopel I ran into a new issue with GPU OOM (out of memory). I trained the problem "translate_ende_wmt_bpe32k" with hparams_set "transformer_base", in which the default batch_size is 4096. When I used 1 GPU, training ran fine. However, after I set worker_gpu=8, the OOM error occurred. With worker_gpu=8, training only works with batch_size=1024. Have you run into this issue before? Could you suggest some possible reasons?
In my experiments, the maximum batch size without OOM was almost the same for 1 GPU and 8 GPUs.
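For reference, batch_size in transformer_base is (as far as I know) counted in subword tokens per GPU, so the usual OOM workaround is simply lowering it, e.g. via --hparams='batch_size=2048' on the t2t-trainer command line. A minimal sketch of the same override via the Python API:

```python
# Minimal sketch; shows the default batch_size and how to override it programmatically.
from tensor2tensor.models import transformer

hparams = transformer.transformer_base()
print(hparams.batch_size)   # 4096 subword tokens per GPU by default
hparams.batch_size = 2048   # lowering it reduces per-GPU memory pressure
```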
BTW: GitHub issues should stay focused on a single topic in order to be helpful for other users. I think you can close this issue and possibly open a new one (after checking it was not already reported).
OK, got it! Thanks very much!
Description
I'm trying to reproduce the En-De experiment in the paper "Attention is all you need", but I'm confused by the training data. The paper used the WMT14 training data, while the following URL seems to link to the WMT16 training data: https://drive.google.com/uc?export=download&id=0B_bZck-ksdkpM25jRUN2X2UxMm8 My questions are:
Thanks!