teslacool / SCA

Soft Contextual Data Augmentation

Setups for reproducing IWSLT14 De-En #16

Open BaohaoLiao opened 4 years ago

Hi,

I want to reproduce your result on IWSLT14 De-En, but I can't reach 35.78; my best result is 34.25. I'd like to ask about a few details of your setup:

1. Do you use shared embeddings? I don't. If yes, what is your vocabulary size?

2. For the language model, I use
   python ~/fairseq/train.py \
   ~/de2en/lmofde \
   --task language_modeling \
   --arch transformer_lm_iwslt \
   --optimizer adam \
   --adam-betas '(0.9, 0.98)' \
   --clip-norm 0.0 \
   --lr-scheduler inverse_sqrt \
   --warmup-init-lr 1e-07 \
   --warmup-updates 4000 \
   --lr 0.0005 \
   --min-lr 1e-09 \
   --dropout 0.1 \
   --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy \
   --label-smoothing 0.1 \
   --max-tokens 4096 \
   --tokens-per-sample 4096 \
   --save-dir $dir \
   --update-freq 16 \
   --no-epoch-checkpoints \
   --log-format simple \
   --log-interval 1000
   for both the De and En language models. I train each language model until convergence and use its best checkpoint for NMT. Do you have any suggestions for my settings?

3. For NMT, I use
   python ~/SCA/train.py \
   $DATA_PATH \
   --task lm_translation \
   --arch transformer_iwslt_de_en \
   --optimizer adam \
   --adam-betas '(0.9, 0.98)' \
   --clip-norm 0.0 \
   --lr-scheduler inverse_sqrt \
   --warmup-init-lr 1e-07 \
   --warmup-updates 4000 \
   --lr 0.0009 \
   --min-lr 1e-09 \
   --dropout 0.3 \
   --weight-decay 0.0 \
   --criterion label_smoothed_cross_entropy \
   --label-smoothing 0.1 \
   --max-tokens 2048 \
   --update-freq 2 \
   --save-dir $SAVE_DIR \
   --tradeoff $i \
   --load-lm \
   --seed 200 \
   --no-epoch-checkpoints \
   --log-format simple \
   --log-interval 1000
   For i (the tradeoff), I use 0.1, 0.15, and 0.2; the best result is obtained with 0.15. When you calculate the BLEU score, do you use the best checkpoint or an averaged checkpoint (and if averaged, over how many epochs' checkpoints)? Do you have any other suggestions?
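
For reference, checkpoint averaging as asked about here is usually done with fairseq's scripts/average_checkpoints.py. A minimal sketch, assuming that script is available alongside SCA and that per-epoch checkpoints are kept (i.e. --no-epoch-checkpoints is dropped from the training command so there are epoch checkpoints to average):

    # Average the last 5 epoch checkpoints in $SAVE_DIR into one model file.
    python ~/fairseq/scripts/average_checkpoints.py \
        --inputs $SAVE_DIR \
        --num-epoch-checkpoints 5 \
        --output $SAVE_DIR/checkpoint_avg5.pt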

By the way, I use 1 GPU. How many GPUs do you use for IWSLT14 De-En and WMT14 En-De, respectively? I need to make sure we use the same effective batch size by setting --update-freq.
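
In fairseq the effective batch size is roughly max-tokens × update-freq × number of GPUs, so matching batch sizes across GPU counts comes down to scaling --update-freq. A quick sketch of the arithmetic with the values from the NMT command above (the 4-GPU line is a hypothetical comparison, not a setting from this thread):

    # Tokens per optimizer step ≈ max-tokens * update-freq * num_gpus
    MAX_TOKENS=2048; UPDATE_FREQ=2; NUM_GPUS=1
    echo $(( MAX_TOKENS * UPDATE_FREQ * NUM_GPUS ))   # -> 4096 tokens per update on 1 GPU
    # To match a hypothetical 4-GPU run with the same max-tokens and --update-freq 1
    # (2048 * 1 * 4 = 8192 tokens per update), a single-GPU run would need --update-freq 4.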

teslacool commented 4 years ago

Yes, I used --share-all-embeddings, and the vocabulary size is 10000 (see my paper for details).
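
For reference, --share-all-embeddings requires a joint source/target dictionary. A minimal preprocessing sketch, assuming fairseq's standard preprocess.py and the usual IWSLT14 BPE setup with 10k merge operations (the paths are placeholders, and whether SCA's data preparation matches this exactly is an assumption):

    # Build a single joint De/En dictionary so encoder, decoder, and output embeddings can be shared.
    # Assumes BPE has already been applied, as in fairseq's prepare-iwslt14.sh.
    python ~/fairseq/preprocess.py \
        --source-lang de --target-lang en \
        --trainpref iwslt14.tokenized.de-en/train \
        --validpref iwslt14.tokenized.de-en/valid \
        --testpref iwslt14.tokenized.de-en/test \
        --destdir data-bin/iwslt14.joined.de-en \
        --joined-dictionary
    # Then add --share-all-embeddings to the transformer_iwslt_de_en training command.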

I also noticed that you have changed the learning rate; I use the default arguments from examples/translation/README.

I just use one GPU for IWSLT and 4 GPUs for WMT.

I did not do checkpoint averaging.
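
With no averaging, BLEU would be computed from the single best checkpoint. A minimal evaluation sketch, assuming SCA keeps fairseq's generate.py and that a checkpoint trained with the lm_translation task loads without extra flags (both assumptions to verify against the SCA code):

    # Decode the test set with beam search; fairseq prints corpus BLEU at the end.
    python ~/SCA/generate.py $DATA_PATH \
        --path $SAVE_DIR/checkpoint_best.pt \
        --batch-size 128 \
        --beam 5 \
        --remove-bpe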

BaohaoLiao commented 4 years ago

> Yes, I used --share-all-embeddings, and the vocabulary size is 10000 (see my paper for details).
>
> I also noticed that you have changed the learning rate; I use the default arguments from examples/translation/README.
>
> I just use one GPU for IWSLT and 4 GPUs for WMT.
>
> I did not do checkpoint averaging.

I can reproduce the result now. Thank you very much.

1024er commented 4 years ago

> Yes, I used --share-all-embeddings, and the vocabulary size is 10000 (see my paper for details). I also noticed that you have changed the learning rate; I use the default arguments from examples/translation/README. I just use one GPU for IWSLT and 4 GPUs for WMT. I did not do checkpoint averaging.
>
> I can reproduce the result now. Thank you very much.

May I know how long it takes to train the LM on 1 GPU, and to train the NMT model on 4 GPUs? Thank you!