sacmehta / delight

DeLighT: Very Deep and Light-Weight Transformers
MIT License
466 stars 53 forks

Failed to reimplement the exps on iwslt'14 de-en #8

Open CheerM opened 3 years ago

CheerM commented 3 years ago

Hi, I ran into some issues reimplementing the models trained on IWSLT'14 De-En. The hyper-parameters of DeLighT (d_m=512) were set following https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md, like:

CUDA_VISIBLE_DEVICES=0 fairseq-train \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

Together with the following settings for the DeLighT blocks: --delight-enc-min-depth 3 --delight-enc-max-depth 9 --delight-enc-width-mult 1 --delight-dec-min-depth 3 --delight-dec-max-depth 9 --delight-dec-width-mult 1

However, the resulting model got 31.2 BLEU, much worse than the 35.3 reported in the paper. The parameter count did not match either: 33M in total instead of the expected 30M.

I guess the hyper-parameters need to be corrected. Has anyone else run into the same issue?

sacmehta commented 3 years ago

There could be many reasons. Try the following:

i) Dropout is too high. In DeLighT, we adjust it based on the model dimension inside the code, so please do not pass this argument. https://github.com/sacmehta/delight/blob/cc499c53087cd248ee7a0d0b0e70c507e670cba3/fairseq/models/delight_transformer.py#L1302

ii) The learning rate is too low. Transformers are unstable at higher LRs, but DeLighT is not, so try a higher learning rate (say 0.0009), as sketched below.
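For intuition, the inverse_sqrt schedule in your command reaches the peak LR at the end of warm-up and then decays it proportionally to 1/sqrt(step), so a low peak keeps the effective LR small for the whole run. A minimal sketch of that behavior (illustrative, not the exact fairseq code):

import math

def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_updates=4000, warmup_init_lr=1e-7):
    # Linear warm-up from warmup_init_lr to peak_lr, then 1/sqrt(step) decay.
    if step < warmup_updates:
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    return peak_lr * math.sqrt(warmup_updates / step)

for step in (4000, 10000, 25000, 50000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")

Raising the peak (e.g. from 5e-4 to 9e-4) scales the LR at every step after warm-up by the same factor.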

The number of parameters varies with the model dimension. What model dimension are you using? 384?

CheerM commented 3 years ago

Thank you for the reply.

With the standard Transformer block, the model (42M params) trained on the IWSLT'14 De-En dataset got 34.6 BLEU.

For the DeLighT models I tested, the number of parameters was 20M for d_m=384, 33M for d_m=512, and 50M for d_m=640. So all the experiments above were run with d_m=512.
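As a side note, a quick way to double-check the count from a saved checkpoint (a minimal sketch assuming the usual fairseq checkpoint layout with a 'model' state dict; the path is illustrative):

import torch

# Sum the sizes of all tensors in the checkpoint's 'model' state dict.
# Note: this also counts any small non-parameter buffers stored there.
ckpt = torch.load("checkpoints/checkpoint_best.pt", map_location="cpu")
num_params = sum(t.numel() for t in ckpt["model"].values())
print(f"{num_params / 1e6:.2f}M parameters")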

I retrained the model (d_m=512) with the following command:

CUDA_VISIBLE_DEVICES=0 python train.py \
    data-bin/iwslt14.tokenized.de-en \
    --arch transformer_iwslt_de_en --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 1 \
    --lr 0.0009 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 --min-lr 1e-9 \
    --max-tokens 4096 \
    --max-update 50000 \
    --delight-emb-map-dim 128 --delight-emb-out-dim 512 \
    --delight-enc-min-depth 3 --delight-enc-max-depth 9 --delight-enc-width-mult 1 \
    --delight-dec-min-depth 3 --delight-dec-max-depth 9 --delight-dec-width-mult 1

and got 31.6 BLEU, almost the same as the previous trial. I was wondering whether different settings are needed for it?

sacmehta commented 3 years ago

I do not have the settings handy for the IWSLT dataset. I will re-run the experiment and then update you. Meanwhile, could you try the WMT16 En-Ro dataset?

sacmehta commented 3 years ago

Are you using the correct architecture?

It should be --arch delight_transformer_iwslt_de_en instead of --arch transformer_iwslt_de_en.

CheerM commented 3 years ago

Yes, I set the architecture to --arch delight_transformer_iwslt_de_en for all experiments; it was a typo in my last comment. Sure, I'll give the WMT16 En-Ro dataset a try.

sacmehta commented 3 years ago

Could you also share your complete log file?

CheerM commented 3 years ago

Sorry, the log file couldn't be copied from the server I used.

I modified the script 'nmt_wmt16_en2ro.py' to train the model on IWSLT'14 De-En. TESTED_DIMS was set to [128, 256, 384, 512] and max_lr was removed from the final command, as mentioned above. The command looks like:

command = ['python train.py {} --arch delight_transformer_iwslt_de_en '
           '--share-decoder-input-output-embed '
           '--optimizer adam --adam-betas \'(0.9, 0.98)\' --clip-norm 0.0 '
           '--weight-decay 0.0 '
           '--criterion label_smoothed_cross_entropy --label-smoothing 0.1 '
           '--update-freq {} --keep-last-epochs 10 '
           '--max-tokens {} '
           '--max-update {} --warmup-updates {} '
           '--lr 0.0009 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-7 --min-lr 1e-9 '
           '--save-dir {} '
           '--distributed-world-size {} --distributed-port 50786 '
           '--delight-emb-map-dim 128 --delight-emb-out-dim {} '
           '--delight-enc-min-depth 3 --delight-enc-max-depth 9 --delight-enc-width-mult 1 '
           '--delight-dec-min-depth 3 --delight-dec-max-depth 9 --delight-dec-width-mult 1 '
           '| tee -a {}'.format(data_dir, update_freq, max_tokens, max_update, warmup_update,
                                results_dir, num_gpus, d_m, log_file)]

It was invoked with --d-m 512 --max-updates 50000 --warmup-updates 4000 --max-tokens 4096 --update-freq 1 --num-gpus 1.

The remaining parts are the same as the original settings. I tried setting the weight decay to 0 and to 0.0001, but it does not seem to affect the result.

sacmehta commented 3 years ago

I think the issue is with the learning rate. I do not have access to the commands that I used for the paper, so I ran an experiment to replicate it. If you adjust the learning rate, you should be able to replicate the results. I do not have access to machines right now, but as soon as I do, I will rerun it at a higher learning rate and see how it goes. If I remember correctly, I used LR=0.005 for a model dimension of 384. Note that at such high LRs, the standard Transformer is unstable.

Below is the command that I used for training

python train.py data-bin/iwslt14.tokenized.de-en --arch delight_transformer_iwslt_de_en \
    --no-progress-bar --source-lang de --target-lang en \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --update-freq 1 --keep-last-epochs 10 --ddp-backend=no_c10d \
    --max-tokens 4096 --max-update 50000 --warmup-updates 4000 \
    --lr-scheduler linear --warmup-init-lr 1e-7 --lr 0.001 --min-lr 1e-9 \
    --save-dir results_iwslt_384 \
    --delight-emb-map-dim 128 --delight-emb-out-dim 384 \
    --delight-enc-min-depth 3 --delight-enc-max-depth 9 --delight-enc-width-mult 1 \
    --delight-dec-min-depth 3 --delight-dec-max-depth 9 --delight-dec-width-mult 1 \
    --share-decoder-input-output-embed --log-interval 500

Command to average checkpoints:

 python scripts/average_checkpoints.py --inputs results_iwslt_384/ --num-epoch-checkpoints 5 --output results_iwslt_384/checkpoint_avg.pt
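Conceptually, the averaging step just takes the element-wise mean of the parameter tensors from the selected checkpoints. A rough sketch of the idea (not the fairseq script itself; paths are illustrative):

import torch

def average_model_states(paths, out_path):
    # Element-wise mean of the 'model' state dicts from several fairseq checkpoints.
    # Non-floating-point tensors (counters/buffers) are taken from the first checkpoint.
    first = torch.load(paths[0], map_location="cpu")
    avg = {k: v.clone().float() if v.is_floating_point() else v.clone()
           for k, v in first["model"].items()}
    for path in paths[1:]:
        model = torch.load(path, map_location="cpu")["model"]
        for k, v in model.items():
            if avg[k].is_floating_point():
                avg[k] += v.float()
    for k, v in avg.items():
        if v.is_floating_point():
            avg[k] = v / len(paths)
    first["model"] = avg
    torch.save(first, out_path)

# e.g. average the last 5 epoch checkpoints (epoch numbers here are just an example)
average_model_states([f"results_iwslt_384/checkpoint{e}.pt" for e in range(42, 47)],
                     "results_iwslt_384/checkpoint_avg.pt")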

Command to evaluate the model:

CUDA_VISIBLE_DEVICES=0 python -W ignore generate.py data-bin/iwslt14.tokenized.de-en --path results_iwslt_384/checkpoint_avg.pt --batch-size 128 --beam 5 --remove-bpe --lenpen 1 --gen-subset test --quiet

With the above settings, I got a BLEU score of 32.4 with 19.86 M parameters.

These are the training logs I got; they suggest that the model is under-fitting (the training and validation losses are close, and both are still fairly high), so a higher learning rate should improve the score further. You should be able to get around 33+ BLEU with 19M parameters. A small snippet for extracting the validation loss from such logs is sketched after the excerpt.

2021-04-01 17:35:39 | INFO | train | epoch 046 | loss 4.035 | nll_loss 2.565 | ppl 5.917 | wps 5789.9 | ups 1.63 | wpb 3541.6 | bsz 140.5 | num_updates 50000 | lr 1e-07 | gnorm 0.762 | clip 0 | oom 0 | train_wall 264 | wall 29146
2021-04-01 17:35:47 | INFO | valid | epoch 046 | valid on 'valid' subset | loss 4.122 | nll_loss 2.569 | ppl 5.934 | wps 21562.1 | wpb 2881 | bsz 117.5 | num_updates 50000 | best_loss 4.122
2021-04-01 17:35:49 | INFO | fairseq.checkpoint_utils | saved checkpoint results_iwslt_384/checkpoint_best.pt (epoch 46 @ 50000 updates, score 4.122) (writing took 1.512 seconds)
2021-04-01 17:35:49 | INFO | fairseq_cli.train | done training in 29155.0 seconds
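If it helps when comparing runs, here is a tiny illustrative snippet for pulling the validation loss out of log lines in the format above (the log path is just an example):

import re

# Match fairseq validation lines like:
# "... | valid | epoch 046 | valid on 'valid' subset | loss 4.122 | ..."
pattern = re.compile(r"\| valid \| epoch (\d+) \|.*?\| loss ([\d.]+)")

with open("results_iwslt_384/train.log") as f:
    for line in f:
        m = pattern.search(line)
        if m:
            print(f"epoch {m.group(1)}: valid loss {m.group(2)}")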
CheerM commented 3 years ago

I reran the experiments with the same arguments you used, setting the LR to 0.001 and 0.005, and retrained the models on IWSLT'14 De-En. The results below suggest that 0.005 may be too large for convergence.

Model              Params  BLEU  LR
delight (d_m=384)  20M     32.0  0.001
delight (d_m=384)  20M     15.3  0.005
delight (d_m=512)  33M     31.6  0.001
delight (d_m=512)  33M     0.7   0.005

What's more, the performance of the models on WMT16 En-Ro is as follows:

Model              Params  BLEU
Transformer        62M     33.9
delight (d_m=128)  7M      31.7
delight (d_m=256)  13M     33.7
delight (d_m=384)  22M     34.3
delight (d_m=512)  53M     34.4

sacmehta commented 3 years ago

Good to see that you are able to replicate the results on the WMT'16 En-Ro dataset.

https://github.com/sacmehta/delight/blob/master/readme_files/nmt/wmt16_en2ro.md#results

It seems that some setting is off for the IWSLT'14 De-En experiment, and I need to look into it. I do not have any GPUs right now; I will check it when I get access to the machines.

Meanwhile, I suggest you check the WMT'14 En-De and the WMT'14 En-Fr datasets.

Thanks for your interest in our work.