Reproducibility issue when training on a smaller dataset and fewer GPUs

freddy5566 commented 4 years ago

Hi:

I'd like to know how to replicate the result mentioned in the README: "The model reaches 20 BLEU on testing dataset, after training for only 2 epochs."

I simply used your setup to train my model; however, after 3 epochs, I got:

2020-06-03 17:49:03 | INFO | fairseq_cli.generate | Generate test with beam = 5: BLEU4 = 0.09, 7.5/0.7/0.0/0.0 (BP=1.000, ratio=1.996, syslen=289332, reflen=144951)

My generation script is:

fairseq-generate data-bin/wmt17_zh_en \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 128 --beam 5 --remove-bpe

The training data I used are:

training-parallel-nc-v12
United Nations Parallel-enzh

Thank you!

sanxing-chen commented 4 years ago

Your evaluation script looks legit to me, so this is weird. Could you provide more details, such as the training loss and perplexity (ppl) curves? They can be drawn with the script provided in the repo.

freddy5566 commented 4 years ago

Hi @STayinloves :

Here is the result of running the script you provided. Since I am not using Jupyter, I added plt.show() at the very end of the file.

[Image: Figure_1, the plot produced by the script]

I have also uploaded train.log.

Thank you again!

sanxing-chen commented 4 years ago

You might want to see if checkpoint_last.pt gives you different results.

freddy5566 commented 4 years ago

I got zero. Here is the result:

2020-06-04 00:07:57 | INFO | fairseq_cli.generate | Generate test with beam=5: BLEU4 = 0.00, 5.4/0.0/0.0/0.0 (BP=0.448, ratio=0.554, syslen=80370, reflen=144951)
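The length statistics in the two BLEU lines in this thread can be recomputed as a quick sanity check. The sketch below assumes the standard BLEU brevity penalty, exp(1 - reflen/syslen) when the system output is shorter than the reference and 1 otherwise; `length_stats` is a hypothetical helper, not part of fairseq.

```python
import math
import re

def length_stats(log_line):
    """Recompute the length ratio and brevity penalty from a BLEU log line.

    Assumes the line contains 'syslen=<int>' and 'reflen=<int>' fields,
    as in the fairseq-generate output quoted in this thread.
    """
    syslen = int(re.search(r"syslen=(\d+)", log_line).group(1))
    reflen = int(re.search(r"reflen=(\d+)", log_line).group(1))
    ratio = syslen / reflen
    # Standard BLEU brevity penalty: only penalise output shorter than the reference.
    bp = 1.0 if syslen >= reflen else math.exp(1.0 - reflen / syslen)
    return ratio, bp

best = "BLEU4 = 0.09, 7.5/0.7/0.0/0.0 (BP=1.000, ratio=1.996, syslen=289332, reflen=144951)"
last = "BLEU4 = 0.00, 5.4/0.0/0.0/0.0 (BP=0.448, ratio=0.554, syslen=80370, reflen=144951)"

for name, line in (("checkpoint_best", best), ("checkpoint_last", last)):
    ratio, bp = length_stats(line)
    print(f"{name}: ratio={ratio:.3f}, BP={bp:.3f}")
```

In both runs the output length is far from the reference length (about twice as long for checkpoint_best, about half as long for checkpoint_last), which suggests the model output itself is off rather than only the scoring.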

sanxing-chen commented 4 years ago

Your train.log says that you only have 15 examples in the validation set, which is probably wrong. I'm wondering whether the same mistake happened with the testing set.

freddy5566 commented 4 years ago

That's weird, since I downloaded them from WMT and made sure the files aren't wrong. Here is how I do the pre-processing:

1. Download them into ./dataset.
2. Put those files in test/valid/train just like you did (we use the same test/valid datasets).
3. Run prepare.sh.

2020-06-04 00:07:57 | INFO | fairseq_cli.generate | Translated 8037 sentences (88407 tokens) in 14.6s (551.45 sentences/s, 6065.99 tokens/s)

I think test examples are fine...

Thank you for your response

freddy5566 commented 4 years ago

Update: I re-ran the preprocessing and now get 1996 validation sentences instead of the 15 examples you mentioned above.

My preprocess.log:

Namespace(align_suffix=None, alignfile=None, all_gather_list_size=16384, bf16=False, bpe=None, checkpoint_suffix='', cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='data-bin/wmt17_zh_en', empty_cache_freq=0, fp16=False, fp16_init_scale=128, fp16_no_flatten_grads=False, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=100, lr_scheduler='fixed', memory_efficient_bf16=False, memory_efficient_fp16=False, min_loss_scale=0.0001, model_parallel_size=1, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, quantization_config_path=None, seed=1, source_lang='zh', srcdict=None, target_lang='en', task='translation', tensorboard_logdir='', testpref='dataset//test.32000.bpe', tgtdict=None, threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, tpu=False, trainpref='dataset//train.32000.bpe', user_dir=None, validpref='dataset//valid.32000.bpe', workers=12)
[zh] Dictionary: 36495 types
[zh] dataset//train.32000.bpe.zh: 222476 sents, 5624865 tokens, 0.0% replaced by <unk>
[zh] Dictionary: 36495 types
[zh] dataset//valid.32000.bpe.zh: 1996 sents, 58897 tokens, 0.278% replaced by <unk>
[zh] Dictionary: 36495 types
[zh] dataset//test.32000.bpe.zh: 2001 sents, 56962 tokens, 0.365% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//train.32000.bpe.en: 222476 sents, 6106080 tokens, 0.0% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//valid.32000.bpe.en: 1996 sents, 68078 tokens, 0.00881% replaced by <unk>
[en] Dictionary: 31183 types
[en] dataset//test.32000.bpe.en: 2001 sents, 63675 tokens, 0.00471% replaced by <unk>
Wrote preprocessed data to data-bin/wmt17_zh_en

This looks fine. However, after 1 epoch of training I still got 0.15 BLEU. Since there is a huge difference between 20 and 0.15, I'd like to know whether I did something wrong or whether I should just be patient and wait for the result.

I have uploaded the train.log here. Sorry for my lack of experience.

sanxing-chen commented 4 years ago

I would say just wait for one or two epochs to see. The model changes dramatically during the first few updates, especially under the warmup scheduler. You can check the loss as an indicator.

I worked on this repo a year ago, and I don't quite remember whether the result differs across runs or seeds. But I did notice that it nearly reaches its performance upper bound in the first few epochs.
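To illustrate why the early updates are so volatile under warmup, here is a minimal sketch of an inverse-square-root schedule with linear warmup. It assumes the behaviour of the `--lr-scheduler inverse_sqrt` option that appears in commands later in this thread; the default values (peak learning rate 0.001, 4000 warmup updates) are taken from one of those commands and are not necessarily what the repo itself uses.

```python
def inverse_sqrt_lr(step, peak_lr=1e-3, warmup_updates=4000, warmup_init_lr=0.0):
    """Learning rate at a given update for linear warmup + inverse-sqrt decay.

    All default values are illustrative assumptions, not the repo's settings.
    """
    if step < warmup_updates:
        # Linear ramp from warmup_init_lr up to peak_lr.
        return warmup_init_lr + (peak_lr - warmup_init_lr) * step / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return peak_lr * (warmup_updates / step) ** 0.5

for step in (0, 1000, 4000, 16000, 64000):
    print(step, inverse_sqrt_lr(step))
```

The learning rate rises steeply for the first few thousand updates and only then starts to decay, so BLEU measured very early in training is not a reliable indicator.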

There's nothing wrong with a lack of experience :)

freddy5566 commented 4 years ago

After 200,000 updates it is still 0.12, so I guess something went wrong. Maybe I'll use a smaller dataset and model for the experiment.

Still, thank you for your response.

sanxing-chen commented 4 years ago

You can try the interactive command to check some model output manually. A smaller dataset is also a good starting point.

freddy5566 commented 4 years ago

After changing to a smaller dataset (training-parallel-nc-v12.tgz), I still get the same result. I guess something went wrong in the pre-processing step, and I still cannot replicate the result. Is there anything I need to do before executing those scripts?

sanxing-chen commented 4 years ago

I just noticed a few facts that I was unaware of in our previous discussion.

The training script can be affected by the number of GPUs available, since it only limits --max-tokens per GPU. So more GPUs lead to a larger batch size in training. I used 6 GPUs previously, while you seem to use 1 GPU (the --update-freq setting can be helpful in this case). It's my fault that I didn't note this in the repo, sorry for that.

Unfortunately, I don't currently have the resources to train a model on the full dataset. However, based on my little experiment on training-parallel-nc-v12.tgz today (I downloaded and ran everything from scratch and will update the result later), I didn't find any other steps to add to the pre-processing. I found my old training log and will attach it here. train_wmt17_zh_en.log
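The batch-size arithmetic above can be sketched as follows. The per-GPU `max_tokens` value of 4096 is a placeholder chosen for illustration (the repo's actual setting is not stated in this thread), and both function names are hypothetical. The result is an upper bound on tokens per update, since --max-tokens is a cap rather than an exact batch size.

```python
import math

def tokens_per_update(max_tokens, num_gpus, update_freq=1):
    """Upper bound on tokens contributing to one optimizer step."""
    return max_tokens * num_gpus * update_freq

def update_freq_to_match(reference_gpus, available_gpus):
    """Gradient-accumulation factor needed to match a reference GPU count."""
    return math.ceil(reference_gpus / available_gpus)

MAX_TOKENS = 4096  # placeholder value, an assumption for illustration only

reference = tokens_per_update(MAX_TOKENS, num_gpus=6)
freq = update_freq_to_match(reference_gpus=6, available_gpus=1)
single_gpu = tokens_per_update(MAX_TOKENS, num_gpus=1, update_freq=freq)
print(freq, reference, single_gpu)
```

Under these assumptions, a single GPU needs --update-freq 6 to see roughly the same number of tokens per update as the original 6-GPU run.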

I hope this helps!

sanxing-chen commented 4 years ago

Update on my experiment yesterday: I tried to train the model on training-parallel-nc-v12.tgz only (~200k examples), using --update-freq to ensure a similar batch size, and it doesn't work. I observed the validation loss going up while the model could only output random fluent sentences. Then I switched to the full dataset (~20m examples), and after one epoch (2.5 hours on 4 RTX 2080 Ti) I got BLEU4=18.89 on the testing set. So I suspect this model configuration cannot easily be trained on a small dataset.

freddy5566 commented 4 years ago

It helps a lot!!

I've tried transformer_iwslt_de_en and other models, and it turns out they don't work either. So I guess the dataset used to train a transformer is quite important. Anyway, you really saved my day!

sanxing-chen commented 4 years ago

Adding to the discussion about different batch sizes: according to the results in Popel and Bojar, “Training Tips for the Transformer Model,” Figures 5 and 6, a small batch size can lead to training failure when training a big model.

freddy5566 commented 4 years ago

@STayinloves It helps a lot!! I'll try an even bigger batch size, and also thanks for your help

afaq-ahmad commented 3 years ago

@sanxing-chen Hi, could you please tell me where I can get the full dataset (~20m examples)? Thanks.

freddy5566 commented 3 years ago

Hi @afaq-ahmad :

After half a year of research and trial and error, I think that if you have ~20m examples, training a regular transformer is totally fine, and you can follow this example. If you want to train a low-resource MT model, flores is another cool project that you can start with.

afaq-ahmad commented 3 years ago

Thanks a lot. I have 24 million sentences, but when I train following the example here, it takes 12 hours for 1 epoch and the BLEU score increases by only 0.2 points. It looks like it will take 30 days of training to reach a BLEU score of around 20. Do you have any idea how I can speed up the procedure? I am using these parameters:

!CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/wmt17_en_zh \
    --arch transformer --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.001 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.2 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 8192 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --save-dir checkpoints/transformer

freddy5566 commented 3 years ago

You can use --fp16 and tune --max-tokens; we normally set --max-tokens to 4k or 3k. I also noticed that you didn't use --update-freq. Since you are training on one GPU, you need to set it to 4.

kkeleve commented 2 years ago

I only have 1.05m sentences. How should I adjust the batch size or other parameters to achieve good results? The following are my training parameters and BLEU values:

CUDA_VISIBLE_DEVICES=0 nohup fairseq-train ${data_dir}/data-bin \
    -a transformer --optimizer adam --source-lang ${src} --target-lang ${tgt} \
    --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
    --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --max-update 200000 \
    --warmup-updates 10000 --warmup-init-lr '1e-7' --lr '0.001' \
    --adam-betas '(0.9, 0.98)' --adam-eps '1e-09' --clip-norm 25.0 \
    --update-freq 4 --max-epoch 25 \
    --tensorboard-logdir ~/nmt/log/tensorboardlog_tc4 \
    --keep-last-epochs 2 --save-dir ${model_dir}/checkpoints_tc4 > ~/nmt/log/train_tc4.log 2>&1 &

kkeleve commented 2 years ago

BLEU = 21.13, 55.6/27.2/15.2/9.0 (BP=0.992, ratio=0.992, hyp_len=549536, ref_len=553932)

freddy5566 commented 2 years ago

Hi @sunyi1123,

You can play around with warmup-updates, label-smoothing, and dropout. You can also apply a technique called "back translation": first train an MT model in the reverse direction, then use this trained model to translate the target-side sentences back into the source language. This way, you will end up with 2x data.
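The data flow of back translation can be sketched as below. This is only an illustration of how the synthetic pairs are assembled: `reverse_translate` stands in for a trained target-to-source model and is a hypothetical callable, and the toy lambda in the demo is not a real translator.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)

def back_translate(targets: List[str],
                   reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Pair each real target sentence with a synthetic source sentence."""
    return [(reverse_translate(t), t) for t in targets]

def augment(parallel: List[Pair],
            reverse_translate: Callable[[str], str]) -> List[Pair]:
    """Original pairs plus back-translated pairs built from the target side."""
    targets = [tgt for _, tgt in parallel]
    return parallel + back_translate(targets, reverse_translate)

# Toy demo: upper-casing is a placeholder for a real reverse-direction model.
parallel = [("src one", "hello"), ("src two", "world")]
augmented = augment(parallel, reverse_translate=lambda s: s.upper())
print(len(parallel), len(augmented))
```

Back-translating the target side of the existing corpus doubles the number of training pairs; extra monolingual target-language text, if available, can be passed to `back_translate` in the same way.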

sanxing-chen / NMT2017-ZH-EN

Reproducibility issue when training on a smaller dataset and fewer GPUs #3