Thanks for your interest!
For the AT model, we train for around 100k steps. For the NAT model, generally the longer the better, because the mask-predict model converges more slowly; the score we report comes from the model trained for 500k steps. All runs use max-tokens=2048, update-freq=2, and 8 GPUs, and the training script is the same as for IWSLT14 De-En, except that we use bert-base-cased for English.
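In other words, the effective batch is about 32k tokens per parameter update; a quick sanity check of the arithmetic (any equivalent combination of max-tokens, update-freq, and GPU count should give a similar effective batch):
# tokens per parameter update = max-tokens x update-freq x number of GPUs
echo $((2048 * 2 * 8))   # 32768, i.e. roughly 32k tokens per update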
Thanks for the reply!
Do I need to set the arch to "transformer_nat_ymask_bert_two_adapter_wmt_en_de_bert_base"? Additionally, I tried your script on IWSLT and got good results there, but on WMT14 De-En I only get 29+ BLEU, which seems far from the paper.
Here is my script; could you tell me which setting is wrong? I train this model on 8 Titan V100 GPUs with max-tokens=8192.
python3 train.py $DATA_DIR \
--task bert_xymasked_wp_seq2seq -s de -t en \
-a transformer_nat_ymask_bert_two_adapter_wmt_en_de_bert_base \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr '1e-07' \
--lr 5e-4 --min-lr '1e-09' \
--criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --max-tokens 8192 --update-freq 1 --max-update 300000 \
--left-pad-source False --adapter-dimension 512 \
--use-adapter-bert --decoder-bert-model-name ${DISK}/bert-base-cased --bert-model-name ${DISK}/bert-base-german-cased \
--keep-last-epochs 5 --save-dir ${model_dir} --ddp-backend no_c10d --max-source-positions 500 --max-target-positions 500 --fp16
The settings seem correct; I think the problem is the training set. As stated in the "Inference and Evaluation" section of our paper, for WMT14 En-De/De-En translation we use a training set generated by sequence-level knowledge distillation. Mask-Predict does not perform well on the raw training data for this task.
In this issue, the authors of the DisCo paper have provided the distilled data they used; maybe you can give it a try.
You can also try a larger learning rate (such as 7e-4), which is usually beneficial for large batch sizes.
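In case it is useful, sequence-level distillation is typically done by decoding the full training set with a trained AT teacher and using the beam-search outputs as the new targets. A rough sketch with standard fairseq commands (here $AT_CKPT is a placeholder for the teacher checkpoint, and the exact flags may differ for our task):
# Decode the training set with the autoregressive teacher using beam search.
python generate.py $DATA_DIR --path $AT_CKPT \
--gen-subset train --beam 4 --max-tokens 4096 > train.gen.out
# fairseq prints hypotheses as "H-<id><tab><score><tab><text>"; restore the
# original order and keep only the text column as the distilled targets.
grep ^H train.gen.out | sed 's/^H-//' | sort -n | cut -f3 > train.distill.en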
Thanks for the reply.
The dataset I use is the distilled data from the DisCo paper, so this is quite confusing.
Hmm, how do you preprocess the dataset?
Basically, I use the tokenization.py file under the "bert" folder in your code, like this:
# "tokenizer" below is the WordPiece tokenizer class from bert/tokenization.py
text_file_path = disk_path + "valid.en"
save_file_path = disk_path + "valid.wordpiece.en"
vocab_file_path = disk_path + "en.dict"
text_file = open(text_file_path, 'r')
save_file = open(save_file_path, 'w')
tok = tokenizer(vocab_file_path, do_lower_case=False)
for line in text_file:
    line = line.strip()
    tok_line = " ".join(tok.tokenize(line))
    save_file.write(tok_line)
    save_file.write('\n')
save_file.close()
text_file.close()
Then I use preprocess.py to convert the files into fairseq binaries; it seems only about 0.1% of tokens in the test set are mapped to [unk].
Additionally, for the WMT data there are around 40 samples longer than 512 tokens (the max length in BERT), so I think we have to set --max-source-positions 500 --max-target-positions 500, which also differs from the IWSLT command.
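For reference, the binarization step looks roughly like this on my side (paths abbreviated; de.dict/en.dict are the fairseq-format dictionaries built from the BERT wordpiece vocabularies, and the exact flags may differ):
python preprocess.py --source-lang de --target-lang en \
--trainpref train.wordpiece --validpref valid.wordpiece --testpref test.wordpiece \
--srcdict de.dict --tgtdict en.dict \
--destdir data-bin/wmt14_de_en_wordpiece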
I see. I re-checked your command and found that we set --adapter-dimension to 2048 instead of 512 for the WMT tasks; maybe that's the issue.
Aha, thank you!
It would also be great if you could share your training commands for the WMT datasets.
Below is our script for WMT14 En-De translation. For De-En, just exchange bert-base-cased and bert-base-german-cased.
python $HOME/train.py $DATA_DIR \
--task bert_xymasked_wp_seq2seq \
-a transformer_nat_ymask_bert_two_adapter_wmt_en_de_bert_base \
--optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
--lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr '1e-07' \
--lr 0.0007 --min-lr '1e-09' \
--criterion label_smoothed_length_cross_entropy --label-smoothing 0.1 \
--weight-decay 0.0 --max-tokens 2048 --update-freq 2 \
--no-progress-bar --log-format json --log-interval 100 \
--max-update 500000 --save-dir $MODEL_DIR --left-pad-source False \
--use-adapter-bert --decoder-bert-model-name bert-base-cased --bert-model-name bert-base-german-cased
Hi Junliang,
I trained with this command for 100 epochs, but only got SacreBLEU = 24.5 (BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.wmt14/full+tok.13a+version.1.4.2 = 24.5) on WMT14 En-De, which is considerably lower than the original Mask-Predict.
Would you be able to share your checkpoints for the WMT tasks?
Thanks
Hi, I currently cannot access the old checkpoints. But for your convenience, I re-trained the model on the WMT14 De-En task and got a 33.39 BLEU score after 150k training steps. Please refer to this link for the checkpoint; training longer will give better results.
We use the average of the last 10 checkpoints and directly report the tokenized BLEU score printed by fairseq (Generate test with beam=4: BLEU = 33.39, 66.7/41.2/27.6/18.8 (BP=0.966, ratio=0.966, hyp_len=67656, ref_len=70005)), and the script is identical to that for IWSLT14 De-En. Please re-check your pipeline carefully.
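If it helps, the checkpoint averaging uses the standard fairseq utility; a sketch (paths are placeholders, and depending on how checkpoints are saved you may need --num-update-checkpoints instead):
# Average the last 10 saved checkpoints into a single model for evaluation.
python scripts/average_checkpoints.py \
--inputs $MODEL_DIR \
--num-epoch-checkpoints 10 \
--output $MODEL_DIR/checkpoint.avg10.pt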
I tried your checkpoint but only got this result:
| Generate test with beam=4: BLEU4 = 30.27, 65.7/39.8/26.2/17.6 (BP=0.914, ratio=0.918, syslen=67696, reflen=73759)
I also get some outputs like this:
T-2911 The throwaway society does not think
H-2911 0.0 The Disposable Society does Not Think
Maybe my data is wrong? Would you be willing to share your tokenized data together with the fairseq dictionaries?
Hi Junliang,
Thanks for your nice code.
Could you tell me how many epochs you trained for the WMT14 tasks, and would you be willing to share your training commands for these tasks?
Thank you.