microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Experiment setting for Multilingual pretraining and Supervised NMT #150

Closed renziver closed 4 years ago

renziver commented 4 years ago

Hi, I'm trying to set up multilingual LM pretraining followed by supervised NMT, but I'm having trouble following the sample code for pretraining and fine-tuning. I'm planning to build a single NMT model for the EN-TL and EN-CEB pairs.

I have prepared the following data according to the MASS-supNMT docs.

And I have the following command for pretraining:

fairseq-train $data_dir \
    --user-dir $user_dir --save-dir $save_dir \
    --task xmasked_seq2seq \
    --source-langs ceb,en,tl --target-langs ceb,en,tl \
    --langs ceb,en,tl \
    --arch xtransformer \
    --mass_steps ceb-ceb,en-en,tl-tl \
    --memt_steps en-ceb, en-tl \
    --optimizer adam --adam-betas '(0.9,0.98)' --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt --lr 0.00005 --min-lr 1e-09 \
    --criterion label_smoothed_cross_entropy \
    --max-tokens 4096 --max-update 100000 --max-epoch 10 \
    --dropout 0.1 --relu-dropout 0.1 --attention-dropout 0.1 \
    --share-decoder-input-output-embed \
    --valid-lang-pairs en-ceb, en-tl --word_mask 0.3 \
    --ddp-backend=no_c10d
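One detail in the command above that may be worth checking: the pair lists for `--memt_steps` and `--valid-lang-pairs` are written with a space after the comma. If that space was present in the command as it was actually run, the shell would pass `en-tl` as a separate argument instead of as part of the list. The sketch below uses Python's standard `shlex` module only to illustrate POSIX-style word splitting; whether this is related to the error reported in this thread is not confirmed here.

```python
import shlex

# The two pair lists as typed in the command, with a space after each comma.
typed = "--memt_steps en-ceb, en-tl --valid-lang-pairs en-ceb, en-tl"
typed_tokens = shlex.split(typed)

# The same lists written without the space.
fixed = "--memt_steps en-ceb,en-tl --valid-lang-pairs en-ceb,en-tl"
fixed_tokens = shlex.split(fixed)

# With the space, each flag receives only "en-ceb," and "en-tl" becomes
# a stray token of its own; without it, each flag receives one list value.
print(typed_tokens)
print(fixed_tokens)
```

With the space removed, each flag is followed by exactly one comma-separated value, which is the form the other list flags in the command (`--langs`, `--mass_steps`) already use.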

However, I keep getting this error:

raise FileNotFoundError('Not Found available {}-{} para dataset for ({}) lang'.format(split, key, src))
FileNotFoundError: Not Found available valid-ceb-en para dataset for (ceb) lang

I tried creating copies of the datasets named in the valid-ceb-en format, but the error still occurs.
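Since the message names the pair as `ceb-en` while the command specifies `en-ceb`, a quick way to debug may be to list which parallel files are present in the data directory. The following is a minimal sketch, not the loader's actual logic: it assumes the fairseq-style binarized naming `{split}.{pair}.{lang}.bin` / `.idx` and assumes the pair key is the two language codes sorted alphabetically, as the error message suggests. The helper names `expected_para_files` and `missing_para_files` are hypothetical, and the exact file pattern should be checked against the task code in the MASS-supNMT user directory.

```python
import os
import tempfile


def expected_para_files(pairs, splits=("train", "valid")):
    """List the file names assumed to be needed for each language pair.

    Assumption: files are named {split}.{key}.{lang}.{bin,idx}, where key is
    the pair with its two language codes sorted alphabetically.
    """
    names = []
    for pair in pairs:
        src, tgt = pair.split("-")
        key = "-".join(sorted([src, tgt]))
        for split in splits:
            for lang in (src, tgt):
                for ext in ("bin", "idx"):
                    names.append(f"{split}.{key}.{lang}.{ext}")
    return names


def missing_para_files(data_dir, pairs, splits=("train", "valid")):
    """Return the expected file names that do not exist in data_dir."""
    return [
        name
        for name in expected_para_files(pairs, splits)
        if not os.path.exists(os.path.join(data_dir, name))
    ]


# Demo on a throwaway directory holding only the ceb side of the valid split.
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("valid.ceb-en.ceb.bin", "valid.ceb-en.ceb.idx"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_para_files(demo_dir, ["en-ceb"], splits=("valid",))
    print(demo_missing)
```

Pointing `missing_para_files` at the real `$data_dir` with the pairs `["en-ceb", "en-tl"]` would show which files this assumed naming scheme expects but cannot find.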

I hope someone can help me set up an experiment for a multilingual setting.

MichaelCaohn commented 3 years ago

Hi, I am wondering how you solved this issue.

I just ran into the same problem.