microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Confusion regarding data #164

Open kr-sundaram opened 4 years ago

kr-sundaram commented 4 years ago

Thank you very much for making the repository public!

I am confused about the train and valid files in the mono and para directories used for model pre-training and fine-tuning on the NMT task.

As stated in the README, I understand that dict.en.txt and dict.zh.txt should be exactly the same in both the mono and para directories, and that the para directory should contain the bilingual data used for fine-tuning. My confusion is about the mono directory: how many examples should the train and valid files contain for each language?

The number of sentences, and the sentences themselves, can differ between the two languages in the mono directory, right? I mean, it should not matter if one uses, let's say, 100 sentences for en and 200 sentences for zh, since they are just collections of monolingual data.

The only point to note is that the mono and para directories should share the same dictionary files, right?

StillKeepTry commented 4 years ago

Yes. The number of sentences in the monolingual data does not matter, so it can differ between languages. And each language uses the same dictionary in both the mono and para directories.
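
For anyone who wants to sanity-check a data layout against the answer above, here is a minimal sketch. It is not part of the MASS repository. The helper names `dictionaries_match` and `count_lines` are hypothetical, and the only file names it relies on are `dict.en.txt` and `dict.zh.txt` from the question. It checks that each language's dictionary is byte-identical in the two directories and counts sentences in a file, which only shows that the monolingual counts are free to differ.

```python
import filecmp
from pathlib import Path


def dictionaries_match(mono_dir, para_dir, langs=("en", "zh")):
    """Return {lang: bool}; True when dict.<lang>.txt is byte-identical in both directories.

    Hypothetical helper, assuming the dict.<lang>.txt naming from the question.
    """
    result = {}
    for lang in langs:
        mono_dict = Path(mono_dir) / f"dict.{lang}.txt"
        para_dict = Path(para_dir) / f"dict.{lang}.txt"
        # shallow=False forces a content comparison instead of trusting os.stat()
        result[lang] = filecmp.cmp(mono_dict, para_dict, shallow=False)
    return result


def count_lines(path):
    """Count lines in a text file, assuming one sentence per line."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)
```

For example, `dictionaries_match("data/mono", "data/para")` should return `True` for both languages if the dictionaries are shared, where the `data/...` paths are placeholders for your own layout. Calling `count_lines` on each monolingual training file may well give different numbers per language, and per the answer above that is fine.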