Confusion regarding data

Thank you very much for making the repository public!

I have a question about the train and valid files in the mono and para directories, which are used for pre-training and fine-tuning on the NMT task.

As stated in the README, I understand that dict.en.txt and dict.zh.txt should be exactly the same in both the mono and para directories, and that the para directory should contain the bilingual data used to fine-tune the model. My confusion is about the mono directory, specifically how many examples the train and valid files should contain for each language.

The number of sentences, and the sentences themselves, can differ between the two languages in the mono directory, right? I mean, it should not matter if one uses, let's say, 100 sentences for en and 200 sentences for zh, since they are just monolingual data.

So the only requirement is that the mono and para directories share the same dictionary files, right?
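To make the question concrete, here is a minimal sketch of the check I have in mind. It only compares the dictionary files named above and does not look at sentence counts at all. The function name `dictionaries_match` and the way the directories are passed in are my own assumptions for illustration; this is not code from the repository.

```python
import filecmp
from pathlib import Path


def dictionaries_match(mono_dir, para_dir, langs=("en", "zh")):
    """Map each language to True if dict.<lang>.txt is byte-identical
    in both directories (hypothetical helper, not part of MASS)."""
    result = {}
    for lang in langs:
        name = f"dict.{lang}.txt"
        mono_path = Path(mono_dir) / name
        para_path = Path(para_dir) / name
        # shallow=False forces a content comparison rather than a
        # comparison of file size and modification time only.
        result[lang] = (
            mono_path.is_file()
            and para_path.is_file()
            and filecmp.cmp(mono_path, para_path, shallow=False)
        )
    return result
```

For example, `dictionaries_match("mono", "para")` would return one boolean per language, assuming the two directories are named `mono` and `para` relative to the working directory.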

microsoft / MASS

Confusion regarding data #164