microsoft / MASS

MASS: Masked Sequence to Sequence Pre-training for Language Generation
https://arxiv.org/pdf/1905.02450.pdf

Confusion about the amount of monolingual data used in the experiments #160

Open cbaziotis opened 4 years ago

cbaziotis commented 4 years ago

Hi,

I would like to ask how much monolingual data was used in each experiment.

  1. In the paper, as well as in this issue, you mention that you use...

    ... all of the monolingual data from WMT News Crawl datasets, which covers 190M, 62M and 270M sentences from the year 2007 to 2017 for English, French, German respectively.

  2. In these issues: https://github.com/microsoft/MASS/issues/32#issuecomment-549147762 and https://github.com/microsoft/MASS/issues/62#issuecomment-589016149, you mention that you use a subsample (50 million sentences) of the full data.

  3. In get-data-nmt.sh, I see that you have commented out the download links for the News Crawl data from many years for each language.
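For reference, I would expect a 50M-sentence subsample to be produced with something like the sketch below. The file names and the exact procedure are my assumptions, not taken from the repo; I am using a tiny dummy corpus so the commands are self-contained:

```shell
# Hypothetical sketch of subsampling monolingual data, as described
# in the linked issues (50M sentences drawn from News Crawl).
# File names and sizes here are made up for illustration.
seq 1 1000 | sed 's/^/sentence /' > all.en   # stand-in for concatenated per-year News Crawl files
N=100                                        # would be 50000000 for the real corpus
shuf all.en | head -n "$N" > train.sub.en    # random subsample of N sentences
wc -l < train.sub.en                         # prints 100
```

Is this roughly what was done, or was the subsample selected differently (e.g. by year or by deduplication first)?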

I may have missed something or misread the issues, but I am confused about how much data you actually used. I would appreciate it if you could help clear this up.

Thanks!

nxphi47 commented 3 years ago

Same issue. I am unable to tell which data was used to reproduce the experiments. Could you please specify exactly how you created the data for pre-training and fine-tuning, for en-fr, de-en, and en-ro? Thank you. @StillKeepTry