google-research / multilingual-t5

Apache License 2.0
1.25k stars, 129 forks

Pre-training sample dataset for mT5 #89

Open kaushal0494 opened 3 years ago

kaushal0494 commented 3 years ago

Hi, thank you for the great work. I am curious what the pre-training samples look like across different languages. If possible, please provide a sample dataset. If you could also point me to the pre-processing (for pre-training) and pre-training scripts, it would be a great help.

StephennFernandes commented 2 years ago

Hey there, were you able to find the pre-processing code that samples multilingual datasets for mT5?