masakhane-io / masakhane-mt

Machine Translation for Africa
MIT License

Update notebooks to no longer rely on JW300 #199

Open cdleong opened 3 years ago

cdleong commented 3 years ago

Edit: see #200. Maybe we should leave the old JW300 notebooks up and create new ones instead.

The problem

JW300 has been taken down for copyright reasons. At least the following notebooks all rely on it:

https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_from_English_training.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_gdrive_from_English.ipynb
https://github.com/masakhane-io/masakhane-mt/blob/master/starter_notebook_into_English_training.ipynb

A solution (but see #200)

These notebooks need to be updated so they no longer depend on JW300. Perhaps we could use Tatoeba or FLORES-101, or one of the other machine translation datasets listed at https://huggingface.co/datasets?task_ids=task_ids:machine-translation&sort=downloads
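As a rough sketch of what the replacement data step could look like: many machine translation datasets on the Hugging Face hub expose each example as a record with a `translation` dict keyed by language code. Assuming that layout, something like the helper below could write the line-aligned source and target text files a training notebook would read. The function name `write_parallel_files`, the `train` prefix, and the output file naming are all hypothetical and not taken from the existing notebooks; the loading call in the trailing comment is also an assumption and would need to be checked against whichever dataset is chosen.

```python
from pathlib import Path
from typing import Dict, Iterable, Tuple


def write_parallel_files(
    records: Iterable[Dict[str, Dict[str, str]]],
    src_lang: str,
    tgt_lang: str,
    out_dir: str,
    prefix: str = "train",  # hypothetical file prefix, not from the notebooks
) -> Tuple[Path, Path, int]:
    """Write line-aligned parallel text files from translation records.

    Assumes each record looks like {"translation": {"en": "...", "xx": "..."}}.
    Returns the two file paths and the number of sentence pairs written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    src_path = out / f"{prefix}.{src_lang}"
    tgt_path = out / f"{prefix}.{tgt_lang}"

    kept = 0
    with src_path.open("w", encoding="utf-8") as src_f, \
            tgt_path.open("w", encoding="utf-8") as tgt_f:
        for record in records:
            pair = record.get("translation") or {}
            # Collapse newlines and repeated whitespace so each sentence
            # occupies exactly one line and the two files stay aligned.
            src = " ".join((pair.get(src_lang) or "").split())
            tgt = " ".join((pair.get(tgt_lang) or "").split())
            # Skip pairs where either side is missing or empty.
            if not src or not tgt:
                continue
            src_f.write(src + "\n")
            tgt_f.write(tgt + "\n")
            kept += 1
    return src_path, tgt_path, kept


# Hypothetical usage with the Hugging Face `datasets` library (not run here;
# the dataset name and its arguments are placeholders to be filled in):
#
#   from datasets import load_dataset
#   ds = load_dataset("<dataset-name>", split="train")
#   write_parallel_files(ds, "en", "<target-lang>", "data")
```

Keeping the writer independent of any particular dataset means the same step would work whether the notebooks end up on Tatoeba, FLORES-101, or something else, as long as the records can be mapped into that `translation` shape.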

cdleong commented 3 years ago

Steps that need to be done:

cdleong commented 3 years ago

So, for example, this section breaks because JW300 is no longer downloadable: [image]