huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

New Datasets: IWSLT15+, ITTB #438

Open sshleifer opened 4 years ago

sshleifer commented 4 years ago

Links:

iwslt (I don't know if that link is up to date.)

ittb

Motivation: replicate mbart finetuning results (table below).

[image: table of mbart finetuning results]

For future readers, we already have the following language pairs in the wmt namespaces:

wmt14: ['cs-en', 'de-en', 'fr-en', 'hi-en', 'ru-en']
wmt15: ['cs-en', 'de-en', 'fi-en', 'fr-en', 'ru-en']
wmt16: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
wmt17: ['cs-en', 'de-en', 'fi-en', 'lv-en', 'ru-en', 'tr-en', 'zh-en']
wmt18: ['cs-en', 'de-en', 'et-en', 'fi-en', 'kk-en', 'ru-en', 'tr-en', 'zh-en']
wmt19: ['cs-en', 'de-en', 'fi-en', 'gu-en', 'kk-en', 'lt-en', 'ru-en', 'zh-en', 'fr-de']
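
To check quickly whether a pair is already covered, the list above can be put in a dict and queried. This is only a sketch: `WMT_PAIRS`, `namespaces_with_pair`, and `all_pairs` are hypothetical names made up for illustration and are not part of the library. The data is copied verbatim from the list above.

```python
# Language pairs per wmt namespace, copied from the list above.
# WMT_PAIRS and the helpers below are illustrative names, not library API.
WMT_PAIRS = {
    "wmt14": ["cs-en", "de-en", "fr-en", "hi-en", "ru-en"],
    "wmt15": ["cs-en", "de-en", "fi-en", "fr-en", "ru-en"],
    "wmt16": ["cs-en", "de-en", "fi-en", "ro-en", "ru-en", "tr-en"],
    "wmt17": ["cs-en", "de-en", "fi-en", "lv-en", "ru-en", "tr-en", "zh-en"],
    "wmt18": ["cs-en", "de-en", "et-en", "fi-en", "kk-en", "ru-en", "tr-en", "zh-en"],
    "wmt19": ["cs-en", "de-en", "fi-en", "gu-en", "kk-en", "lt-en", "ru-en", "zh-en", "fr-de"],
}


def namespaces_with_pair(pair):
    """Return the wmt namespaces that list `pair`, in the order given above."""
    return [name for name, pairs in WMT_PAIRS.items() if pair in pairs]


def all_pairs():
    """Return every distinct language pair across the wmt namespaces, sorted."""
    return sorted({pair for pairs in WMT_PAIRS.values() for pair in pairs})


if __name__ == "__main__":
    print(namespaces_with_pair("ro-en"))  # ['wmt16']
    print(namespaces_with_pair("hi-en"))  # ['wmt14']
    print(len(all_pairs()))               # 15
```

With the library installed, a namespace and pair found this way could then be loaded with something like `load_dataset("wmt16", "ro-en")`. That call downloads data, so it is left out of the sketch.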

thomwolf commented 4 years ago

Thanks Sam, we now have a very detailed tutorial and template on how to add a new dataset to the library. It typically takes 1-2 hours to add one. Do you want to give it a try?

The tutorial on writing a new dataset loading script is here: https://huggingface.co/nlp/add_dataset.html

And the part on how to share a new dataset is here: https://huggingface.co/nlp/share_dataset.html

mariamabarham commented 4 years ago

Hi @sshleifer, I'm trying to add IWSLT using the link you provided, but the download URLs are not working. Only the [en, de] pair works. For the other language pairs, it throws a 404 error.