[GEM] add WikiLingua cross-lingual abstractive summarization dataset

huggingface / datasets

[GEM] add WikiLingua cross-lingual abstractive summarization dataset #834

Closed: yjernite closed this issue 3 years ago

yjernite commented 3 years ago

Adding a Dataset

Name: WikiLingua
Description: The dataset includes ~770k article and summary pairs in 18 languages from WikiHow. The gold-standard article-summary alignments across languages were extracted by aligning the images that are used to describe each how-to step in an article.
Paper: https://arxiv.org/pdf/2010.03093.pdf
Data: https://github.com/esdurmus/Wikilingua
Motivation: Included in the GEM shared task. Multilingual.
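To make the image-based alignment in the description concrete, here is a minimal Python sketch. It is not the WikiLingua authors' pipeline: the `Step` record, the image ids, and the toy French/English steps are all hypothetical, and it assumes each how-to step carries exactly one image shared across language versions.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    """Hypothetical record for one how-to step in one language."""
    image_id: str   # id of the image illustrating the step (assumed shared across languages)
    summary: str    # one-sentence summary of the step
    text: str       # full paragraph describing the step


def align_by_image(src: List[Step], tgt: List[Step]) -> List[Tuple[Step, Step]]:
    """Pair source and target steps that are illustrated by the same image."""
    tgt_by_image = {step.image_id: step for step in tgt}
    return [(s, tgt_by_image[s.image_id]) for s in src if s.image_id in tgt_by_image]


def cross_lingual_pair(src: List[Step], tgt: List[Step]) -> Tuple[str, str]:
    """Build (source-language article, target-language summary) from aligned steps."""
    aligned = align_by_image(src, tgt)
    article = " ".join(s.text for s, _ in aligned)
    summary = " ".join(t.summary for _, t in aligned)
    return article, summary


# Toy data, invented for illustration only.
french_steps = [
    Step("img_001", "Faites bouillir l'eau.", "Remplissez une casserole et portez l'eau à ébullition."),
    Step("img_002", "Ajoutez les pâtes.", "Versez les pâtes dans l'eau bouillante."),
    Step("img_099", "Servez.", "Cette étape n'a pas d'équivalent anglais."),
]
english_steps = [
    Step("img_001", "Boil the water.", "Fill a pot and bring the water to a boil."),
    Step("img_002", "Add the pasta.", "Pour the pasta into the boiling water."),
]

article, summary = cross_lingual_pair(french_steps, english_steps)
print(summary)
```

Steps whose image has no counterpart in the other language (here `img_099`) are simply dropped, which is one plausible way to keep only confidently aligned pairs.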

Instructions to add a new dataset can be found here.

KMFODA commented 3 years ago

Hey @yjernite. This is a very interesting dataset. I would love to work on adding it, but I see that the data link points to a gdrive folder. Can I just confirm whether dlmanager can handle gdrive URLs, or would this have to be a manual download?

yjernite commented 3 years ago

Hi @KMFODA! A version of WikiLingua is actually already accessible in the GEM dataset.

For example, you can load the French-to-English summarization data with:

from datasets import load_dataset
wikilingua = load_dataset("gem", "wiki_lingua_french_fr")

Closed by https://github.com/huggingface/datasets/pull/1807