huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Add Belebele multiple-choice machine reading comprehension (MRC) dataset #6284

Closed rajveer43 closed 1 year ago

rajveer43 commented 1 year ago

Feature request

Belebele is a multiple-choice machine reading comprehension (MRC) dataset spanning 122 language variants. This dataset enables the evaluation of mono- and multi-lingual models in high-, medium-, and low-resource languages. Each question has four multiple-choice answers and is linked to a short passage from the FLORES-200 dataset. The human annotation procedure was carefully curated to create questions that discriminate between different levels of generalizable language comprehension and is reinforced by extensive quality checks. While all questions directly relate to the passage, the English dataset on its own proves difficult enough to challenge state-of-the-art language models. Being fully parallel, this dataset enables direct comparison of model performance across all languages. Belebele opens up new avenues for evaluating and analyzing the multilingual abilities of language models and NLP systems.

For more details, please refer to the paper: The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants.

Composition

Motivation

Official repo: https://github.com/facebookresearch/belebele

Your contribution

-

mariosasko commented 1 year ago

This dataset is already available on the Hub: https://huggingface.co/datasets/facebook/belebele.
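
For anyone landing here, a minimal sketch of how the Hub copy might be loaded and turned into a multiple-choice prompt. Only the repository id `facebook/belebele` comes from this thread. The config name `eng_Latn`, the split name `test`, and the column names (`flores_passage`, `question`, `mc_answer1` to `mc_answer4`) are assumptions that should be checked against the dataset card, and both helper functions are hypothetical names introduced for illustration.

```python
from typing import Mapping


def load_belebele(config_name: str = "eng_Latn", split: str = "test"):
    """Load one language variant from the Hub (requires network access).

    ASSUMPTION: the config and split names are not confirmed in this
    thread; check the dataset card for the values that actually exist.
    """
    # Imported lazily so the formatter below works without `datasets`.
    from datasets import load_dataset

    return load_dataset("facebook/belebele", config_name, split=split)


def format_question(row: Mapping[str, str]) -> str:
    """Render one record as a four-option multiple-choice prompt.

    ASSUMPTION: the column names used here are not confirmed in this
    thread; adjust them to match the dataset card.
    """
    lines = [
        "Passage: " + row["flores_passage"],
        "Question: " + row["question"],
    ]
    # Each question has four candidate answers, numbered 1 to 4.
    for i in range(1, 5):
        lines.append(str(i) + ". " + row["mc_answer" + str(i)])
    lines.append("Answer:")
    return "\n".join(lines)
```

Because the dataset is fully parallel, the same formatter could be applied to every language variant, with only the (assumed) config name changing between calls to `load_belebele`.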