malteos opened 1 year ago
List of cross-lingual tasks and evaluation benchmarks:
XTREME: The Cross-lingual Transfer Evaluation of Multilingual Encoders (XTREME) benchmark evaluates the cross-lingual generalization ability of pre-trained multilingual models. URL: https://huggingface.co/datasets/xtreme
XQuAD: Benchmark dataset for evaluating cross-lingual question-answering performance. URL: https://huggingface.co/datasets/xquad
TyDiQA: Question-answering dataset covering 11 typologically diverse languages with 204K question-answer pairs. URL: https://huggingface.co/datasets/tydiqa
XNLI: Subset of a few thousand examples from MNLI that has been translated into 14 additional languages (some of them low-resource). URL: https://huggingface.co/datasets/xnli
Wiki-Lingua Cross-Lingual Summarization: Multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. URL: https://huggingface.co/datasets/GEM/wiki_lingua
MLQA (MultiLingual Question Answering): Benchmark dataset for evaluating cross-lingual question-answering performance. URL: https://huggingface.co/datasets/mlqa
PAWS-X: Cross-lingual Adversarial Dataset for Paraphrase Identification. This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages. URL: https://huggingface.co/datasets/paws-x
OPUS-100: English-centric parallel corpus, meaning that all training pairs include English on either the source or target side. The corpus covers 100 languages (including English), selected based on the volume of parallel data available in OPUS. URL: https://huggingface.co/datasets/opus100
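For bookkeeping, the benchmarks above can be collected in a small registry keyed by Hugging Face dataset ID. A minimal sketch — the names and IDs come from the list above, but the task labels are my own rough categorization, not official metadata:

```python
# Registry of the cross-lingual benchmarks listed above.
# hf_id values are the Hugging Face dataset IDs from the URLs;
# the "task" labels are an assumed, informal categorization.
BENCHMARKS = {
    "xtreme":      {"hf_id": "xtreme",          "task": "multi-task suite"},
    "xquad":       {"hf_id": "xquad",           "task": "question answering"},
    "tydiqa":      {"hf_id": "tydiqa",          "task": "question answering"},
    "xnli":        {"hf_id": "xnli",            "task": "natural language inference"},
    "wiki_lingua": {"hf_id": "GEM/wiki_lingua", "task": "summarization"},
    "mlqa":        {"hf_id": "mlqa",            "task": "question answering"},
    "paws-x":      {"hf_id": "paws-x",          "task": "paraphrase identification"},
    "opus100":     {"hf_id": "opus100",         "task": "machine translation"},
}

def by_task(task: str) -> list[str]:
    """Return the names of all registered benchmarks with the given task label."""
    return sorted(name for name, meta in BENCHMARKS.items()
                  if meta["task"] == task)
```

Each `hf_id` could then be passed to `datasets.load_dataset` (together with a language/config name where the dataset requires one) to pull the actual evaluation data.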
List of Monolingual Benchmarks:
German
French
Slavic and Other Languages
Thanks for the list. Can you also look for monolingual benchmarks in the respective languages?
TODO Languages:
Top languages with at least three tasks per language:
Other languages with at least one task per language:
Multilingual
Done: German, English
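The "at least N tasks per language" grouping above could be computed automatically from a benchmark-to-languages mapping. A minimal sketch with hypothetical coverage data — the real per-benchmark language lists would have to come from the dataset cards of the benchmarks listed earlier:

```python
from collections import Counter

# Hypothetical benchmark -> covered-languages mapping, for illustration only;
# actual language lists would be read from the Hugging Face dataset cards.
COVERAGE = {
    "xnli":   ["de", "fr", "ru", "tr"],
    "xquad":  ["de", "ru", "tr"],
    "paws-x": ["de", "fr"],
    "mlqa":   ["de"],
}

def languages_with_min_tasks(coverage: dict, min_tasks: int) -> list[str]:
    """Languages covered by at least `min_tasks` of the given benchmarks."""
    counts = Counter(lang for langs in coverage.values() for lang in langs)
    return sorted(lang for lang, n in counts.items() if n >= min_tasks)
```

With the sample data, `languages_with_min_tasks(COVERAGE, 3)` yields only `["de"]`, matching the "top languages with at least three tasks" criterion, while a threshold of 1 returns every covered language.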