huggingface / hub-docs

Docs of the Hugging Face Hub
http://hf.co/docs/hub
Apache License 2.0

[hacktoberfest] Dataset `languages` challenge #986

Closed: Wauplin closed this issue 9 months ago

Wauplin commented 11 months ago

Issue to keep track of the "Dataset languages challenge" for Hacktoberfest 2023.

Context

The Hugging Face Hub hosts hundreds of thousands of public models and datasets, covering a wide range of languages. One of the main ways to tell what language a dataset is in is the language field in the metadata section of its dataset card:

language: 
- "List of ISO 639-1 code for your language"
- lang1
pretty_name: "Pretty Name of the Dataset"
tags:
- tag1
- tag2
license: "any valid license identifier"
task_categories:
- task1

Having this field filled in is essential for users to find datasets in their language, and it gives a better idea of the languages that the Hub covers. However, dataset authors do not always fill in this field. This challenge is to fill in the language field for datasets that are missing it.
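
As a rough illustration of what "filling in the language field" means for a single dataset card, here is a minimal Python sketch. It is not the challenge's official tooling: the function names `has_language_field` and `add_language_field` are hypothetical, and the front matter is handled with plain string matching from the standard library rather than a real YAML parser, so it assumes a simple card whose metadata block is delimited by `---` lines.

```python
import re

# Matches a metadata block at the very top of a card: an opening "---" line,
# the metadata body, and a closing "---" line. (Assumption: simple cards only.)
FRONT_MATTER = re.compile(
    r"\A---[ \t]*\n(.*?)^---[ \t]*$", re.DOTALL | re.MULTILINE
)


def has_language_field(card_text: str) -> bool:
    """Return True if the card metadata has a top-level `language:` key."""
    match = FRONT_MATTER.match(card_text)
    if match is None:
        return False  # no metadata block at all
    # Only unindented lines count as top-level keys.
    return any(
        re.match(r"language\s*:", line)
        for line in match.group(1).splitlines()
    )


def add_language_field(card_text: str, codes: list[str]) -> str:
    """Return the card text with a `language:` list added if it is missing."""
    if has_language_field(card_text):
        return card_text  # never overwrite what the author already set
    block = "language:\n" + "".join(f"- {code}\n" for code in codes)
    if FRONT_MATTER.match(card_text) is None:
        # No metadata block yet: create one in front of the card body.
        return f"---\n{block}---\n{card_text}"
    # Insert the new key right after the opening "---" line.
    head_end = card_text.index("\n") + 1
    return card_text[:head_end] + block + card_text[head_end:]


card = "---\nlicense: mit\n---\n# My dataset\n"
updated = add_language_field(card, ["en", "fr"])
print(updated)
```

In practice the change is submitted as a pull request on the dataset repository on the Hub, and a proper YAML parser would be more robust than this string-based check.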

Instructions

Check out the detailed instructions here.

Feel free to ping @davanstrien or @Wauplin with any questions or for a review.

stefan-it commented 11 months ago

Hi guys!

I opened some PRs for this; some have already been merged. Is this issue the right place to track updates for these PRs? 🤔

status pr_url hub_id downloads likes
Opened here Photolens/DISC-Med-SFT-en-translated-only-CMeKG 0 1
Opened here manu/europarl-en-fr 0 0
Opened here buddhist-nlp/buddhist-zh-en-with-gpt 0 0
Opened here neil-code/subset-data-en-zh 0 0
Opened here dipteshkanojia/t5-qe-2023-ente-da-sys-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enta-da-sys-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enmr-da-sys-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enhi-da-sys-test 0 0
Opened here dipteshkanojia/t5-qe-2023-engu-da-sys-test 0 0
Opened here dipteshkanojia/t5-qe-2023-ente-da-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enmr-da-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enta-da-test 0 0
Opened here dipteshkanojia/t5-qe-2023-enhi-da-test 0 0
Opened here dipteshkanojia/t5-qe-2023-engu-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-ente-da-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enta-da-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enmr-da-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enhi-da-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-engu-da-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-ente-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enta-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enmr-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enhi-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-engu-da-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enta-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-ente-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enmr-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enhi-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-engu-sys-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-ente-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enta-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enmr-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-enhi-test 0 0
Opened here dipteshkanojia/llama-2-qe-2023-engu-test 0 0
Opened here ahazeemi/opus-it-en-de-new 0 0
Opened here aimona/stripchat-fixed-grammar-eng 0 0
Opened here phi0108/demo-noun-phrase-en 0 0
Merged here ChanceFocus/flare-multifin-en 0 0
Merged here kaleinaNyan/wmt19_ru-en 0 0
Opened here VFiona/covid-19-synthetic-it-en-5000 0 0
Opened here ahazeemi/opus-law-en-de-new 0 0
Opened here VFiona/covid-19-synthetic-it-en-10000 0 0
Merged here flozi00/oasst1-en-to-de 0 0
Opened here pvduy/oasst-h4-en 0 0
Opened here yezhengli9/wmt20-en-ta 0 0
Opened here yezhengli9/wmt20-cs-en 0 0
Opened here yezhengli9/wmt20-en-cs 0 0
Opened here yezhengli9/wmt20-iu-en 0 0
Opened here yezhengli9/wmt20-en-ru 0 0
Opened here yezhengli9/wmt20-en-ps 0 0
Opened here yezhengli9/wmt20-ta-en 0 0
Opened here yezhengli9/wmt20-pl-en 0 0
Opened here yezhengli9/wmt20-en-zh 0 0
Opened here yezhengli9/wmt20-ps-en 0 0
Opened here yezhengli9/wmt20-en-pl 0 0
Opened here yezhengli9/wmt20-ru-en 0 0
Opened here yezhengli9/wmt20-en-iu 0 0
Opened here yezhengli9/wmt20-ja-en 0 0
Opened here yezhengli9/wmt20-en-ja 0 0
Opened here yezhengli9/wmt20-en-km 0 0
Opened here yezhengli9/wmt20-en-de 0 0
Opened here yezhengli9/wmt20-de-en 0 0
Opened here alvations/globalvoices-de-en 0 0
Opened here alvations/aymara-english 0 0
Opened here shreevigneshs/iwslt-2023-en-ru-train-val-split-0.2 0 0
Opened here shreevigneshs/iwslt-2023-en-pt-train-val-split-0.2 0 0
Opened here shreevigneshs/iwslt-2023-en-ko-train-val-split-0.2 0 0
Opened here shreevigneshs/iwslt-2023-en-vi-train-val-split-0.2 0 0
Opened here shreevigneshs/iwslt-2023-en-es-train-val-split-0.1 0 0
Opened here shreevigneshs/iwslt-2023-en-ko-train-val-split-0.1 0 0
Opened here shreevigneshs/iwslt-2023-en-vi-train-val-split-0.1 0 0
Opened here dandrade/es-en 0 1
Opened here cahya/instructions-en 0 0
Opened here shreevigneshs/iwslt-2023-en-vi-train-split-v1 0 1
Opened here shreevigneshs/iwslt-2022-en-de 0 0
Opened here shreevigneshs/iwslt-2023-en-ko-train-split 0 0
Opened here shreevigneshs/iwslt-2022-en-es 0 0
Opened here loresiensis/corpus-en-es 0 1
Opened here NadiaHassan/ar-en 0 0
Opened here Rexhaif/mintaka-qa-en 0 0
Opened here mbarnig/Tatoeba-en-lb 0 0
Opened here yogiyulianto/twitter-sentiment-dataset-en 0 0
Opened here vocab-transformers/wiki-en-passages-20210101 0 0
Opened here OpenFact/CLEF23-CheckThat-1b-en 1 0
Opened here thesistranslation/distilled-ccmatrix-es-en 1 0
Opened here thesistranslation/distilled-ccmatrix-en-es 1 0
Opened here thesistranslation/distilled-ccmatrix-fr-en 1 0
Opened here shreevigneshs/iwslt-2023-en-vi-train-split 1 0
Opened here marksverdhei/wordnet-definitions-en-2021 1 1
Opened here Jackmin108/c4-en-validation-mini 2 0
Opened here thesistranslation/distilled-ccmatrix-de-en 2 0
Opened here yezhengli9/wmt20-zh-en 2 0
Opened here masoudjs/c4-en-html-with-metadata-ppl-clean 2 0
Opened here indiejoseph/wikipedia-en-filtered 3 0
Opened here thesistranslation/distilled-ccmatrix-en-fr 3 0
Opened here lsb/million-english-numbers 3 0
Opened here vhtran/de-en-official 4 0
Opened here yongsun-yoon/open-ner-english 4 0
Opened here Shularp/un_multi-ar-en 4 0
Opened here vhtran/uniq-de-en 5 1
Opened here TigerResearch/tigerbot-wiki-qa-bart-en-10k 5 0
Opened here RafaelMPereira/HealthCareMagic-100k-Chat-Format-en 7 2
Opened here vhtran/de-en 8 0
Opened here vhtran/id-en 8 1
Opened here openmachinetranslation/tatoeba-en-fr 8 1
Merged here Photolens/oasst1-en 10 1
Opened here kunishou/databricks-dolly-69k-ja-en-translation 22 7
Opened here vhtran/de-en-2023 23 1
Merged here Suchinthana/Databricks-Dolly-15k-si-en-mix 24 0
Opened here alvations/globalvoices-en-es 33 1
Opened here stas/wmt16-en-ro-pre-processed 40 0
Opened here j0selit0/insurance-qa-en 64 3
Opened here manu/opus100-en-fr 76 0
Opened here dmayhem93/agieval-sat-en-without-passage 86 0
Opened here dmayhem93/agieval-logiqa-en 86 0
Opened here dmayhem93/agieval-sat-en 87 2
Opened here manu/wmt-en-fr 107 0
Opened here vhtran/uniq-id-en 118 0
Opened here stas/wmt14-en-de-pre-processed 423 1
Opened here Jackmin108/c4-en-validation 1131 0
Opened here cfilt/iitb-english-hindi 1147 11
Merged here argilla/databricks-dolly-15k-curated-en 9651261 9

davanstrien commented 11 months ago

Awesome work! Thanks so much ❤️

If any of these are also in the table here, it would be great if you could add the status/link to that table (as a PR). Let me know if updating the table that much is tricky; I can do it from my end and include you in the PR!

stefan-it commented 11 months ago

Hey @davanstrien,

no problem, I added it in #989 :)

stefan-it commented 11 months ago

Thanks for merging, @davanstrien. I have a new MR coming with 115 new entries 🤗

stefan-it commented 11 months ago

@Wauplin could you please also label this issue with a hacktoberfest label? Thanks 🤗

snehilsanyal commented 11 months ago

Hey @davanstrien I just made my first dataset card edit in one of the repos. What should be my next step? Should I open a new PR just like @stefan-it did, maybe a WIP one? Because as of now I just did one, but might do others as well. Any help will be highly appreciated :D

davanstrien commented 11 months ago

> Hey @davanstrien I just made my first dataset card edit in one of the repos. What should be my next step? Should I open a new PR just like @stefan-it did, maybe a WIP one? Because as of now I just did one, but might do others as well. Any help will be highly appreciated :D

Thanks! Yes, it would be great if you could make a PR with a link to the PR you made on the Hub. It's fine if you want to make one PR now and do more later :)

Wauplin commented 11 months ago

> @Wauplin could you please label this issue also with a hacktoberfest label, thanks 🤗

Yes @stefan-it! Finally done on the entire repo. Thanks for letting us know and sorry for the confusion :)

snehilsanyal commented 11 months ago

Thank you @davanstrien! I just opened a PR here: #997. Please let me know if something is missing; this is my first PR to hub-docs :D

davanstrien commented 11 months ago

Thanks @snehilsanyal! I've just approved/merged that PR :)