huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

upgrade datasets to 2.14 #1550

Closed severo closed 1 year ago

severo commented 1 year ago

https://github.com/huggingface/datasets/releases/tag/2.14.0

main changes:

TODO:

TODO: 2.14.4

albertvillanova commented 1 year ago

I could take care of this.

albertvillanova commented 1 year ago

I will address this issue with the patch release 2.14.1

severo commented 1 year ago

good idea. In particular, it contains:

No gzip encoding from github by @lhoestq in https://github.com/huggingface/datasets/pull/6076

I imagine we will have to refresh all the datasets with the StreamingRowsError error. If they are fixed: yay, and if not, the worker will fail quickly, so it should take little time. 1,034 datasets currently have StreamingRowsError on split-first-rows-from-streaming

albertvillanova commented 1 year ago

I am updating datasets to 2.14.4:

Fix authentication issues by @albertvillanova in https://github.com/huggingface/datasets/pull/6127

We will be able to remove some authentication tweaks, where we had to use download_config instead of token. See:

severo commented 1 year ago

> I imagine we will have to refresh all the datasets with the StreamingRowsError error. If they are fixed: yay, and if not, the worker will fail quickly, so it should take little time. 1,034 datasets currently have StreamingRowsError on split-first-rows-from-streaming

I've launched the refresh of these:

db.cachedResponsesBlue.aggregate([{$match: {error_code: "StreamingRowsError"}}, {$group: { _id: null, datasets: { $addToSet: "$dataset" } }}])
{ _id: null,
  datasets: 
   [ 'abidlabs/crowdsourced-speech-demo',
     'turkish_movie_sentiment',
     'sunadori12/nva-anastasia',
     ...
    ] }
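
For readers without a MongoDB shell at hand, the aggregation above boils down to a filter followed by a de-duplicated collection of dataset names. Here is a minimal Python sketch of the same logic over in-memory documents, assuming each cache entry exposes the `error_code` and `dataset` fields used in the query. The sample entries and the function name `datasets_with_error` are made up for illustration.

```python
from typing import Iterable, Mapping


def datasets_with_error(entries: Iterable[Mapping[str, object]], error_code: str) -> list[str]:
    """Mimic {$match: {error_code}} followed by {$group: {$addToSet: "$dataset"}}."""
    names = {
        str(entry["dataset"])
        for entry in entries
        if entry.get("error_code") == error_code
    }
    # $addToSet gives no ordering guarantee; sort here only to get stable output.
    return sorted(names)


# Hypothetical cache entries: one dataset can fail on several splits.
sample_entries = [
    {"dataset": "user/a", "split": "train", "error_code": "StreamingRowsError"},
    {"dataset": "user/a", "split": "test", "error_code": "StreamingRowsError"},
    {"dataset": "user/b", "split": "train", "error_code": None},
    {"dataset": "user/c", "split": "train", "error_code": "StreamingRowsError"},
]

print(datasets_with_error(sample_entries, "StreamingRowsError"))
```

Because the names are gathered in a set, a dataset that fails on several splits is refreshed only once.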

the list is here:

abidlabs/crowdsourced-speech-demo turkish_movie_sentiment sunadori12/nva-anastasia narugo/test_repo_20230508031643644725_a9a109af2967e99037c0e593272a762215ce70eb abidlabs/crowdsourced-speech-demo2 cmrc2018 KarosY/3l2l_1207 Gnartiel/dsc-UIT hansards narugo/test_repo_20230513145308551423_2099e1294ccc1a4259723d28deeee87d6ec0d900 datablations/oscar-subsets narugo/test_repo_20230417141916207801_99d6af23ee201ecfd527cb61d172a0b77a5e5202 jkot/czech_parliament_plenary_hearings will33am/AVA Rami/utd_reddit chrisjay/test-mnist-2 Chris1/celebA-HQ diwank/hinglish-dump mstz/soybean jigsaw_unintended_bias KETI-AIR/coco rvs007/lele_rvs Rui6188/50States10K Abdelkareem/zikir_detection vmalperovich/SST5 potsawee/podcast_summary_assessment TenzinGayche/STT_CS Amro-Kamal/ObjectPose nanaaaa/emotion_chinese thewall/jolma_domains biglam/gallica_literary_fictions wiki40b tarteel-ai/tlog fcabanilla/tobby2 amanneo/enron-mail-corpus-mini MohammedHB/AraPOS lst20 openslr cQueenccc/Office-Home savithri/layoutmv3 fabraz/writingPromptAug nthngdy/oscar-mini EdwardLin2023/MELD-Audio george-chou/piano narugo/test_repo_20230508031755902449_664aad8bc41f403a640bb95f8985ee43db37b98c narugo/test_repo_20230511154222208027_d0311fbc76f3a9f109169fb637781d8fdeca8136 magotan/1 godlzj/SD-AN bigbio/bionlp_st_2011_id HuggingFaceM4/howto100m ollie HuggingFaceM4/COCO masked-neuron/ccd tanupriyasingh1234/celeb-identities-2 DavidVivancos/MindBigData2022 society-ethics/laion2b-en-vit_embeddings punglee/librispeech_asr nlphuji/whoops-analysis v-xchen-v/truthfulqa_true watchurstepts/AnythingFaces AkumaSerenity/Celandra siavava/mmod2 anton-l/earnings22_baseline_5_gram crystina-z/inlang-mrtydi-corpus TrainingDataPro/generated-usa-passeports-dataset krenerd/chunked 4eJIoBek/Green-Elephant-Remixes ahlem-phantom/ls_generated urdu_sentiment_corpus bowphs/trial_c50 texonom/texonom-md openpecha/tibetan_voice_v2 Raul023/Paddy mariosasko/test_tags matallanas/lexFridmanPodcast metaeval/bigbench mozilla-foundation/common_voice_7_0 
g0d/BroadcastingCommission_Patois_Dataset george-chou/pianos Ruohao/pcmr Axel578/mydt RUC-DataLab/rel-heter esc-bench/esc-datasets bigbio/bioasq_2021_mesinesp Leyo/TGIF Pavankalyan/chitti_data funnymdzz/diffsinger-chuansao258 narugo/test_repo_20230508031714786825_b13b69b2c1505416a1fd05e0247af3042ba78c0f mozilla-foundation/common_voice_2_0 KETI-AIR/aihub_dialog_summarization narugo/test_repo_20230508031747351241_91b4b3ea1bc4e18c28a61a30387b0b9acd70ef40 universal_morphologies abdusah/arabic_speech_massive Einstellung/CXR_Reports dariolopez/suicide-comment-es-space-human-feedback ShussarSDFA/WipThings GEM/xsum BFHF/16.02.2023 google/xtreme_s patrickvonplaten/librispeech_asr_self_contained mrm8488/unnatural-instructions zmeanszachary/ipl1 rungalileo/test relaimposter/smoothieV1 bdsaglam/musique arpelarpe/nota narugo/test_repo_20230508111905822829_11be182af3a84c3f02c10b5ed04380d3c7833804 poolrf2001/FaceMask crumb/testing-image-dataset-1k tti-bias/prof_report__SD_v2_random_seeds__multi__24 marianna13/handbook-chemistry-physics-tables Fyaer/test agrim2603/lilt_dataset hadikhamoud/test ICA-PUC/ccus_images KoddaDuck/Cylonix_ASR_dataset qgallouedec/prj_gia_dataset_metaworld_assembly_v2_1111 Yaxin/SemEval2015Task12NLTK ar_cov19 bprec SocialGrep/the-reddit-covid-dataset rdpahalavan/CIC-IDS2017 sheikh/FCD_lmv2 kirim9001/WaVcc Stardrums/pico-breast-cancer EMBO/biolang richardr1126/spider-schema faisal-hugging-face/plant-disease kingabzpro/Urdu-ASR-flags2 squad_adversarial spanish_billion_words colbertv2/lotte_passages sheikh/layoutlmv3 yuyang/bart_newsroom neuclir/neumarco flax-community/german_common_crawl florianbussmann/train_tickets-yu2020pick kotarodayo1126/nva-usadahikaru msr_genomics_kbcomp mozilla-foundation/common_voice_8_0 gisu/nva_desuko izumi-lab/ai2_arc-ja-mbartm2m rdg5/rdg5_indi RossVermouth/test_dataset rvashurin/wikidata_simplequestions reeink/dota alexshengzhili/SciCapPlus2LLAVA NamVo/word_embedding freddyaboulton/callback-test abidlabs/Urdu-ASR-flags2 
narugo/test_repo_20230508031708590455_3b091a2a065d192b322557bece4fbc490d98be3d Falah/new_dataset2 orieg/elsevier-oa-cc-by Martha-987/vivos research-backup/conceptnet whizystems/whizystems_invoices TrainingDataPro/license_plates GizemG/emotionText Yaxin/SemEval2014Task4NLTK cemachelen/Leeds_SciML_SeaIce_2023 karmiq/glove CyberHarem/test_reines_fgo McGilbertus/infoleg_10k_ner amitness/logits-maltese so_stacksample larrylawl/multilexnorm mulcyber/europarl-mono tti-bias/prof_images_blip__dalle-2 cryptonite djaym7/wiki_dialog_mlm ProjectNekoFi/AnimeDiffusion freddyaboulton/new_saving_json xiaojuan0920/cskg_2 lighteval/copyright_helm iceberg-nlp/climabench jatinakad/RSI Evelyn18/becas imppres MITCriticalData/unlabeled-10-top-cities-16-bit-depth senti_ws gonzalobenegas/processed-data-arabidopsis crystina-z/no-nonself-mrtydi-corpus narugo/test_repo_20230513132427055361_9a2663d640864d3d0efdfad4ab57a23417b21238 trojblue/Public-Datasests KELONMYOSA/dusha_emotion_audio Salama1429/tarteel-ai-everyayah-Quran Rui6188/50States10K_Sample bigbio/seth_corpus gendisjawi/ipaipa poolrf2001/mask bouim/dvoice3 jimregan/clarinpl_studio jieunnie/ColorLand2 narugo/test_repo_20230511154138473720_ed3a3133cf6ae3bcbabad33bd6d88d690c489385 prk/testsq Perkhad/corejur zhoujy53/sync mozilla-foundation/common_voice_5_0 pierreguillou/DocLayNet-large mozilla-foundation/common_voice_4_0 ChaiML/edit_response_rm_dataset_processed LanceaKing/asvspoof2019 KETI-AIR/aihub_document_summarization sarulab-speech/bvcc-voicemos2022 shivangibithel/SOTAB bigcode/commits-pjj-2048 flexthink/ljspeech mwhanna/ACT-Thor ShoukanLabs/OpenNiji-380001_415000 pytc/AxonEM Msun/usv chrisjay/crowd-speech-africa wiki_snippets karmiq/glove-50d djaym7/wiki_dialog TempoFunk/small tasksource/bigbench darkproger/librispeech_asr id_liputan6 turkish-nlp-suite/turkish-wikiNER wisdomify/story pytc/NucMM ncats/EpiSet4NER-v1 tasksource/babi_nli MaCoCu/parallel_data reclor justram/COCO-2014-images wkrl/cord jakartaresearch/causalqa 
alexkueck/tis SpeechBigBench/EnhancementDetection_LibriTTSTestCleanWham um005 rossevine/CorpusAudio_v2 gannikim/stock_ant abidlabs/crowdsourced-test5 RuiqianLi/Li_singlish TheBossLevel123/mclandscape classla/ssj500k ugshanyu/jambal4 kingjambal/mnvoice hieuhocnlp/deep-research hltcoe/megawika GordZhao/face metaeval/recast camenduru/test-275001_310000 george-chou/pianos_mels TrainingDataPro/hand-gesture-recognition-dataset semaj83/ctmatch_classification freddyaboulton/dataset_json_5 jlmarrugom/cif_imgs jozierski/ecomwebtexts-pl smit-mehta/marvel-actors-faces nizamreplica/libritts-test-1 superb fractalego/QA_to_statements burakekim/mapinwild tdklab/Hebrew_Squad_v1 TeALAiN/nva-vii Serhii/Custom_SQuAD polinaeterna/earnings22 matinf narugo/test_repo_20230415143845584483_4e9edf0d20eb2c4e45d5a79bdbd2d429fc9e8c86 yuvalkirstain/PickaPic-ft-pairwise Livingwithmachines/MapReader_Data_SIGSPATIAL_2022 lj_speech lucainiao/MAESTRO_2004_SYNTH ami peixian/rtGender code_x_glue_ct_code_to_text newsroom BeIR/beir-corpus EMBO/sd-nlp-v2 zhangxuri/test makiour/dvoice-Darija narugo/test_repo_20230513132727042377_ddafd7d99cf45cce28be86f91c2c1d680af0abd0 taejunkim/djmix datablations/python-megatron KTH/waxholm lapki/test davanstrien/MAMe Slep/LAION-RVS-Fashion bigbio/pubmed_qa CustomHomeAI/2D-elevation-dataset lukaemon/mmlu 04-07-22/wep-probes SocialGrep/the-antiwork-subreddit-dataset gofixyourself/EasyPortrait youssef101/artelingo wbxlala/HAR tarou537/nva-horo tyouisen/aclue albertvillanova/sat Drewd/lex_fridman_podcast_transcripts Glac1er/idktest XiangPan/multi_nli_with_bias_split ubuntu_dialogs_corpus Malisha/TTFormLMM shivangibithel/SATO Alperennn/RSNA_BreastCanser jkot/dataset_merged_preprocessedv2 McGill-NLP/full-wiki-segments-parquet narugo/test_repo_20230508093021403058_c8bab27edcec03f0021233008347df45c77bfb01 noorlight/captioned_dataset jonatli/the_pile_mystic MichiganNLP/scalable_vlm_probing Hoseindb/a-private-dataset THU-StarLab/test_evaluation_dataset 
harshal-07/activity_detection times_of_india_news_headlines zpn/pubchem-selfies grabbysingh/funsd Champion/vpc2020_clear_anon_speech ThraggBilly/billy_dataset55 ecoue/nordmann2023 DFKI-SLT/gids AIML-TUDA/face_attribute_benchmark mjwong/amazon_reviews_multi-bezt guardian_authorship ZihaoLin/zhlds ArmelR/stack-exchange-instruction BeIR/beir changlin13/pdpc_faqs maxardito/beatbox fquad magotan/5 HAERAE-HUB/csatqa arabic_billion_words narugo/test_repo_20230513130705823866_77f7cd3170cb79e73812ca3a52465cb0253e8fca SIA86/LFQAKnowledgeBase neelalex/raft-predictions BSC-LT/viquiquad renumics/speech_commands_enriched giulio98/xlcost-single-prompt ggxxii-AI/testing GEM/wiki_auto_asset_turk carlosejimenez/seq2seq-glue immn01484/lora_train bstds/geco_data_generator jslin09/wikipedia_tw GIL-UNAM/negation_twitter_mexican_spanish Ubenwa/CryCeleb2023 apetulante/mortars_test AI4EPS/quakeflow_das l-yohai/ASAP Yukang/Pile-subset patriziobellan/PET jmamou/augmented-glue-sst2 BSC-LT/tecla ms_terms trojblue/public-datasests Joanne/UBMI ncats/EpiSet4BinaryClassification TrainingDataPro/face_masks biu-nlp/qa_discourse allenai/peer_read narugo/test_repo_20230508090340219358_468d45ec9ef845a86e9dca2a1b115f315f2c64d6 KenDoStudio/Burnout3_DJStryker Adorg/ToolBench BritishLibraryLabs/EThOS-PhD-metadata inquisitive_qg mstz/fertility tti-bias/prof_report__SD_v1.4_random_seeds__multi__24 jianghuzhenyu/Atari_floringogianu sijpapi/batch13 vldsavelyev/murakami KoddaDuck/dataset_backup danasone/taiga daydrill/QG_aihub metaeval/linguisticprobing VityaVitalich/IMAD narugo/test_repo_20230513130426526634_09cda9529a52063112d829614bf671b5585b93ad MohammedHB/AraPOS2 jigsaw_toxicity_pred princeton-nlp/glue_fairseq_format bigcode/commitpackft AkikoOu/hqzBeijingOpera-images davanstrien/european_art Smoden/ALICE_IMAGE_DATASET TrainingDataPro/facial-hair-classification-dataset narugo/test_repo_20230513145125285205_229312073a9d675ca0461ad35803b8b1b3abee5d recipe_nlg prvInSpace/banc-trawsgrifiadau-bangor 
seyidov579/azerbaijan Shularp/Process_tested minwook/novelImg vctk KShivendu/dbpedia-entities-openai-1M msr_zhen_translation_parity jimregan/clarinpl_sejmsenat jianghuzhenyu/VIMA_data eduge Sampson2022/demo2 Yaxin/SemEval2016 SocialGrep/ten-million-reddit-answers Glac1er/June qwq233/zzj rfernand/basic_sentence_transforms aeropriest/ariel mdd story_cloze wics/ceval tgelton/GoEmotions mikael17125/ronin_pretrain CyberHarem/noah_nikke taqwa92/tm_data datablations/c4-filter-megatron nandovallec/df_ps_train_extra narugo/test_repo_20230413051810478970_8f9f45b9e8bf07205aa080b46dc69dfc6db88bd3 Andy-Messer/Swiss_German_Audio_processed_2 alvations/stash hebrew_this_world TARO0224/nva-rui honlzl/generated weiji14/hlsfm_burn_scar howey/super_scirep_test zmeanszachary/ipl luoxiaojun1992/autotrain-data-luban eezy/basic_shapes_10k LIAMF-USP/arc-retrieval-c4 rogerdehe/xfund tti-bias/prof_images_blip__SD_v1.4_random_seeds narugo/test_repo_20230508090734379602_f5260aba079eae9a8fbbee6402313e00a0003547 narugo/test_repo_20230508111107383213_5b85cd1c6753ac97f6d68f1bb75369e2310cb604 biu-nlp/qamr sartajekram/BanglaRQA bjoernp/german_pretrain_mono_jsonl SocialGrep/the-reddit-place-dataset GEM/opusparcus datablations/oscar-filter-megatron PolyAI/minds14 narugo/test_repo_20230511154031349800_422a81a97a9816b40acc64334f7e379e5eae9232 albertvillanova/tmp-imagefolder-remote albertvillanova/tmp-mse Chaymaa/roic_donuts alcanodi/images_stb_dfs dfki-nlp/tacred duongttr/combined-pretrained-dataset Harveenchadha/indic-voice hapandya/sqnnr luck4ck/pre_hospitial_care nyanko7/coco-hosted mn367/radio-dataset-test maximedb/mcqa_light jordancaraballo/alaska-wildfire-occurrence shevek/LULC Sotaro0124/Ainu-Japan_translation_model ruanchaves/hashset_manual narugo/test_repo_20230508031853132835_fb0871416785c5d5ffb7d667e20046c084556f1f ebrigham/agnewsadapted Tverous/claim2 KshitizPandya/GenzTranscribe-hi CyberHarem/mashiro_bluearchive Xieyiyiyi/ceshi0119 
narugo/test_repo_20230417075551505162_a0e97c8baea68680c9cff1ff9792c6ead93d534a jacksonkstenger/lofiHipHop ai4bharat/kathbath davanstrien/test1 torchgeo/l7sparcs rkstgr/mtg-jamendo SeanSleat/vctk newsqa Rami/utd_reddit.json narugo/test_repo_20230511154011842367_282f87f5de605993cc2c50bff1b78b3e6c0e3d40 telugu_books thacio/tokenized-concat-wiki-gov-rand30M-2048 sileod/mindgames narugo/test_repo_20230513145057750905_eee9c0366cf018b60b98e0b169dade0a079658ed SotiriosKastanas/trygroto nateraw/quickdraw sustcsenlp/SUBESCO nouamanetazi/ar_opus100_processed DFKI-SLT/knowledge_net emily49/hateful_memes_test kingabzpro/Rick-bot-flags abidlabs/callback-test autshumato janak2/3second yuansui/GitTables ugshanyu/jambal ArtifactAI/arxiv-math-instruct-50k Salesforce/dialogstudio yoruba_text_c3 Paulborowy/pictures imvladikon/nemo_corpus KoddaDuck/fleur GEM/squality mesolitica/malaysian-news RiTA-nlp/ITALIC jbrat/scienceqa pdearena/NavierStokes-2D-conditoned ArtifactAI/arxiv-cs-ml-instruct-tune-50k kartik727/Test_Dataset narugo/test_repo_20230513145053286622_f986c57da603f5a92e2fb05b8b59d692099a1b28 bigbio/pubhealth jglaser/pdb_protein_ligand_complexes BrianWan221/trial mazkooleg/google_speech_commands_augmented_fe_facebook-wav2vec2-base BSC-LT/ancora-ca-ner biglam/early_printed_books_font_detection allenai/lila venetis/customer_support_sentiment_on_twitter Jornt/calculations davanstrien/test_imdb_embedd zahoor54321/Urdu-ASR-flags anton-l/common_language Q78KG/opencpop-segments PranomVignesh/test BramVanroy/ud_dutch_lassysmall TrainingDataPro/grocery-shelves-dataset bigscience/P3 freddyaboulton/callback-test-3 lksy/ru_instruct_gpt4 poleval2019_mt zmeanszachary/adad AzadDjan/cord narugo/test_repo_20230508031641333198_ed7feea6e7d2dee003388805e23223fe649cfd88 bouim/dvoice2 atenglens/taiwanese_english_translation jcantlord/myphotos narugo/test_repo_20230414145118309496_a0f42f87857c91c5542c6a2d064dbd26071e8b39 tanmaykm/indian_dance_forms ewof/koishi-instruct-metharme 
albertvillanova/tmp-imagefolder-remote-2 narugo/test_repo_20230508111121858411_c5243a43e6c6e4211a5c66c7463a9b338e1e147a narugo/test_repo_20230513132509308466_98217ef52be7daf9d64d8a1b2b57b8cd63d2fc09 MohamedExperio/ICDAR2019 mikeee/chroma-paraphrase-multilingual-mpnet-base-v2 shwetkm/TextCaps-Caption-Summary KETI-AIR/aihub_book_summarization ThierryZhou/test freddyaboulton/callback-test-2 moro23/Hausa-ASR-flags classla/setimes_sr AnanthZeke/naamapadam hamza50/physical_activity leo123/squad_posgrados taqwa92/mg21_data albertvillanova/mtet ivelin/ui_refexp longevity-genie/openai_6000_chunk_modules camimo/sukasuka-Dataset coached_conv_pref nateraw/wit arpitamangal/flower-blip-weights winvoker/lvis alexwww94/SimCLUE gsarti/change_it open-asr-leaderboard/datasets Evelyn18/becasv3 tttarun/indic_superb_hindi darksensei/vqabd-test2 imthanhlv/binhvq_dedup chrisxx/laion2b-en-10K-subset narugo/test_repo_20230426065336463945_7d14bb4aceafc498a704dbfb9b54c5a95b4bf250 dlwh/eu_wikipedias wmt19 MITCriticalData/Unlabeled_top_10_cities_forward_backward_alg openclimatefix/nimrod-uk-1km mekaneeky/masked_language_model_v0_1 bigbio/bioscope GEM/xwikis narugo/test_repo_20230508111054635420_5f1a03974c7ad389c2d1446016edee772fd87c8d ivelin/rico_sca_refexp_synthetic cakiki/args_me Rossil/realnewslike tasksource/crowdflower vipin0803/easyreach_test EdwardLin2023/MELD_Audio_3Labels sem_eval_2020_task_11 Omar2027/caner_replicate udayl/ALPR_aman narrativeqa trojblue/Public-Datasest SocialGrep/the-reddit-climate-change-dataset GEM/BiSECT RJKiseki/CAMELYON16 bgglue/bgglue sngsfydy/aptos_train shreyasharma/masked_step_label RGBD-SOD/rgbdsod_datasets irc_disentangle kartikay24/User-Testing L4NLP/LEval lighteval/lextreme Howuhh/nle_hf_dataset nvm472001/cvdataset-layoutlmv3 bio-datasets/e3c PremDeep/AM-Industry albertvillanova/test narugo/test_repo_20230426065301796402_6d378d55998aa6d9fa90aa1adc33cde607dbb292 MLCommons/peoples_speech cakiki/paperswithcode kotarodayo1126/nva-zero bigbio/medhop sheikh/SLR 
narugo/test_repo_20230415111923667161_683612cdd06c93db51f4b0217f436fbd0fbad0cf Sadashiv/Plant-Diseases-Dataset rossevine/CorpusAudio_V4 hippocorpus zmao/food_img_caption zwlShawn857/tUbeNet_Example_Dataset SocialGrep/the-reddit-irl-dataset narugo/test_repo_20230508111001086572_80c2ab8c84add44f5d361ae7090ffef9107a6ea4 narugo/test_repo_20230513132414982589_1de1e397d8e465cf598872c0fd8d40c441032bd7 eezy/basic_shapes_1000 unaidedelf87777/openapi-function-invocations-25k guydegnol/bulkhours mozilla-foundation/common_voice_6_0 revdotcom/earnings22 KETI-AIR/aihub_paper_summarization polinaeterna/vox_lingua lapix/CCAgT nlphuji/beyond_web_scraping AI4EPS/quakeflow_nc narugo/test_repo_20230513145337038401_7437bb29dbe6adde22da48600b0392b1fbbf811a narugo/test_repo_20230508090245232363_bf61def7dac6e7678b8a20c8560855c571d2a305 texturedesign/td02_urban-surface-textures BramVanroy/hebban-reviews davebulaval/RISCBAC jamescalam/unsplash-25k-photos KaraKaraWitch/MyselfAndEveryone BramVanroy/ud_dutch_alpino svenjars/dataset_new ArielACE/akira_toriyama semeru/completeformer-masked tti-bias/prof_report__dalle-2__multi__24 tai94bn/hopdong ZongqianLi/Dye_Sensitized_Solar_Cells_Papers_RSC Fucheng/train_data shivangibithel/Flickr8k Pavithra/sampled-code-parrot-train-100k narugo/test_repo_20230508111856245832_1ca22dea2edb42f39bbb41a66aea0c723241bbce diwank/silicone-merged Fraser/wiki_sentences oaklight/tvsg-llm-derived-dataset bigbio/meddialog RJKiseki/TCGA taeshahn/ko-lima jinmang2/ucf_crime pdearena/NavierStokes-2D flores Sofoklis/hp_dataset rossevine/CorpusAudio_v3 amitness/logits-italian fetch-rewards/inc-duplicates-war-1807 Glac1er/May kaist-ai/CoT-Collection_multilingual craigslist_bargains nlp-thedeep/humsetbias Yaxin/SemEval2016Task5NLTK rudraml/fma unwilledset/raven-data CyberHarem/kalina_girlsfrontline howey/super_scirep novay/gender-detections innnky/taffynyaru epsilonator/double_numbers rlasseri/test-OrangeSum-small loyi/my_music_trans 
julien-c/autotrain-dreambooth-marsupilami-data camenduru/test-345001_380000 narugo/test_repo_20230508031633694900_e7dfc9a6526a13842fe374a47832a5764a49375f HamdiJr/Egyptian_hieroglyphs EMBO/sd-nlp-non-tokenized DFKI-SLT/multitacred thewall/jolma_unique pytc/NucExM bio-datasets/e3c-llm dlb/plue janak2/3second-small-2 hails/asdiv AlexFierro9/imagenet-1k_test KETI-AIR/vqa style_change_detection barry556652/data aymanelmar/joha mstz/bank cq01/Math23K arxiv_dataset HausaNLP/Naija-Lex fedryanto/qas multilingual_librispeech sijpapi/funsds esc-benchmark/esc-datasets SocialGrep/reddit-nonewnormal-complete Fsoft-AIC/the-vault-inline nyanko7/vbp-cached H2KP/cdip-annotations-formnet CIRAL/ciral QuanticBit/autotrain-data-fimages polinaeterna/push_to_hub_singe_nondefault_config metaeval/blimp_classification MightyStudent/Egyptian-ASR-MGB-3 KyonBS/DokudamiDS ugshanyu/jambal3 arsentd_lev moodlep/dt_atari_replay_hf joelito/lextreme DarthReca/california_burned_areas telugu_news biglam/yalta_ai_tabular_dataset covid_qa_ucsd nanom/spanish_dataset_test covost2 narugo/test_repo_20230508111814511378_424f9841727edf6cc55d0390cb92d5a34fe5c314 polsum kannada_news TobiTob/CityLearn CHSTR/rock_glacier eduardoprea44/deepfashion-multimodal khalidalt/model-written-evals fusing/geodiff-example-data biwi_kinect_head_pose oyk100/ChaSES-data ai4bharat/samanantar yuxiangwang/flat_relation janak2/3second-small narugo/test_repo_20230508111110220640_1d059f57949a504c250f7d58577f6603747fa6dd ccdv/mediasum AsakusaRinne/gaokao_bench thesistranslation/wmt14 foduucom/table-detection-yolo suolyer/book_zlib_part evelyncsb/ccus_imagebind_v1 freddyaboulton/new_saving_csv_8 Enutrof/English-NigerianPidgin-Result-Validation freddyaboulton/new_saving_csv_9 abidlabs/crowdsourced-test3 spacemanidol/cc-stories lighteval/bigbench_helm lighteval/lsat_qa abidlabs/Urdu-ASR-flags roskoN/dstc8-reddit-corpus Protegee/Ciri_dataset_93img biglam/nls_chapbook_illustrations favs/favsbot 
narugo/test_repo_20230511154254054883_9dfa587764f7956b3cde0383a19c4d0dcbb0628f prashanthpillai/docvqa_1200_examples antonkulaga/openai_6000_chunk_papers narugo/test_repo_20230508093058873074_74eb6795078c5155d250b743c23a9b3fff8cd8e0 uva-irlab/canard_quretec AresEkb/prof-standards-sbert-large-mt-nlu-ru livinNector/wikipedia anjalyjayakrishnan/sample biglam/unsilence_voc Syrina/donuts_dataset DFKI-SLT/tacred teknofest2022/2022-model-weights xglue allenai/cord19 eli5 pepa/bg-fake-news narrativeqa_manual wmt18 george-chou/pianos_mel Rui6188/50States2K mnbvcx/XFUND-LiLT 32j3fd3d23nvewj23frd/mistebisbividyotetki spacemanidol/PDFind-corpus rossevine/CorpusAudio pdearena/Maxwell-3D BSC-LT/xquad-ca jfrenz/legalglue Yaxin/SemEval2015 deepghs/generic_characters KETI-AIR/aihub_summary_and_report HuggingFace-CN-community/Diffusion-book-cn zeyneppktemm/deneme comodoro/vystadial2016_asr jieunnie/ColorLand jainr3/diffusiondb-pixelart baho05/baho ProjectNekoFi/SwordArtDiffusion 1aurent/individuality-of-handwriting keminglu/InstructOpenWiki sobir-hf/tajik-text-segmentation joshuav1/test fqa-cyber/TExtPhish KETI-AIR/klue kingabzpro/Urdu-ASR-flags mozilla-foundation/common_voice_5_1 StampyAI/alignment-research-dataset mozilla-foundation/common_voice_10_0 CyranoB/polarity KoddaDuck/fleurs johnt/got nateraw/musicgen-samples atomic bigbio/linnaeus Allen166/hftestrepo2 thewall/jolma azad-wolf-se/MH-FED davanstrien/test_imdb_embedd2 GIL-UNAM/SpanishParaphraseCorpora nguyenvulebinh/libris_clean_100 sxu/CANLI BDas/ner nchlt nlaz/images crystina-z/no-nonself-title-mrtydi-corpus narugo/test_repo_20230508031757895182_da63ae623ab02e37e83f36b7b6543c7dab854c5d verbrannter/invoice_dataset-batch5 RAPTORIDK/Face yasir-datascience/Wav2Vec-Finetune-English Nacken/kirstenbilder lighteval/pile_helm narugo/test_repo_20230508090405581473_6b2ffa27d2531df68aebf5adf85b788f75524ca5 GEM/wiki_lingua narugo/test_repo_20230508110956705984_894f3a011d0fbc38728d8165afb9c5c0c2ac9a0a Ssua/testdata fp16-guy/grids 
KETI-AIR/aihub_spoken_language_translation narugo/test_repo_20230511154334623927_359b1bec0ee7d622c1914a46ba1e108d5875deb6 sdfhg5243/pepes2 masakhane/afriqa-gold-passages Koteee/dbtest savithri/savlayoutmv3 MITCriticalData/unlabeled-5-top-cities-16-bit-depth iamkzntsv/ixi2d artificialhoney/graffiti-old robertmyers/pile_v2 osbm/prostate158 sanshanya/eyesdiffusion clarin-pl/multiwiki_90k DBL/test Bisi/DivSumm paniniDot/sci_lay doqa disi-unibo-nlp-org/COMMA PranomVignesh/builder-script-test narugo/test_repo_20230513145145760974_c8f52cd2230bb82cb3617351572cb75733c7075a flue bouim/dvoice3_alltrain cringgaard/boats_dataset cooleel/xfund_de chenxwh/gen-xcopa BigScience/P3 auliaadila/indspeech-news-lvcsr turkish_shrinked_ner huggingface/label-files jamescalam/pokemon fersebas/Fer Darkenlord1/pics AlhagAli/Poor_Quality texturedesign/td01_natural-ground-textures Zaid/tatoeba_mt research-backup/conceptnet_high_confidence spktsagar/openslr-nepali-asr-cleaned wikihow jordyvl/rvl_cdip_easyocr opentensor/openvalidators-test msr_text_compression severo/LILA narugo/test_repo_20230513130451109231_ab43b577f394a162090e030390b89103094b5a1e narugo/test_repo_20230513131130297825_7974526645ee6c109ef429c23072efc805c0ac2f HuggingFaceM4/TGIF MMInstruction/M3IT-80 veriga/happysad zhanghanchong/css tti-bias/prof_images_blip__SD_v2_random_seeds florianbussmann/FUNSD-vu2020revising research-backup/semeval2012_relational_similarity doc2dial narugo/test_repo_20230414144822025169_7db7e654f5161ef5c3285ed4441107aad0e171ed narugo/test_repo_20230508092803678577_024a107f95e87fea3c35f48371ff7e54dead931c SIA86/TechnicalSupportCalls biu-nlp/Controlled-Text-Reduction-dataset astroy/WHU-Urban-3D flexthink/librig2p-nostress-space brwac Howuhh/nld-aa-taster pufanyi/MIMICIT BerMaker/beans NLPC-UOM/Sinhala-POS-Data badokorach/NewQA alpindale/visual-novels rokmr/pets evelyncsb/ccus_images esc-bench/esc-diagnostic-dataset
AndreaFrancis commented 1 year ago

Regarding "Refresh all the datasets with only one config": I am launching a force-refresh of the dataset-config-names step for the datasets listed in datasets-one-config.csv

I used the following query to get the datasets:

datasets_server_cache> db.cachedResponsesBlue.find({kind:"dataset-config-names", http_status:200, "content.config_names":{$size:1}, "content.config_names.0.config":{$ne:"default"}, "content.config_names.0.config":/--/})
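
One caveat about the filter above: it repeats the key `content.config_names.0.config`, and in a JavaScript object literal the last occurrence of a key replaces the earlier ones, so only the `/--/` condition is actually sent. The result is likely unchanged here, since a name containing `--` cannot equal `default`, but merging both operators under one key makes the intent explicit. Below is a minimal Python sketch; the `matches` helper and the sample documents are assumptions for illustration.

```python
import re

FIELD = "content.config_names.0.config"

# Python dict literals behave like JavaScript object literals: the last duplicate key wins.
duplicated = {FIELD: {"$ne": "default"}, FIELD: {"$regex": "--"}}

# Both operators merged under a single key, so neither is silently dropped.
merged = {FIELD: {"$ne": "default", "$regex": "--"}}


def matches(doc: dict) -> bool:
    """Local stand-in for the merged filter: one config, not "default", containing "--"."""
    if doc.get("kind") != "dataset-config-names" or doc.get("http_status") != 200:
        return False
    configs = doc.get("content", {}).get("config_names", [])
    if len(configs) != 1:
        return False
    name = configs[0].get("config", "")
    return name != "default" and re.search("--", name) is not None


# Hypothetical cache documents.
docs = [
    {"kind": "dataset-config-names", "http_status": 200,
     "content": {"config_names": [{"config": "user--repo"}]}},
    {"kind": "dataset-config-names", "http_status": 200,
     "content": {"config_names": [{"config": "default"}]}},
    {"kind": "dataset-config-names", "http_status": 200,
     "content": {"config_names": [{"config": "a--b"}, {"config": "c--d"}]}},
]

print(duplicated)
print([matches(d) for d in docs])
```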

AndreaFrancis commented 1 year ago

For datasets with one config:

db.cachedResponsesBlue.countDocuments({kind:"dataset-config-names", http_status:200, "content.config_names":{$size:1}, "content.config_names.0.config":{$ne:"default"}})

We still have 25548 of them, but I would like to keep force-refreshing them incrementally: we currently have 859K pending jobs in the queue, and I would like to avoid overloading the database.
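
The incremental approach described above can be sketched as a loop that only submits a new batch when the queue has room. This is a rough Python illustration, not the project's actual tooling: `count_pending_jobs` and `force_refresh` are hypothetical callables standing in for whatever reads the queue size and triggers a refresh, and the batch size and threshold are arbitrary.

```python
import time
from typing import Callable, Sequence


def refresh_incrementally(
    datasets: Sequence[str],
    count_pending_jobs: Callable[[], int],  # hypothetical: returns the current queue size
    force_refresh: Callable[[str], None],   # hypothetical: triggers one refresh
    batch_size: int = 100,
    max_pending: int = 10_000,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit refreshes batch by batch, pausing while the queue is too loaded."""
    submitted = 0
    for start in range(0, len(datasets), batch_size):
        # Wait until the queue drains below the threshold before adding more work.
        while count_pending_jobs() > max_pending:
            sleep(wait_seconds)
        for dataset in datasets[start:start + batch_size]:
            force_refresh(dataset)
            submitted += 1
    return submitted
```

Passing `sleep` as a parameter keeps the loop easy to exercise in a dry run without actually waiting.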

AndreaFrancis commented 1 year ago

We still have 23402 datasets with one config to backfill. It is pending because of queue overload.

severo commented 1 year ago

> We still have 23402 datasets with one config to backfill. It is pending because of queue overload.

Is it fixed now @AndreaFrancis?

lhoestq commented 1 year ago

Not yet I think, but the queue is empty now so feel free to fill it :)

AndreaFrancis commented 1 year ago

I will continue with the refresh for datasets with one config

AndreaFrancis commented 1 year ago

I finished updating the datasets with one config. Only the following 331 records are missing, and that is because the datasets don't exist on the Hub. Maybe we should remove those records? missing.csv

severo commented 1 year ago

Yes, good question: how do we detect and clean up this kind of leftover cache entry?

We have a dedicated issue: https://github.com/huggingface/datasets-server/issues/1285. Maybe it's worth commenting there.

Also, look at https://github.com/huggingface/datasets-server/issues/1219: we already recompute old entries, but I'm not sure whether they are deleted when the dataset no longer exists.

AndreaFrancis commented 1 year ago

> Yes, good question: how do we detect and clean up this kind of leftover cache entry?

I wonder why they don't get deleted when the webhook is called (https://github.com/huggingface/datasets-server/blob/main/services/api/src/api/routes/webhook.py#L77). Maybe we have a bug there, or maybe the webhook was not called?

AndreaFrancis commented 1 year ago

> Also, look at https://github.com/huggingface/datasets-server/issues/1219: we already recompute old entries, but I'm not sure whether they are deleted when the dataset no longer exists.

No. In this case, calling either force-refresh or dataset-backfill performs no action, because get_dataset_git_revision throws a NotFound exception that prevents any operation on the old entries.

https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/dataset_backfill.py#L50 https://github.com/huggingface/datasets-server/blob/main/services/admin/src/admin/routes/force_refresh.py#L72

Some of those records might be deleted once we implement a TTL on the cache collection.
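
One way to surface such leftover entries in the meantime is to treat the not-found error as a signal rather than a dead end. The Python sketch below is an illustration under assumptions, not the code behind the links above: `DatasetNotFound` and the `get_revision` callable are hypothetical stand-ins for the exception and the revision lookup mentioned in this comment.

```python
from typing import Callable, Iterable


class DatasetNotFound(Exception):
    """Hypothetical stand-in for the NotFound error raised by the revision lookup."""


def find_orphaned_datasets(
    cached_datasets: Iterable[str],
    get_revision: Callable[[str], str],  # hypothetical: raises DatasetNotFound if missing
) -> list[str]:
    """Return cached dataset names whose repository can no longer be resolved."""
    orphans = []
    for name in cached_datasets:
        try:
            get_revision(name)
        except DatasetNotFound:
            # The dataset is gone from the Hub: its cache entries are cleanup candidates.
            orphans.append(name)
    return orphans


# Toy lookup for demonstration: only "user/a" still exists.
known_revisions = {"user/a": "abc123"}


def fake_get_revision(name: str) -> str:
    if name not in known_revisions:
        raise DatasetNotFound(name)
    return known_revisions[name]


print(find_orphaned_datasets(["user/a", "user/gone"], fake_get_revision))
```

The returned names could then be reviewed before deleting anything, which avoids removing entries because of a transient lookup failure.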

severo commented 1 year ago

> why don't they get deleted when calling webhook?

They are generally deleted, but it seems like not all the cases are processed correctly.

It would be good to have some metrics about that (not a priority) and investigate why some of them are not deleted.