huggingface / dataset-viewer

Backend that powers the dataset viewer on Hugging Face dataset pages through a public API.
https://huggingface.co/docs/dataset-viewer
Apache License 2.0

Only one cache entry, for the first step #2274

Closed by severo 9 months ago

severo commented 10 months ago

See causalnlp/corr2cause

Screenshot 2024-01-11 at 09:49:42

The first step was successful, but no other step was computed.

Reported here: https://huggingface.co/datasets/causalnlp/corr2cause/discussions/5

severo commented 10 months ago

Trying to see if there are other occurrences with:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
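
For readers without access to the MongoDB shell, the logic of this pipeline can be sketched in plain Python. This is only an illustration on made-up documents: the `sample_entries` list and the `datasets_with_single_entry` helper are hypothetical and not part of the dataset-viewer code base. It groups cache entries by dataset and keeps the datasets that have exactly one entry.

```python
from collections import Counter


def datasets_with_single_entry(entries):
    """Return the datasets that have exactly one cache entry.

    Mirrors the pipeline above: $group by dataset with a count,
    then $match on count == 1.
    """
    counts = Counter(entry["dataset"] for entry in entries)
    return sorted(dataset for dataset, count in counts.items() if count == 1)


# Hypothetical sample documents, shaped like cache entries (dataset + kind).
sample_entries = [
    {"dataset": "user/complete", "kind": "dataset-config-names"},
    {"dataset": "user/complete", "kind": "dataset-split-names"},
    {"dataset": "user/stuck", "kind": "dataset-config-names"},
]

print(datasets_with_single_entry(sample_entries))
```

A dataset whose processing went past the first step has several entries (one per step, config, or split), so a count of exactly 1 is a cheap signal that the pipeline may have stalled after the first step.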
severo commented 10 months ago

Not that many (60 datasets):

{ _id: 'speed1/nattan', count: 1 }
{ _id: 'wikimedia/wikipedia', count: 1 }
{ _id: 'bbaw_egyptian', count: 1 }
{ _id: 'joey234/mmlu-electrical_engineering-neg-prepend-verbal', count: 1 }
{ _id: 'imdatta0/ultrachat_1k', count: 1 }
{ _id: 'CyberHarem/fang_arknights', count: 1 }
{ _id: 'focia/private_instagram', count: 1 }
{ _id: 'DeepFoldProtein/CATH_v4.3_S35_processed_512_test', count: 1 }
{ _id: 'Vivek1234321/multi-cloud-train', count: 1 }
{ _id: 'fu1995/shuimo-image-dataset', count: 1 }
{ _id: 'adi-kmt/airoboros-3.2_kn', count: 1 }
{ _id: 'Gbssreejith/death_type_42_dataset', count: 1 }
{ _id: 'GunA-SD/DataX', count: 1 }
{ _id: '0x7194633/persona-data-v1', count: 1 }
{ _id: 'Kalfrin/edset', count: 1 }
{ _id: 'Denilsonic/Samples', count: 1 }
{ _id: 'barolr/text_am-sum', count: 1 }
{ _id: 'sergei202/nexus-function-calling', count: 1 }
{ _id: 'xwjzds/pretrain_repeat_paraphrase', count: 1 }
{ _id: 'chunping-hf/my_audio', count: 1 }
{ _id: 'buzzcraft/ELI5-NO', count: 1 }
{ _id: 'racheltong/VA_test1', count: 1 }
{ _id: 'evanfrick/human_eval', count: 1 }
{ _id: 'Toastmachine/Pinescript-test', count: 1 }
{ _id: 'gowitheflowlab/parallel-pt-nl-pl', count: 1 }
{ _id: 'DeepFoldProtein/SCOP-1.65_processed_512', count: 1 }
{ _id: 'yn01/test_20240109_01', count: 1 }
{ _id: 'MysticMss/EUVOZ', count: 1 }
{ _id: 'Mashengshuaiqi/myfirstdataset', count: 1 }
{ _id: 'manishiitg/en-hi-raw', count: 1 }
{ _id: 'wjwow/FreeMan', count: 1 }
{ _id: 'thanhtlx/test-fix-cmg-time-split', count: 1 }
{ _id: 'yimingzhang/uf_safe_v1', count: 1 }
{ _id: 'WJYBUPT/law_item', count: 1 }
{ _id: 'iwasjohnlennon/JayAraeEssexArchive', count: 1 }
{ _id: 'causalnlp/corr2cause', count: 1 }
{ _id: 'htryj/instruction', count: 1 }
{ _id: 'xwjzds/paraphrase_collections_enhanced', count: 1 }
{ _id: 'cj-mills/cvat-instance-segmentation-toy-dataset', count: 1 }
{ _id: 'Shakib75/cpp-programs', count: 1 }
{ _id: 'Crysiss/lawdataset', count: 1 }
{ _id: 'ayushtues/instaflow_images', count: 1 }
{ _id: 'porkuaranha/joba', count: 1 }
{ _id: 'phamtungthuy/cauhoiphapluat', count: 1 }
{ _id: 'azrai99/data-scientist-jobstreet-dataset', count: 1 }
{ _id: 'mark434/combined', count: 1 }
{ _id: 'jksheth/r_j5', count: 1 }
{ _id: 'reciprocate/pku_safer_dpo_pairs', count: 1 }
{ _id: 'version-control/ds-lib-extract-1m', count: 1 }
{ _id: 'casecrit/2024-indonesian-election', count: 1 }
{ _id: 'SofiaVouzika/test-liver', count: 1 }
{ _id: 'senhorsapo/vanelope', count: 1 }
{ _id: 'AshanGimhana/Testingdata', count: 1 }
{ _id: 'focia/image_shot_dataset', count: 1 }
{ _id: 'VietAI/vi_mednli', count: 1 }
{ _id: 'philschmid/trl-test-instruction', count: 1 }
{ _id: 'feazer/nva-WRSPGR', count: 1 }
{ _id: 'Berzerker/gnhk_ocr_dataset', count: 1 }
{ _id: 'razent/vi_pubmed_small', count: 1 }
{ _id: 'Recag/Rp_C4_50', count: 1 }
severo commented 10 months ago

All of them have a single entry for the first step (dataset-config-names), except one, whose single entry is for dataset-split-names:

db.cachedResponsesBlue.aggregate([
    {$match: {
        dataset: {$in: ['speed1/nattan','wikimedia/wikipedia','bbaw_egyptian','joey234/mmlu-electrical_engineering-neg-prepend-verbal','imdatta0/ultrachat_1k','CyberHarem/fang_arknights','focia/private_instagram','DeepFoldProtein/CATH_v4.3_S35_processed_512_test','Vivek1234321/multi-cloud-train','fu1995/shuimo-image-dataset','adi-kmt/airoboros-3.2_kn','Gbssreejith/death_type_42_dataset','GunA-SD/DataX','0x7194633/persona-data-v1','Kalfrin/edset','Denilsonic/Samples','barolr/text_am-sum','sergei202/nexus-function-calling','xwjzds/pretrain_repeat_paraphrase','chunping-hf/my_audio','buzzcraft/ELI5-NO','racheltong/VA_test1','evanfrick/human_eval','Toastmachine/Pinescript-test','gowitheflowlab/parallel-pt-nl-pl','DeepFoldProtein/SCOP-1.65_processed_512','yn01/test_20240109_01','MysticMss/EUVOZ','Mashengshuaiqi/myfirstdataset','manishiitg/en-hi-raw','wjwow/FreeMan','thanhtlx/test-fix-cmg-time-split','yimingzhang/uf_safe_v1','WJYBUPT/law_item','iwasjohnlennon/JayAraeEssexArchive','causalnlp/corr2cause','htryj/instruction','xwjzds/paraphrase_collections_enhanced','cj-mills/cvat-instance-segmentation-toy-dataset','Shakib75/cpp-programs','Crysiss/lawdataset','ayushtues/instaflow_images','porkuaranha/joba','phamtungthuy/cauhoiphapluat','azrai99/data-scientist-jobstreet-dataset','mark434/combined','jksheth/r_j5','reciprocate/pku_safer_dpo_pairs','version-control/ds-lib-extract-1m','casecrit/2024-indonesian-election','SofiaVouzika/test-liver','senhorsapo/vanelope','AshanGimhana/Testingdata','focia/image_shot_dataset','VietAI/vi_mednli','philschmid/trl-test-instruction','feazer/nva-WRSPGR','Berzerker/gnhk_ocr_dataset','razent/vi_pubmed_small','Recag/Rp_C4_50']}
    }},
    {$group: {
        _id: '$kind',
        count: {$sum: 1}}
    }
])
{ _id: 'dataset-config-names', count: 59 }
{ _id: 'dataset-split-names', count: 1 }

The exception is feazer/nva-WRSPGR:

db.cachedResponsesBlue.find({
    dataset: {$in: ['speed1/nattan','wikimedia/wikipedia','bbaw_egyptian','joey234/mmlu-electrical_engineering-neg-prepend-verbal','imdatta0/ultrachat_1k','CyberHarem/fang_arknights','focia/private_instagram','DeepFoldProtein/CATH_v4.3_S35_processed_512_test','Vivek1234321/multi-cloud-train','fu1995/shuimo-image-dataset','adi-kmt/airoboros-3.2_kn','Gbssreejith/death_type_42_dataset','GunA-SD/DataX','0x7194633/persona-data-v1','Kalfrin/edset','Denilsonic/Samples','barolr/text_am-sum','sergei202/nexus-function-calling','xwjzds/pretrain_repeat_paraphrase','chunping-hf/my_audio','buzzcraft/ELI5-NO','racheltong/VA_test1','evanfrick/human_eval','Toastmachine/Pinescript-test','gowitheflowlab/parallel-pt-nl-pl','DeepFoldProtein/SCOP-1.65_processed_512','yn01/test_20240109_01','MysticMss/EUVOZ','Mashengshuaiqi/myfirstdataset','manishiitg/en-hi-raw','wjwow/FreeMan','thanhtlx/test-fix-cmg-time-split','yimingzhang/uf_safe_v1','WJYBUPT/law_item','iwasjohnlennon/JayAraeEssexArchive','causalnlp/corr2cause','htryj/instruction','xwjzds/paraphrase_collections_enhanced','cj-mills/cvat-instance-segmentation-toy-dataset','Shakib75/cpp-programs','Crysiss/lawdataset','ayushtues/instaflow_images','porkuaranha/joba','phamtungthuy/cauhoiphapluat','azrai99/data-scientist-jobstreet-dataset','mark434/combined','jksheth/r_j5','reciprocate/pku_safer_dpo_pairs','version-control/ds-lib-extract-1m','casecrit/2024-indonesian-election','SofiaVouzika/test-liver','senhorsapo/vanelope','AshanGimhana/Testingdata','focia/image_shot_dataset','VietAI/vi_mednli','philschmid/trl-test-instruction','feazer/nva-WRSPGR','Berzerker/gnhk_ocr_dataset','razent/vi_pubmed_small','Recag/Rp_C4_50']},
    kind: "dataset-split-names"
}, {dataset: 1, updated_at: 1})
{ _id: ObjectId("659d85dc137f88fd4461b89b"),
  dataset: 'feazer/nva-WRSPGR',
  updated_at: 2024-01-09T17:43:56.666Z }
severo commented 10 months ago

The entries were created between 2024-01-09T17:43 and 2024-01-10T20:07, so they are fairly old. Let's refresh all of them and check over the next few days whether the problem reappears.

db.cachedResponsesBlue.aggregate([
    {$match: {
        dataset: {$in: ['speed1/nattan','wikimedia/wikipedia','bbaw_egyptian','joey234/mmlu-electrical_engineering-neg-prepend-verbal','imdatta0/ultrachat_1k','CyberHarem/fang_arknights','focia/private_instagram','DeepFoldProtein/CATH_v4.3_S35_processed_512_test','Vivek1234321/multi-cloud-train','fu1995/shuimo-image-dataset','adi-kmt/airoboros-3.2_kn','Gbssreejith/death_type_42_dataset','GunA-SD/DataX','0x7194633/persona-data-v1','Kalfrin/edset','Denilsonic/Samples','barolr/text_am-sum','sergei202/nexus-function-calling','xwjzds/pretrain_repeat_paraphrase','chunping-hf/my_audio','buzzcraft/ELI5-NO','racheltong/VA_test1','evanfrick/human_eval','Toastmachine/Pinescript-test','gowitheflowlab/parallel-pt-nl-pl','DeepFoldProtein/SCOP-1.65_processed_512','yn01/test_20240109_01','MysticMss/EUVOZ','Mashengshuaiqi/myfirstdataset','manishiitg/en-hi-raw','wjwow/FreeMan','thanhtlx/test-fix-cmg-time-split','yimingzhang/uf_safe_v1','WJYBUPT/law_item','iwasjohnlennon/JayAraeEssexArchive','causalnlp/corr2cause','htryj/instruction','xwjzds/paraphrase_collections_enhanced','cj-mills/cvat-instance-segmentation-toy-dataset','Shakib75/cpp-programs','Crysiss/lawdataset','ayushtues/instaflow_images','porkuaranha/joba','phamtungthuy/cauhoiphapluat','azrai99/data-scientist-jobstreet-dataset','mark434/combined','jksheth/r_j5','reciprocate/pku_safer_dpo_pairs','version-control/ds-lib-extract-1m','casecrit/2024-indonesian-election','SofiaVouzika/test-liver','senhorsapo/vanelope','AshanGimhana/Testingdata','focia/image_shot_dataset','VietAI/vi_mednli','philschmid/trl-test-instruction','feazer/nva-WRSPGR','Berzerker/gnhk_ocr_dataset','razent/vi_pubmed_small','Recag/Rp_C4_50']}
    }},
    {$group: {
        _id: 'dates',
        first: {$min: '$updated_at'},
        last: {$max: '$updated_at'},
    }}
])
{ _id: 'dates',
  first: 2024-01-09T17:43:56.666Z,
  last: 2024-01-10T20:07:49.555Z }
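
As a rough plain-Python equivalent of the `$min` / `$max` grouping above, the sketch below computes the oldest and newest `updated_at` over a list of entries. The `updated_at_range` helper and the sample documents are hypothetical; only the `updated_at` field name comes from the query.

```python
from datetime import datetime, timezone


def updated_at_range(entries):
    """Return (first, last) updated_at among the entries, or None if empty."""
    timestamps = [entry["updated_at"] for entry in entries]
    if not timestamps:
        return None
    return min(timestamps), max(timestamps)


# Hypothetical sample documents; only the updated_at field matters here.
sample_entries = [
    {"dataset": "user/a", "updated_at": datetime(2024, 1, 10, 20, 7, 49, tzinfo=timezone.utc)},
    {"dataset": "user/b", "updated_at": datetime(2024, 1, 9, 17, 43, 56, tzinfo=timezone.utc)},
]

first, last = updated_at_range(sample_entries)
print(first.isoformat(), last.isoformat())
```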
severo commented 10 months ago

Refreshing with:

HF_TOKEN=...
DATASETS=(speed1/nattan wikimedia/wikipedia bbaw_egyptian joey234/mmlu-electrical_engineering-neg-prepend-verbal imdatta0/ultrachat_1k CyberHarem/fang_arknights focia/private_instagram DeepFoldProtein/CATH_v4.3_S35_processed_512_test Vivek1234321/multi-cloud-train fu1995/shuimo-image-dataset adi-kmt/airoboros-3.2_kn Gbssreejith/death_type_42_dataset GunA-SD/DataX 0x7194633/persona-data-v1 Kalfrin/edset Denilsonic/Samples barolr/text_am-sum sergei202/nexus-function-calling xwjzds/pretrain_repeat_paraphrase chunping-hf/my_audio buzzcraft/ELI5-NO racheltong/VA_test1 evanfrick/human_eval Toastmachine/Pinescript-test gowitheflowlab/parallel-pt-nl-pl DeepFoldProtein/SCOP-1.65_processed_512 yn01/test_20240109_01 MysticMss/EUVOZ Mashengshuaiqi/myfirstdataset manishiitg/en-hi-raw wjwow/FreeMan thanhtlx/test-fix-cmg-time-split yimingzhang/uf_safe_v1 WJYBUPT/law_item iwasjohnlennon/JayAraeEssexArchive causalnlp/corr2cause htryj/instruction xwjzds/paraphrase_collections_enhanced cj-mills/cvat-instance-segmentation-toy-dataset Shakib75/cpp-programs Crysiss/lawdataset ayushtues/instaflow_images porkuaranha/joba phamtungthuy/cauhoiphapluat azrai99/data-scientist-jobstreet-dataset mark434/combined jksheth/r_j5 reciprocate/pku_safer_dpo_pairs version-control/ds-lib-extract-1m casecrit/2024-indonesian-election SofiaVouzika/test-liver senhorsapo/vanelope AshanGimhana/Testingdata focia/image_shot_dataset VietAI/vi_mednli philschmid/trl-test-instruction feazer/nva-WRSPGR Berzerker/gnhk_ocr_dataset razent/vi_pubmed_small Recag/Rp_C4_50)
for dataset in "${DATASETS[@]}"; do
    curl -H "Authorization: Bearer $HF_TOKEN" -X POST "https://datasets-server.huggingface.co/admin/force-refresh/dataset-config-names?dataset=$dataset&priority=low"
done
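
The same requests could be prepared from Python. The sketch below only builds the request objects and does not send anything. The endpoint path, the `priority=low` parameter, and the bearer-token header are taken from the loop above; the helper names are hypothetical. Note that `urlencode` percent-encodes the slash in the dataset name (`%2F`), whereas the shell loop sends it raw; this assumes the server decodes query parameters in the standard way.

```python
from urllib.parse import urlencode
from urllib.request import Request

ADMIN_BASE_URL = "https://datasets-server.huggingface.co/admin"


def force_refresh_url(dataset, kind="dataset-config-names", priority="low"):
    """Build the force-refresh URL for one dataset and one processing step."""
    query = urlencode({"dataset": dataset, "priority": priority})
    return f"{ADMIN_BASE_URL}/force-refresh/{kind}?{query}"


def force_refresh_request(dataset, token):
    """Prepare (but do not send) the authenticated POST request."""
    return Request(
        force_refresh_url(dataset),
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


request = force_refresh_request("causalnlp/corr2cause", token="hf_xxx")  # placeholder token
print(request.get_method(), request.full_url)
```

Passing the prepared request to `urllib.request.urlopen` would actually trigger the refresh, so that step is left out on purpose.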
severo commented 10 months ago

It worked for https://huggingface.co/datasets/causalnlp/corr2cause.

Screenshot 2024-01-11 at 10:21:28
severo commented 10 months ago

I ran it again:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
{ _id: 'asakara/b', count: 1 }
{ _id: 'jbilcke-hf/ai-tube-index', count: 1 }
{ _id: 'feazer/nva-WRSPGR', count: 1 }
db.cachedResponsesBlue.aggregate([
    {$match: {
        dataset: {$in: ['asakara/b', 'jbilcke-hf/ai-tube-index', 'feazer/nva-WRSPGR']}
    }},
    {$group: {
        _id: 'dates',
        first: {$min: '$updated_at'},
        last: {$max: '$updated_at'},
    }}
])
{ _id: 'dates',
  first: 2024-01-09T17:43:56.666Z,
  last: 2024-01-11T11:31:35.885Z }

No job for these datasets:

db.jobsBlue.find({dataset: {$in: ['asakara/b', 'jbilcke-hf/ai-tube-index', 'feazer/nva-WRSPGR']}})
severo commented 10 months ago

I tried to recreate them manually (admin UI).

As a result, no more cases are reported at the moment:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
severo commented 10 months ago

Today, no occurrences:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
severo commented 10 months ago

Also reported here: https://huggingface.co/datasets/ayymen/Weblate-Translations/discussions/1

Screenshot 2024-01-15 at 10:58:52
severo commented 10 months ago

Current occurrences:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
{ _id: 'CyberHarem/golden_hind_azurlane', count: 1 }
{ _id: 'kenhktsui/open-toolformer-retrieval-multi-neg-result-new-kw', count: 1 }
{ _id: 'CyberHarem/miyu_edelfelt_fgo', count: 1 }
{ _id: 'cutterd/gelgen_tar_29', count: 1 }
{ _id: 'Leogrin/real-toxicity-prompts_first_5K', count: 1 }
{ _id: 'CyberHarem/ak_47_girlsfrontline', count: 1 }
severo commented 10 months ago

As of today:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
{ _id: 'cutterd/gelgen_tar_29', count: 1 }
{ _id: 'CyberHarem/miyu_edelfelt_fgo', count: 1 }
{ _id: 'CyberHarem/golden_hind_azurlane', count: 1 }
{ _id: 'kenhktsui/open-toolformer-retrieval-multi-neg-result-new-kw', count: 1 }
{ _id: 'Leogrin/real-toxicity-prompts_first_5K', count: 1 }
{ _id: 'Recag/Rg_CommonC_234', count: 1 }
{ _id: 'CyberHarem/roma_kantaicollection', count: 1 }
{ _id: 'CyberHarem/ak_47_girlsfrontline', count: 1 }

There are two new ones (Recag/Rg_CommonC_234 and CyberHarem/roma_kantaicollection), and the existing ones have not been fixed by the backfill cron job.

severo commented 9 months ago

Today:

db.cachedResponsesBlue.aggregate([
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }},
    {$match: {count: 1}}
])
{ _id: 'red_caps', count: 1 }
{ _id: 'Recag/Rp_CommonC_241', count: 1 }
{ _id: 'arbml/alpagasus_cleaned_ar_reviewed_v4', count: 1 }
{ _id: 'anandhuvasudev/guanaco-llama2-1k', count: 1 }
{ _id: 'CyberHarem/ak_47_girlsfrontline', count: 1 }
{ _id: 'hkust-nlp/agentboard', count: 1 }
{ _id: 'anandhuvasudev/southindiandish', count: 1 }
{ _id: '203427as321/articles', count: 1 }
{ _id: 'cdt', count: 1 }
{ _id: 'malucoelhaofc/NathanPortuguese', count: 1 }
{ _id: 'CyberHarem/roma_kantaicollection', count: 1 }
{ _id: 'asgaardlab/GamePhysicsDailyDump', count: 1 }
{ _id: 'GaJoPrograma/datasetVictoriaUNADGenericoDuplicados', count: 1 }
{ _id: 'YANG-Cheng/ab', count: 1 }
{ _id: 'oknerazan/english_sentences', count: 1 }
{ _id: 'Benchmbn/example1', count: 1 }
{ _id: 'Leogrin/real-toxicity-prompts_first_5K', count: 1 }
{ _id: 'DucHaiten/all-in', count: 1 }
{ _id: 'uyentk/thucuc_data', count: 1 }
{ _id: 'openclimatefix/dwd-icon-global', count: 1 }
{ _id: 'giux78/ultrafeedback-binarized-preferences-cleaned-ita-ready', count: 1 }
{ _id: 'jacobbieker/himawari9-kerchunk', count: 1 }
{ _id: 'openclimatefix/eumetsat-iodc', count: 1 }
{ _id: 'cutterd/gelgen_tar_29', count: 1 }
{ _id: 'Recag/Rp_CommonC_355', count: 1 }
{ _id: 'cedr', count: 1 }
{ _id: 'jacobbieker/eumetsat-iodc', count: 1 }
{ _id: 'Recag/Rp_CommonC_520', count: 1 }
{ _id: 'hf-doc-build/doc-build', count: 1 }
{ _id: 'CyberHarem/miyu_edelfelt_fgo', count: 1 }
{ _id: 'CyberHarem/golden_hind_azurlane', count: 1 }
{ _id: 'kenhktsui/open-toolformer-retrieval-multi-neg-result-new-kw', count: 1 }
{ _id: 'x_stance', count: 1 }

33 datasets

However, we currently have a lot of pending jobs, which might be the reason.

Checking if some of them don't have a job (if they have jobs, we only have to wait):

use datasets_server_queue
db.jobsBlue.aggregate([
    {$match: {dataset: {$in: ['red_caps','Recag/Rp_CommonC_241','arbml/alpagasus_cleaned_ar_reviewed_v4','anandhuvasudev/guanaco-llama2-1k','CyberHarem/ak_47_girlsfrontline','hkust-nlp/agentboard','anandhuvasudev/southindiandish','203427as321/articles','cdt','malucoelhaofc/NathanPortuguese','CyberHarem/roma_kantaicollection','asgaardlab/GamePhysicsDailyDump','GaJoPrograma/datasetVictoriaUNADGenericoDuplicados','YANG-Cheng/ab','oknerazan/english_sentences','Benchmbn/example1','Leogrin/real-toxicity-prompts_first_5K','DucHaiten/all-in','uyentk/thucuc_data','openclimatefix/dwd-icon-global','giux78/ultrafeedback-binarized-preferences-cleaned-ita-ready','jacobbieker/himawari9-kerchunk','openclimatefix/eumetsat-iodc','cutterd/gelgen_tar_29','Recag/Rp_CommonC_355','cedr','jacobbieker/eumetsat-iodc','Recag/Rp_CommonC_520','hf-doc-build/doc-build','CyberHarem/miyu_edelfelt_fgo','CyberHarem/golden_hind_azurlane','kenhktsui/open-toolformer-retrieval-multi-neg-result-new-kw','x_stance']}}},
    {$group: {
        _id: "$dataset",
        count: {$sum: 1}
    }}
])
{ _id: 'giux78/ultrafeedback-binarized-preferences-cleaned-ita-ready',
  count: 13 }
{ _id: 'hf-doc-build/doc-build', count: 6 }
{ _id: 'oknerazan/english_sentences', count: 8 }
{ _id: 'uyentk/thucuc_data', count: 41 }
{ _id: 'GaJoPrograma/datasetVictoriaUNADGenericoDuplicados',
  count: 8 }
{ _id: 'cdt', count: 8 }
{ _id: 'Benchmbn/example1', count: 9 }
{ _id: 'openclimatefix/dwd-icon-global', count: 1 }
{ _id: 'Recag/Rp_CommonC_355', count: 3 }
{ _id: 'malucoelhaofc/NathanPortuguese', count: 8 }
{ _id: 'Recag/Rp_CommonC_520', count: 8 }
{ _id: 'cedr', count: 10 }
{ _id: 'hkust-nlp/agentboard', count: 22 }
{ _id: 'DucHaiten/all-in', count: 8 }
{ _id: 'arbml/alpagasus_cleaned_ar_reviewed_v4', count: 8 }
{ _id: 'anandhuvasudev/guanaco-llama2-1k', count: 1 }
{ _id: '203427as321/articles', count: 9 }
{ _id: 'jacobbieker/eumetsat-iodc', count: 6 }
{ _id: 'YANG-Cheng/ab', count: 24 }
{ _id: 'jacobbieker/himawari9-kerchunk', count: 1 }

25 have jobs (a few minutes passed between the two commands, so some datasets might have dropped out of the first list in the meantime). Let's wait until the number of jobs is back to normal: right now it's too hard to distinguish normal cases from problematic ones.
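
The check itself boils down to a set difference between the datasets with a single cache entry and the datasets that have at least one job. A minimal sketch, with hypothetical dataset names and a hypothetical `datasets_without_jobs` helper (the inputs stand for the results of the two aggregations above):

```python
def datasets_without_jobs(single_entry_datasets, job_counts):
    """Return the single-entry datasets that have no job in the queue.

    single_entry_datasets: iterable of dataset names (first aggregation).
    job_counts: mapping of dataset name -> number of jobs (second aggregation).
    """
    return sorted(set(single_entry_datasets) - set(job_counts))


# Hypothetical inputs, for illustration only.
single_entry = ["user/a", "user/b", "user/c"]
jobs = {"user/a": 8, "user/c": 1}

print(datasets_without_jobs(single_entry, jobs))
```

Datasets returned by this helper are the suspicious ones: they have a single cache entry and nothing queued that could compute the remaining steps.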

severo commented 9 months ago

Today:

{ _id: 'CyberHarem/roma_kantaicollection', count: 1 }
{ _id: 'red_caps', count: 1 }
{ _id: 'CyberHarem/ak_47_girlsfrontline', count: 1 }
{ _id: 'Recag/Rp_CommonC_241', count: 1 }
{ _id: 'cutterd/gelgen_tar_29', count: 1 }
{ _id: 'x_stance', count: 1 }
{ _id: 'CyberHarem/golden_hind_azurlane', count: 1 }
{ _id: 'kenhktsui/open-toolformer-retrieval-multi-neg-result-new-kw', count: 1 }
{ _id: 'CyberHarem/miyu_edelfelt_fgo', count: 1 }
{ _id: 'Leogrin/real-toxicity-prompts_first_5K', count: 1 }

They were all computed more than a day ago and have not been backfilled (or deleted) since.

It's not clear why. They don't share any characteristic that could help find a reason.

The last one ('Leogrin/real-toxicity-prompts_first_5K') no longer exists on the Hub, but its cache entry has not been deleted. Maybe it was deleted after the last backfill.

For reference, the last backfill gave:

93387 analyzed datasets (total: 93387 datasets): 3 datasets have been deleted (0.00%), 0 datasets raised an exception (0.00%)

And it processed these datasets apparently without an error:

INFO: 2024-01-18 23:12:29,605 - root - Analyzing cutterd/gelgen_tar_29
DEBUG: 2024-01-18 23:12:29,605 - urllib3.connectionpool - https://huggingface.co:443 "GET /api/datasets/cutterd/gelgen_tar_29 HTTP/1.1" 200 557
INFO: 2024-01-18 23:12:29,617 - root - Setting new revision to cutterd/gelgen_tar_29

Looking at the worker logs: there is no log for cutterd/gelgen_tar_29, and there is no job for it either. So at some point in libcommon.orchestrator.set_revision(), we exited silently.

Possibly, DatasetBackfillPlan does nothing (and might even delete existing jobs) if a dataset has only one entry.

plan = DatasetBackfillPlan(
    dataset=dataset,
    revision=revision,
    priority=priority,
    processing_graph=processing_graph,
    only_first_processing_steps=True,
)
severo commented 9 months ago

Today: 0 occurrences, as expected