severo closed this issue 1 year ago
see also #1366
Could this be validated here? https://github.com/huggingface/datasets-server/blob/main/services/worker/src/worker/job_runners/dataset/config_names.py#L63 (i.e., raise an exception if the dataset has more than N configs and avoid running the child jobs), or should we instead try to process only the first N configs?
Exactly, we should raise an error here
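A minimal sketch of what that check could look like; `DatasetWithTooManyConfigsError` and the `max_number` parameter are hypothetical names, not the actual ones used in the linked job runner:

```python
# Sketch only: DatasetWithTooManyConfigsError and max_number are assumptions,
# not the actual names used in services/worker.
from datasets import get_dataset_config_names


class DatasetWithTooManyConfigsError(Exception):
    """Raised when a dataset has more configs than the service allows."""


def compute_config_names(dataset: str, max_number: int) -> list[str]:
    # Fetch the list of config names from the Hub, as the job runner does.
    config_names = get_dataset_config_names(dataset)
    if len(config_names) > max_number:
        # Fail fast so that no child job is created for any of the configs.
        raise DatasetWithTooManyConfigsError(
            f"{dataset} has {len(config_names)} configs, "
            f"more than the maximum of {max_number}."
        )
    return config_names
```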
Some stats:
> db.cachedResponsesBlue.aggregate([{$match: {kind: "config-parquet-and-info"}}, {$group: {_id: "$dataset", count: {$sum: 1}}}, {$sort: {count: -1}}, {$limit: 20}])
{ _id: 'Muennighoff/flores200', count: 12881 }
{ _id: 'facebook/flores', count: 10719 }
{ _id: 'allenai/nllb', count: 2656 }
{ _id: 'yhavinga/ccmatrix', count: 2394 }
{ _id: 'vialibre/splittedspanish3bwc', count: 1751 }
{ _id: 'nanom/splittedspanish3bwc', count: 1751 }
{ _id: 'red_caps', count: 1742 }
{ _id: 'lmqg/qa_squadshifts_synthetic_random', count: 1500 }
{ _id: 'Helsinki-NLP/tatoeba_mt', count: 824 }
{ _id: 'Zaid/tatoeba_mt', count: 824 }
{ _id: 'bigscience/P3', count: 660 }
{ _id: 'BigScience/P3', count: 660 }
{ _id: 'Muennighoff/multi_eurlex', count: 530 }
{ _id: 'codeparrot/github-code', count: 496 }
{ _id: 'codeparrot/github-code-clean', count: 496 }
{ _id: 'adamlin/daily_dialog', count: 485 }
{ _id: 'thewall/jolma_unique', count: 461 }
{ _id: 'thewall/jolma', count: 461 }
{ _id: 'CodedotAI/code_clippy_github', count: 384 }
{ _id: 'sil-ai/bloom-captioning', count: 376 }
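For reference, the same aggregation as a small pymongo script; the connection string and database name are assumptions about the deployment, and the collection name comes from the shell session above:

```python
from pymongo import MongoClient

# Assumptions: the cache DB is reachable locally and named "cache".
client = MongoClient("mongodb://localhost:27017")
collection = client["cache"]["cachedResponsesBlue"]

pipeline = [
    {"$match": {"kind": "config-parquet-and-info"}},
    {"$group": {"_id": "$dataset", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 20},
]
for row in collection.aggregate(pipeline):
    print(row["_id"], row["count"])
```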
We could set the limit at 3,000.
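A hedged sketch of how that limit could be made configurable; the environment variable name `CONFIG_NAMES_MAX_NUMBER` is an assumption, not the actual setting:

```python
import os

# Assumption: the limit is read from an environment variable, with a
# default of 3,000 matching the value proposed above.
DEFAULT_MAX_NUMBER = 3_000


def get_max_number_of_configs() -> int:
    return int(os.environ.get("CONFIG_NAMES_MAX_NUMBER", DEFAULT_MAX_NUMBER))
```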
The dataset https://huggingface.co/datasets/Muennighoff/flores200 has more than 40,000 configs. That's too much for our infrastructure for now; we should set a limit.