huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

DataFilesNotFoundError for datasets in the open-llm-leaderboard #6866

Closed jerome-white closed 2 weeks ago

jerome-white commented 4 weeks ago

Describe the bug

When trying to get config names or load any dataset within the open-llm-leaderboard ecosystem (open-llm-leaderboard/details_) I get a DataFilesNotFoundError. For the last month or so I've been loading datasets from the leaderboard almost every day; yesterday was the first time I saw this error.

Steps to reproduce the bug

This snippet has three cells:

  1. Loads the modules
  2. Tries to get config names
  3. Tries to load the dataset

I've chosen "davidkim205"'s Rhea-72b-v0.5 model because it is one of the best performers on the leaderboard and so should be unlikely to have dataset issues:

In [1]: from datasets import load_dataset, get_dataset_config_names

In [2]: get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")
---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[2], line 1
----> 1 get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/inspect.py:347, in get_dataset_config_names(path, revision, download_config, download_mode, dynamic_modules_path, data_files, **download_kwargs)
    291 def get_dataset_config_names(
    292     path: str,
    293     revision: Optional[Union[str, Version]] = None,
   (...)
    298     **download_kwargs,
    299 ):
    300     """Get the list of available config names for a particular dataset.
    301 
    302     Args:
   (...)
    345     ```
    346     """
--> 347     dataset_module = dataset_module_factory(
    348         path,
    349         revision=revision,
    350         download_config=download_config,
    351         download_mode=download_mode,
    352         dynamic_modules_path=dynamic_modules_path,
    353         data_files=data_files,
    354         **download_kwargs,
    355     )
    356     builder_cls = get_dataset_builder_class(dataset_module, dataset_name=os.path.basename(path))
    357     return list(builder_cls.builder_configs.keys()) or [
    358         dataset_module.builder_kwargs.get("config_name", builder_cls.DEFAULT_CONFIG_NAME or "default")
    359     ]

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1821, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1812     return LocalDatasetModuleFactoryWithScript(
   1813         combined_path,
   1814         download_mode=download_mode,
   1815         dynamic_modules_path=dynamic_modules_path,
   1816         trust_remote_code=trust_remote_code,
   1817     ).get_module()
   1818 elif os.path.isdir(path):
   1819     return LocalDatasetModuleFactoryWithoutScript(
   1820         path, data_dir=data_dir, data_files=data_files, download_mode=download_mode
-> 1821     ).get_module()
   1822 # Try remotely
   1823 elif is_relative_path(path) and path.count("/") <= 1:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1039, in LocalDatasetModuleFactoryWithoutScript.get_module(self)
   1033     patterns = get_data_patterns(base_path)
   1034 data_files = DataFilesDict.from_patterns(
   1035     patterns,
   1036     base_path=base_path,
   1037     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1038 )
-> 1039 module_name, default_builder_kwargs = infer_module_for_data_files(
   1040     data_files=data_files,
   1041     path=self.path,
   1042 )
   1043 data_files = data_files.filter_extensions(_MODULE_TO_EXTENSIONS[module_name])
   1044 # Collect metadata files if the module supports them

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:597, in infer_module_for_data_files(data_files, path, download_config)
    595     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    596 if not module_name:
--> 597     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    598 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5

In [3]: data = load_dataset("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5", "harness_winogrande_5")
---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[3], line 1
----> 1 data = load_dataset("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5", "harness_winogrande_5")

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:2587, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2582 verification_mode = VerificationMode(
   2583     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2584 )
   2586 # Create a dataset builder
-> 2587 builder_instance = load_dataset_builder(
   2588     path=path,
   2589     name=name,
   2590     data_dir=data_dir,
   2591     data_files=data_files,
   2592     cache_dir=cache_dir,
   2593     features=features,
   2594     download_config=download_config,
   2595     download_mode=download_mode,
   2596     revision=revision,
   2597     token=token,
   2598     storage_options=storage_options,
   2599     trust_remote_code=trust_remote_code,
   2600     _require_default_config_name=name is None,
   2601     **config_kwargs,
   2602 )
   2604 # Return iterable dataset in case of streaming
   2605 if streaming:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:2259, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   2257     download_config = download_config.copy() if download_config else DownloadConfig()
   2258     download_config.storage_options.update(storage_options)
-> 2259 dataset_module = dataset_module_factory(
   2260     path,
   2261     revision=revision,
   2262     download_config=download_config,
   2263     download_mode=download_mode,
   2264     data_dir=data_dir,
   2265     data_files=data_files,
   2266     cache_dir=cache_dir,
   2267     trust_remote_code=trust_remote_code,
   2268     _require_default_config_name=_require_default_config_name,
   2269     _require_custom_configs=bool(config_kwargs),
   2270 )
   2271 # Get dataset builder class from the processing script
   2272 builder_kwargs = dataset_module.builder_kwargs

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1821, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1812     return LocalDatasetModuleFactoryWithScript(
   1813         combined_path,
   1814         download_mode=download_mode,
   1815         dynamic_modules_path=dynamic_modules_path,
   1816         trust_remote_code=trust_remote_code,
   1817     ).get_module()
   1818 elif os.path.isdir(path):
   1819     return LocalDatasetModuleFactoryWithoutScript(
   1820         path, data_dir=data_dir, data_files=data_files, download_mode=download_mode
-> 1821     ).get_module()
   1822 # Try remotely
   1823 elif is_relative_path(path) and path.count("/") <= 1:

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:1039, in LocalDatasetModuleFactoryWithoutScript.get_module(self)
   1033     patterns = get_data_patterns(base_path)
   1034 data_files = DataFilesDict.from_patterns(
   1035     patterns,
   1036     base_path=base_path,
   1037     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1038 )
-> 1039 module_name, default_builder_kwargs = infer_module_for_data_files(
   1040     data_files=data_files,
   1041     path=self.path,
   1042 )
   1043 data_files = data_files.filter_extensions(_MODULE_TO_EXTENSIONS[module_name])
   1044 # Collect metadata files if the module supports them

File ~/open-llm-bda/venv/lib/python3.11/site-packages/datasets/load.py:597, in infer_module_for_data_files(data_files, path, download_config)
    595     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    596 if not module_name:
--> 597     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    598 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5
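
Both tracebacks go through the `elif os.path.isdir(path)` branch of dataset_module_factory and end up in LocalDatasetModuleFactoryWithoutScript, so the library treated the repo id as a local directory rather than trying the Hub. That suggests a directory with the same relative path may exist under the current working directory. This is a reading of the traceback, not a confirmed diagnosis. A minimal sketch of a check, where `shadowing_local_dir` is a hypothetical helper and not part of datasets:

```python
import os
import tempfile


def shadowing_local_dir(repo_id, cwd=None):
    """Return the local directory that matches repo_id as a relative path, or None.

    Hypothetical helper: it mirrors the os.path.isdir(path) test visible in the
    traceback, resolved against cwd (defaults to the current working directory).
    """
    base = cwd if cwd is not None else os.getcwd()
    candidate = os.path.join(base, repo_id)
    return candidate if os.path.isdir(candidate) else None


repo_id = "open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5"

# Demonstrate on a throwaway directory so nothing in the real cwd is touched.
with tempfile.TemporaryDirectory() as tmp:
    before = shadowing_local_dir(repo_id, cwd=tmp)  # nothing there yet
    os.makedirs(os.path.join(tmp, repo_id))  # simulate a stray local copy
    after = shadowing_local_dir(repo_id, cwd=tmp)  # now the directory is found
    print(before, after is not None)
```

Calling `shadowing_local_dir(repo_id)` with no `cwd` from the same directory as the failing session would show whether such a directory is present there.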

Expected behavior

No exceptions from get_dataset_config_names or load_dataset

Environment info

jerome-white commented 4 weeks ago

Potentially related:

albertvillanova commented 3 weeks ago

Hi @jerome-white, thanks for reporting.

However, I cannot reproduce your issue:

>>> from datasets import get_dataset_config_names

>>> get_dataset_config_names("open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5")
['harness_arc_challenge_25',
 'harness_gsm8k_5',
 'harness_hellaswag_10',
 'harness_hendrycksTest_5',
 'harness_hendrycksTest_abstract_algebra_5',
 'harness_hendrycksTest_anatomy_5',
 'harness_hendrycksTest_astronomy_5',
 'harness_hendrycksTest_business_ethics_5',
 'harness_hendrycksTest_clinical_knowledge_5',
 'harness_hendrycksTest_college_biology_5',
 'harness_hendrycksTest_college_chemistry_5',
 'harness_hendrycksTest_college_computer_science_5',
 'harness_hendrycksTest_college_mathematics_5',
 'harness_hendrycksTest_college_medicine_5',
 'harness_hendrycksTest_college_physics_5',
 'harness_hendrycksTest_computer_security_5',
 'harness_hendrycksTest_conceptual_physics_5',
 'harness_hendrycksTest_econometrics_5',
 'harness_hendrycksTest_electrical_engineering_5',
 'harness_hendrycksTest_elementary_mathematics_5',
 'harness_hendrycksTest_formal_logic_5',
 'harness_hendrycksTest_global_facts_5',
 'harness_hendrycksTest_high_school_biology_5',
 'harness_hendrycksTest_high_school_chemistry_5',
 'harness_hendrycksTest_high_school_computer_science_5',
 'harness_hendrycksTest_high_school_european_history_5',
 'harness_hendrycksTest_high_school_geography_5',
 'harness_hendrycksTest_high_school_government_and_politics_5',
 'harness_hendrycksTest_high_school_macroeconomics_5',
 'harness_hendrycksTest_high_school_mathematics_5',
 'harness_hendrycksTest_high_school_microeconomics_5',
 'harness_hendrycksTest_high_school_physics_5',
 'harness_hendrycksTest_high_school_psychology_5',
 'harness_hendrycksTest_high_school_statistics_5',
 'harness_hendrycksTest_high_school_us_history_5',
 'harness_hendrycksTest_high_school_world_history_5',
 'harness_hendrycksTest_human_aging_5',
 'harness_hendrycksTest_human_sexuality_5',
 'harness_hendrycksTest_international_law_5',
 'harness_hendrycksTest_jurisprudence_5',
 'harness_hendrycksTest_logical_fallacies_5',
 'harness_hendrycksTest_machine_learning_5',
 'harness_hendrycksTest_management_5',
 'harness_hendrycksTest_marketing_5',
 'harness_hendrycksTest_medical_genetics_5',
 'harness_hendrycksTest_miscellaneous_5',
 'harness_hendrycksTest_moral_disputes_5',
 'harness_hendrycksTest_moral_scenarios_5',
 'harness_hendrycksTest_nutrition_5',
 'harness_hendrycksTest_philosophy_5',
 'harness_hendrycksTest_prehistory_5',
 'harness_hendrycksTest_professional_accounting_5',
 'harness_hendrycksTest_professional_law_5',
 'harness_hendrycksTest_professional_medicine_5',
 'harness_hendrycksTest_professional_psychology_5',
 'harness_hendrycksTest_public_relations_5',
 'harness_hendrycksTest_security_studies_5',
 'harness_hendrycksTest_sociology_5',
 'harness_hendrycksTest_us_foreign_policy_5',
 'harness_hendrycksTest_virology_5',
 'harness_hendrycksTest_world_religions_5',
 'harness_truthfulqa_mc_0',
 'harness_winogrande_5',
 'results']

Maybe it was just a temporary issue...

jerome-white commented 2 weeks ago

> Maybe it was just a temporary issue...

Perhaps. I've changed my workflow to use the hub's HfFileSystem, so for now this is no longer a blocker for me. I'll reopen the issue if that changes.
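
The thread does not show the HfFileSystem workflow itself, so the following is only a rough sketch of what such a workaround might look like. It assumes the repo stores its data as parquet files, which is not confirmed here; `parquet_pattern` and `list_parquet_files` are hypothetical helper names. Dataset repos are addressed under the `datasets/` prefix in HfFileSystem paths.

```python
def parquet_pattern(repo_id):
    """Build a glob pattern covering every parquet file in a Hub dataset repo."""
    # HfFileSystem addresses dataset repos as "datasets/<repo_id>/...".
    return f"datasets/{repo_id}/**/*.parquet"


def list_parquet_files(repo_id):
    """List parquet files in the repo via the Hub file system (needs network access)."""
    # Imported lazily so the pure helper above works without huggingface_hub installed.
    from huggingface_hub import HfFileSystem

    fs = HfFileSystem()
    return sorted(fs.glob(parquet_pattern(repo_id)))


if __name__ == "__main__":
    repo_id = "open-llm-leaderboard/details_davidkim205__Rhea-72b-v0.5"
    print(parquet_pattern(repo_id))
    # Uncomment to query the Hub:
    # for path in list_parquet_files(repo_id):
    #     print(path)
```

Listing files this way does not go through dataset_module_factory, so it does not hit the local-directory check seen in the traceback.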