huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

DataFilesNotFoundError for datasets `OpenMol/PubChemSFT` #7292

Closed xnuohz closed 3 days ago

xnuohz commented 5 days ago

Describe the bug

Cannot load the dataset https://huggingface.co/datasets/OpenMol/PubChemSFT

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset('OpenMol/PubChemSFT')

Expected behavior

---------------------------------------------------------------------------
DataFilesNotFoundError                    Traceback (most recent call last)
Cell In[7], line 2
      1 from datasets import load_dataset
----> 2 dataset = load_dataset('OpenMol/PubChemSFT')

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:2587, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2582 verification_mode = VerificationMode(
   2583     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2584 )
   2586 # Create a dataset builder
-> 2587 builder_instance = load_dataset_builder(
   2588     path=path,
   2589     name=name,
   2590     data_dir=data_dir,
   2591     data_files=data_files,
   2592     cache_dir=cache_dir,
   2593     features=features,
   2594     download_config=download_config,
   2595     download_mode=download_mode,
   2596     revision=revision,
   2597     token=token,
   2598     storage_options=storage_options,
   2599     trust_remote_code=trust_remote_code,
   2600     _require_default_config_name=name is None,
   2601     **config_kwargs,
   2602 )
   2604 # Return iterable dataset in case of streaming
   2605 if streaming:

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:2259, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, trust_remote_code, _require_default_config_name, **config_kwargs)
   2257     download_config = download_config.copy() if download_config else DownloadConfig()
   2258     download_config.storage_options.update(storage_options)
-> 2259 dataset_module = dataset_module_factory(
   2260     path,
   2261     revision=revision,
   2262     download_config=download_config,
   2263     download_mode=download_mode,
   2264     data_dir=data_dir,
   2265     data_files=data_files,
   2266     cache_dir=cache_dir,
   2267     trust_remote_code=trust_remote_code,
   2268     _require_default_config_name=_require_default_config_name,
   2269     _require_custom_configs=bool(config_kwargs),
   2270 )
   2271 # Get dataset builder class from the processing script
   2272 builder_kwargs = dataset_module.builder_kwargs

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:1904, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1902     raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
   1903 if isinstance(e1, (DataFilesNotFoundError, DatasetNotFoundError, EmptyDatasetError)):
-> 1904     raise e1 from None
   1905 if isinstance(e1, FileNotFoundError):
   1906     raise FileNotFoundError(
   1907         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1908         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1909     ) from None

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:1885, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, cache_dir, trust_remote_code, _require_default_config_name, _require_custom_configs, **download_kwargs)
   1876         return HubDatasetModuleFactoryWithScript(
   1877             path,
   1878             revision=revision,
   (...)
   1882             trust_remote_code=trust_remote_code,
   1883         ).get_module()
   1884     else:
-> 1885         return HubDatasetModuleFactoryWithoutScript(
   1886             path,
   1887             revision=revision,
   1888             data_dir=data_dir,
   1889             data_files=data_files,
   1890             download_config=download_config,
   1891             download_mode=download_mode,
   1892         ).get_module()
   1893 except Exception as e1:
   1894     # All the attempts failed, before raising the error we should check if the module is already cached
   1895     try:

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:1270, in HubDatasetModuleFactoryWithoutScript.get_module(self)
   1263     patterns = get_data_patterns(base_path, download_config=self.download_config)
   1264 data_files = DataFilesDict.from_patterns(
   1265     patterns,
   1266     base_path=base_path,
   1267     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1268     download_config=self.download_config,
   1269 )
-> 1270 module_name, default_builder_kwargs = infer_module_for_data_files(
   1271     data_files=data_files,
   1272     path=self.name,
   1273     download_config=self.download_config,
   1274 )
   1275 data_files = data_files.filter_extensions(_MODULE_TO_EXTENSIONS[module_name])
   1276 # Collect metadata files if the module supports them

File ~/Softwares/anaconda3/envs/pyg-dev/lib/python3.9/site-packages/datasets/load.py:597, in infer_module_for_data_files(data_files, path, download_config)
    595     raise ValueError(f"Couldn't infer the same data file format for all splits. Got {split_modules}")
    596 if not module_name:
--> 597     raise DataFilesNotFoundError("No (supported) data files found" + (f" in {path}" if path else ""))
    598 return module_name, default_builder_kwargs

DataFilesNotFoundError: No (supported) data files found in OpenMol/PubChemSFT
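
The last frame shows that `infer_module_for_data_files` raises when no loader module can be inferred from the data files it was given. The plain-Python sketch below illustrates that idea only; it is not the library's implementation. The extension table is a small illustrative subset, the helper name `infer_module` is hypothetical, and the file listings are made up rather than taken from the repository. It also skips the earlier step in which file patterns are resolved into per-split data files.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Iterable, Optional

# Illustrative subset only; the real library maintains its own, larger table.
EXTENSION_TO_MODULE = {
    ".csv": "csv",
    ".json": "json",
    ".jsonl": "json",
    ".parquet": "parquet",
    ".arrow": "arrow",
    ".txt": "text",
}


def infer_module(filenames: Iterable[str]) -> Optional[str]:
    """Return the most common known format among the files, or None."""
    suffixes = (PurePosixPath(name).suffix.lower() for name in filenames)
    counts = Counter(
        EXTENSION_TO_MODULE[suffix]
        for suffix in suffixes
        if suffix in EXTENSION_TO_MODULE
    )
    if not counts:
        # No file has a known extension: this is the situation in which
        # the real function raises DataFilesNotFoundError.
        return None
    return counts.most_common(1)[0][0]


# Hypothetical listings, not the actual contents of the repository.
pickle_only = ["train.pkl", "test.pkl", "valid.pkl"]
arrow_files = ["stage1/data-00000-of-00001.arrow"]

print(infer_module(pickle_only))
print(infer_module(arrow_files))
```

With only files whose extensions are missing from the table, nothing can be inferred, which corresponds to the "No (supported) data files found" message above.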

Environment info

- `datasets` version: 3.1.0
- Platform: Linux-5.15.0-125-generic-x86_64-with-glibc2.31
- Python version: 3.9.18
- `huggingface_hub` version: 0.25.2
- PyArrow version: 18.0.0
- Pandas version: 2.0.3
- `fsspec` version: 2023.9.2

lhoestq commented 3 days ago

Hi! If the dataset owner uses push_to_hub() instead of calling save_to_disk() and uploading the local files, it will fix the issue. Right now, datasets sees the train/test/valid pickle files, but pickle is not a supported file format.
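
For the dataset owner, the suggested fix could look roughly like the sketch below. It assumes the local directory was written with save_to_disk(). The function name `republish_saved_dataset`, the local path, and the target repo id are placeholders, and pushing requires being logged in to the Hub with write access to that repository.

```python
def republish_saved_dataset(local_dir: str, repo_id: str) -> None:
    """Reload a save_to_disk() directory and push it to the Hub."""
    # Imported lazily so the sketch can be read without `datasets` installed.
    from datasets import load_from_disk

    # Returns a Dataset or a DatasetDict, depending on what was saved.
    dataset = load_from_disk(local_dir)

    # push_to_hub() uploads the data as Parquet files, a format that
    # load_dataset() can read directly from the Hub.
    dataset.push_to_hub(repo_id)


# Example call with placeholder values (not run here):
# republish_saved_dataset("path/to/saved_dataset", "your-username/your-dataset")
```

After such a push, `load_dataset(repo_id)` should work without extra arguments, since the repository then contains files in a supported format.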

lhoestq commented 3 days ago

Alternatively you can load the arrow file instead:

from datasets import load_dataset
dataset = load_dataset('OpenMol/PubChemSFT', data_files='stage1/*.arrow')
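
To see what a pattern such as `stage1/*.arrow` selects, the sketch below approximates the matching with `pathlib` from the standard library. This is an assumption for illustration: `datasets` resolves patterns with its own globbing logic, which may differ in edge cases, and the file listing is hypothetical rather than the real repository contents.

```python
from pathlib import PurePosixPath
from typing import List


def select_files(filenames: List[str], pattern: str) -> List[str]:
    """Keep the files whose path matches the glob pattern."""
    return [name for name in filenames if PurePosixPath(name).match(pattern)]


# Hypothetical repository listing.
repo_files = [
    "train.pkl",
    "test.pkl",
    "valid.pkl",
    "stage1/data-00000-of-00001.arrow",
    "stage1/dataset_info.json",
    "stage1/state.json",
]

selected = select_files(repo_files, "stage1/*.arrow")
print(selected)
```

Only the Arrow file under `stage1/` is selected, so the pickle files at the top level no longer take part in format inference.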

xnuohz commented 3 days ago

Thanks! I'll have a try.