huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

`load_dataset` from Hub with `name` to specify `config` using incorrect builder type when multiple data formats are present #7101

Open hlky opened 1 month ago

hlky commented 1 month ago

Following the documentation, I defined different configs for Dataception, a dataset of datasets:

```yaml
configs:
- config_name: dataception
  data_files:
  - path: dataception.parquet
    split: train
  default: true
- config_name: dataset_5423
  data_files:
  - path: datasets/5423.tar
    split: train
...
- config_name: dataset_721736
  data_files:
  - path: datasets/721736.tar
    split: train
```

The intent was for the metadata to be browsable via the Dataset Viewer, in addition to each individual dataset, and to allow any one dataset to be loaded by passing its config name to `load_dataset`.

While testing `load_dataset` I encountered the following error:

```
>>> dataset = load_dataset("bigdata-pw/Dataception", "dataset_7691")
Downloading readme: 100%|██████████| 467k/467k [00:00<00:00, 1.99MB/s]
Downloading data: 100%|██████████| 71.0M/71.0M [00:02<00:00, 26.8MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "datasets\load.py", line 2145, in load_dataset
    builder_instance.download_and_prepare(
  File "datasets\builder.py", line 1027, in download_and_prepare
    self._download_and_prepare(
  File "datasets\builder.py", line 1100, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "datasets\packaged_modules\parquet\parquet.py", line 58, in _split_generators
    self.info.features = datasets.Features.from_arrow_schema(pq.read_schema(f))
                                                             ^^^^^^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 2325, in read_schema
    file = ParquetFile(
           ^^^^^^^^^^^^
  File "pyarrow\parquet\core.py", line 318, in __init__
    self.reader.open(
  File "pyarrow\_parquet.pyx", line 1470, in pyarrow._parquet.ParquetReader.open
  File "pyarrow\error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
```

The correct file is downloaded, but the wrong builder type (parquet) is detected because of other content in the repository. It would appear that the selected config needs to be taken into account when inferring the builder.
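To illustrate what per-config inference would mean, here is a toy sketch; the names `infer_builder` and `EXTENSION_TO_BUILDER` are mine for illustration and are not the actual `datasets` internals:

```python
import os

# Hypothetical sketch: infer the builder from the *selected config's*
# data_files instead of the repository-wide file listing.
EXTENSION_TO_BUILDER = {".parquet": "parquet", ".tar": "webdataset"}  # simplified mapping

def infer_builder(config_name, configs):
    """Return the builder name for the data files of one config only."""
    exts = {os.path.splitext(path)[1] for path in configs[config_name]["data_files"]}
    if len(exts) != 1:
        raise ValueError(f"mixed extensions in config {config_name!r}: {exts}")
    return EXTENSION_TO_BUILDER[exts.pop()]

# A cut-down version of the configs above: one parquet config, one tar config.
configs = {
    "dataception": {"data_files": ["dataception.parquet"]},
    "dataset_5423": {"data_files": ["datasets/5423.tar"]},
}
print(infer_builder("dataception", configs))   # parquet
print(infer_builder("dataset_5423", configs))  # webdataset
```

With per-config inference, `dataset_5423` would resolve to the WebDataset builder regardless of the parquet file elsewhere in the repo.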

Note that I have removed the additional configs from the repository because of this issue; there is also a limit of 3000 configs, so the Dataset Viewer doesn't work as I intended anyway. I'll add them back if that assists with testing.

hlky commented 4 weeks ago

Having looked into this further, the core of the issue seems to be having two different data formats in the same repo.

When the parquet config comes first, the WebDataset configs are loaded as parquet; when the WebDataset configs come first, the parquet config is loaded as WebDataset.

A workaround in my case would be to convert the parquet into a WebDataset, although I'd still need the Dataset Viewer config limit increased. In other cases, using a single format may not be possible.
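That conversion could be sketched roughly as follows. This is a stdlib-only toy: the helper name `rows_to_webdataset_tar` is mine, and a real conversion would read the rows from the parquet file with pyarrow and choose a per-field extension rather than one JSON member per sample:

```python
import io
import json
import os
import tarfile
import tempfile

def rows_to_webdataset_tar(rows, tar_path):
    """Pack dict rows into a WebDataset-style tar: one `{key}.json` member
    per sample, with zero-padded keys so members stay in order."""
    with tarfile.open(tar_path, "w") as tar:
        for i, row in enumerate(rows):
            payload = json.dumps(row).encode("utf-8")
            info = tarfile.TarInfo(name=f"{i:06d}.json")
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))

# In practice the rows would come from the parquet file, e.g. via
# pyarrow.parquet.read_table(path).to_pylist() (kept stdlib-only here).
rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
tar_path = os.path.join(tempfile.mkdtemp(), "sample.tar")
rows_to_webdataset_tar(rows, tar_path)
with tarfile.open(tar_path) as tar:
    names = tar.getnames()  # ["000000.json", "000001.json"]
```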

Relevant code: