huggingface / datasets

πŸ€— The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Unable to load AutoTrain-generated dataset from the hub #5627

Open ijmiller2 opened 1 year ago

ijmiller2 commented 1 year ago

Describe the bug

DatasetGenerationError: An error occurred while generating the dataset -> ValueError: Couldn't cast ... because column names don't match

ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string
_fingerprint: string
_format_columns: list<item: string>
  child 0, item: string
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_output_all_columns: bool
_split: null
to
{'citation': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'features': {'image': {'_type': Value(dtype='string', id=None)}, 'target': {'names': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'splits': {'train': {'name': Value(dtype='string', id=None), 'num_bytes': Value(dtype='int64', id=None), 'num_examples': Value(dtype='int64', id=None), 'dataset_name': Value(dtype='null', id=None)}}}
because column names don't match

Steps to reproduce the bug

Steps to reproduce:

  1. pip install datasets==2.10.1
  2. Attempt to load the dataset (it's private). Note that I'm authenticated via huggingface-cli login:
from datasets import load_dataset

# load dataset
dataset = "ijmiller2/autotrain-data-betterbin-vision-10000"
dataset = load_dataset(dataset)
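
For completeness, load_dataset also accepts a use_auth_token argument (it's visible in the load_dataset signature in the traceback below), so the same call can be made with the token passed explicitly; this is just to rule out an authentication problem, since the cast error below is about the file schema rather than credentials:

# same call, but passing the login token explicitly
# (True reuses the token saved by `huggingface-cli login`)
dataset = load_dataset(
    "ijmiller2/autotrain-data-betterbin-vision-10000",
    use_auth_token=True,
)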

Here's the full traceback:

Downloading data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 2383.80it/s]
Extracting data files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 505.95it/s]

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/builder.py:1874, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1868     writer = writer_class(
   1869         features=writer._features,
   1870         path=fpath.replace("SSSSS", f"{shard_id:05d}").replace("JJJJJ", f"{job_id:05d}"),
   1871         storage_options=self._fs.storage_options,
   1872         embed_local_files=embed_local_files,
   1873     )
-> 1874 writer.write_table(table)
   1875 num_examples_progress_update += len(table)

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/arrow_writer.py:568, in ArrowWriter.write_table(self, pa_table, writer_batch_size)
    567 pa_table = pa_table.combine_chunks()
--> 568 pa_table = table_cast(pa_table, self._schema)
    569 if self.embed_local_files:

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/table.py:2312, in table_cast(table, schema)
   2311 if table.schema != schema:
-> 2312     return cast_table_to_schema(table, schema)
   2313 elif table.schema.metadata != schema.metadata:

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/table.py:2270, in cast_table_to_schema(table, schema)
   2269 if sorted(table.column_names) != sorted(features):
-> 2270     raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
   2271 arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]

ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string
_fingerprint: string
_format_columns: list<item: string>
  child 0, item: string
_format_kwargs: struct<>
_format_type: null
_indexes: struct<>
_output_all_columns: bool
_split: null
to
{'citation': Value(dtype='string', id=None), 'description': Value(dtype='string', id=None), 'features': {'image': {'_type': Value(dtype='string', id=None)}, 'target': {'names': Sequence(feature=Value(dtype='string', id=None), length=-1, id=None), '_type': Value(dtype='string', id=None)}}, 'homepage': Value(dtype='string', id=None), 'license': Value(dtype='string', id=None), 'splits': {'train': {'name': Value(dtype='string', id=None), 'num_bytes': Value(dtype='int64', id=None), 'num_examples': Value(dtype='int64', id=None), 'dataset_name': Value(dtype='null', id=None)}}}
because column names don't match

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
Input In [8], in <cell line: 6>()
      4 # load dataset
      5 dataset = "ijmiller2/autotrain-data-betterbin-vision-10000"
----> 6 dataset = load_dataset(dataset)

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/load.py:1782, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1779 try_from_hf_gcs = path not in _PACKAGED_DATASETS_MODULES
   1781 # Download and prepare data
-> 1782 builder_instance.download_and_prepare(
   1783     download_config=download_config,
   1784     download_mode=download_mode,
   1785     verification_mode=verification_mode,
   1786     try_from_hf_gcs=try_from_hf_gcs,
   1787     num_proc=num_proc,
   1788 )
   1790 # Build dataset for splits
   1791 keep_in_memory = (
   1792     keep_in_memory if keep_in_memory is not None else is_small_dataset(builder_instance.info.dataset_size)
   1793 )

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/builder.py:872, in DatasetBuilder.download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    870     if num_proc is not None:
    871         prepare_split_kwargs["num_proc"] = num_proc
--> 872     self._download_and_prepare(
    873         dl_manager=dl_manager,
    874         verification_mode=verification_mode,
    875         **prepare_split_kwargs,
    876         **download_and_prepare_kwargs,
    877     )
    878 # Sync info
    879 self.info.dataset_size = sum(split.num_bytes for split in self.info.splits.values())

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/builder.py:967, in DatasetBuilder._download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    963 split_dict.add(split_generator.split_info)
    965 try:
    966     # Prepare split will record examples associated to the split
--> 967     self._prepare_split(split_generator, **prepare_split_kwargs)
    968 except OSError as e:
    969     raise OSError(
    970         "Cannot find data file. "
    971         + (self.manual_download_instructions or "")
    972         + "\nOriginal error:\n"
    973         + str(e)
    974     ) from None

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/builder.py:1749, in ArrowBasedBuilder._prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1747 job_id = 0
   1748 with pbar:
-> 1749     for job_id, done, content in self._prepare_split_single(
   1750         gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1751     ):
   1752         if done:
   1753             result = content

File ~/anaconda3/envs/betterbin/lib/python3.8/site-packages/datasets/builder.py:1892, in ArrowBasedBuilder._prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1890     if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1891         e = e.__context__
-> 1892     raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1894 yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

Expected behavior

I'm ultimately trying to generate my own performance metrics on validation data (before putting an endpoint into production), so I was hoping to load the whole dataset, or at least the validation subset, from the hub.

I'm expecting the load_dataset() function to work as shown in the documentation here:

dataset = load_dataset(
  "lhoestq/custom_squad",
  revision="main"  # tag name, or branch name, or commit hash
)
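
Ideally I'd also be able to pull just the validation subset via the split argument (the split name here is a guess at what AutoTrain generates, not something I've confirmed):

# load only the validation split (split name assumed)
validation = load_dataset(
  "ijmiller2/autotrain-data-betterbin-vision-10000",
  split="validation",
)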

Environment info

lhoestq commented 1 year ago

The AutoTrain format is not supported right now. I think it would require a dedicated dataset builder

ijmiller2 commented 1 year ago

Okay, good to know, thanks for the reply. For now I'll just manage the split manually before training, since I can't find any way to pull the file indices or file names out of the autogenerated split. Just FYI for anyone else this might be relevant to: the file names field of the image dataset (loaded directly from the arrow file) is missing.
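
For reference, the schema in the error (_data_files, _fingerprint, _format_columns, ...) looks like a save_to_disk() state file rather than actual data columns, so something along these lines might work as a manual workaround (untested sketch; it assumes the repo root is a save_to_disk() layout):

from huggingface_hub import snapshot_download
from datasets import load_from_disk

# download the raw dataset repo locally (repo_type="dataset" is required,
# otherwise the hub looks for a model repo with this name)
local_dir = snapshot_download(
    repo_id="ijmiller2/autotrain-data-betterbin-vision-10000",
    repo_type="dataset",
)

# load it as an on-disk dataset instead of going through load_dataset();
# this assumes the files were written by save_to_disk()
dataset = load_from_disk(local_dir)
print(dataset)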
