Webdataset: KeyError: 'png' on some datasets when streaming

lhoestq commented 5 months ago

Reported at https://huggingface.co/datasets/tbone5563/tar_images/discussions/1

>>> from datasets import load_dataset
>>> ds = load_dataset("tbone5563/tar_images")
Downloading data: 100% 1.41G/1.41G [00:48<00:00, 17.2MB/s]
Downloading data: 100% 619M/619M [00:11<00:00, 57.4MB/s]
Generating train split: 970/0 [00:02<00:00, 534.94 examples/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1747                 _time = time.time()
-> 1748                 for key, record in generator:
   1749                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

7 frames
[/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/webdataset/webdataset.py](https://localhost:8080/#) in _generate_examples(self, tar_paths, tar_iterators)
    108                 for field_name in image_field_names + audio_field_names:
--> 109                     example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
    110                 yield f"{tar_idx}_{example_idx}", example

KeyError: 'png'

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
[<ipython-input-2-8e0fbb7badc9>](https://localhost:8080/#) in <cell line: 3>()
      1 from datasets import load_dataset
      2 
----> 3 ds = load_dataset("tbone5563/tar_images")

[/usr/local/lib/python3.10/dist-packages/datasets/load.py](https://localhost:8080/#) in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2607 
   2608     # Download and prepare data
-> 2609     builder_instance.download_and_prepare(
   2610         download_config=download_config,
   2611         download_mode=download_mode,

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1025                         if num_proc is not None:
   1026                             prepare_split_kwargs["num_proc"] = num_proc
-> 1027                         self._download_and_prepare(
   1028                             dl_manager=dl_manager,
   1029                             verification_mode=verification_mode,

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs)
   1787 
   1788     def _download_and_prepare(self, dl_manager, verification_mode, **prepare_splits_kwargs):
-> 1789         super()._download_and_prepare(
   1790             dl_manager,
   1791             verification_mode,

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1120             try:
   1121                 # Prepare split will record examples associated to the split
-> 1122                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1123             except OSError as e:
   1124                 raise OSError(

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split(self, split_generator, check_duplicate_keys, file_format, num_proc, max_shard_size)
   1625             job_id = 0
   1626             with pbar:
-> 1627                 for job_id, done, content in self._prepare_split_single(
   1628                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1629                 ):

[/usr/local/lib/python3.10/dist-packages/datasets/builder.py](https://localhost:8080/#) in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, split_info, check_duplicate_keys, job_id)
   1782             if isinstance(e, SchemaInferenceError) and e.__context__ is not None:
   1783                 e = e.__context__
-> 1784             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1785 
   1786         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

albertvillanova commented 4 months ago

The error is caused by malformed basenames of the files within the TARs:

The file name 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png is split at its first dot, so 15_Cohen_1-s2 becomes the grouping __key__ and 0-S0929664620300449-gr3_lrg-b.png becomes the additional key added to the example.
The intended behavior was to use 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b as the grouping __key__ and png as the additional key added to the example.

To get the expected behavior, the basenames of the files within the TARs should be fixed so that they only contain a single dot, the one separating the file extension.
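A minimal sketch of the splitting rule described above, assuming the key ends at the first dot of the basename. The helper names `split_webdataset_name` and `sanitize_basename` are hypothetical and are not part of the `datasets` API; the sanitizer is only one possible way to rename files before building the TARs.

```python
import posixpath


def split_webdataset_name(path):
    """Split a TAR member path into (key, field) at the first dot of the basename."""
    dirname, basename = posixpath.split(path)
    stem, _, field = basename.partition(".")
    key = posixpath.join(dirname, stem) if dirname else stem
    return key, field


def sanitize_basename(basename, replacement="_"):
    """Hypothetical fix: keep only the last dot, replace the others.

    Assumption: the file has a single-part extension such as "png".
    Multi-part fields like "left.jpg" would be flattened by this helper.
    """
    stem, dot, ext = basename.rpartition(".")
    if not dot:
        return basename  # no extension at all, nothing to fix
    return stem.replace(".", replacement) + "." + ext


name = "15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png"
print(split_webdataset_name(name))
print(split_webdataset_name(sanitize_basename(name)))
```

With the original name, the field comes out as `0-S0929664620300449-gr3_lrg-b.png` rather than `png`, which is why the lookup `example["png"]` fails. After renaming, the field is `png`.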

severo commented 4 months ago

I'm reopening this because I think we should try to give a clearer error message with a specific error code.

For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the webdataset filename structure).

(we can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there)

severo commented 4 months ago

Same with .jpg: https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions

Error code:   DatasetGenerationError
Exception:    DatasetGenerationError
Message:      An error occurred while generating the dataset
Traceback:    Traceback (most recent call last):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
                  for key, record in generator:
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
                  for item in generator(*args, **kwargs):
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
                  example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
              KeyError: 'jpg'

              The above exception was the direct cause of the following exception:

              Traceback (most recent call last):
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
                  parquet_operations, partial = stream_convert_to_parquet(
                File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
                  builder._prepare_split(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
                  for job_id, done, content in self._prepare_split_single(
                File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
                  raise DatasetGenerationError("An error occurred while generating the dataset") from e
              datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

severo commented 4 months ago

More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)

The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name. The last extension (i.e., the portion after the last “.”) in a file name determines the file type.

Example:

    images17/image194.left.jpg
    images17/image194.right.jpg
    images17/image194.json
    images17/image12.left.jpg
    images17/image12.json
    images17/image12.right.jpg
    images3/image1459.left.jpg
    …

When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:

    { "__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
    { "__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..." }
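The grouping rule quoted from the spec can be sketched roughly as follows. This is a simplified illustration, not the code of any WebDataset library: `group_examples` is a hypothetical name, and it assumes that files belonging to the same example are stored consecutively in the TAR.

```python
import posixpath


def split_name(path):
    """Prefix = directories + basename up to the first dot; the rest is the field."""
    dirname, basename = posixpath.split(path)
    stem, _, field = basename.partition(".")
    key = posixpath.join(dirname, stem) if dirname else stem
    return key, field


def group_examples(members):
    """Group consecutive (path, bytes) pairs that share a prefix into one dict."""
    current = None
    for path, data in members:
        key, field = split_name(path)
        if current is None or current["__key__"] != key:
            if current is not None:
                yield current
            current = {"__key__": key}
        current[field] = data
    if current is not None:
        yield current


# The first six files of the spec example, with placeholder contents.
members = [
    ("images17/image194.left.jpg", b"..."),
    ("images17/image194.right.jpg", b"..."),
    ("images17/image194.json", b"..."),
    ("images17/image12.left.jpg", b"..."),
    ("images17/image12.json", b"..."),
    ("images17/image12.right.jpg", b"..."),
]
for example in group_examples(members):
    print(example["__key__"], sorted(k for k in example if k != "__key__"))
```

Both examples end up with the fields `json`, `left.jpg` and `right.jpg`, matching the two dictionaries in the spec.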

severo commented 4 months ago

OK, the issue is different in the latter case: some files are suffixed as .jpeg, and others as .jpg :)

Is this a limitation of the webdataset format, or of the datasets library, @lhoestq? And could we give a clearer error?
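As a rough idea of what a clearer error could look like, here is a sketch of a check that reports which example lacks an expected field and which fields it has instead. The function `check_example_fields` and the message wording are hypothetical, not the behavior of `datasets`; the sketch assumes the expected field names are already known, for instance from earlier examples.

```python
def check_example_fields(example, expected_fields):
    """Raise a descriptive error when an example lacks one of the expected fields."""
    present = sorted(k for k in example if k != "__key__")
    for field_name in expected_fields:
        if field_name not in example:
            raise ValueError(
                f"Example {example['__key__']!r} has no field {field_name!r}; "
                f"found fields: {present}. All examples are expected to share the "
                "same fields, and the field is everything after the first dot "
                "of the file name."
            )


# A file suffixed .jpeg in a dataset where .jpg is expected.
example = {"__key__": "images/0001", "jpeg": b"..."}
try:
    check_example_fields(example, ["jpg"])
except ValueError as err:
    print(err)
```

A message like this covers both cases in this thread: extra dots in the basename, and mixed .jpg/.jpeg suffixes.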

huggingface / datasets

Webdataset: KeyError: 'png' on some datasets when streaming #6880