Open lhoestq opened 5 months ago
The error is caused by malformed basenames of the files within the TARs: 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png is split at its first dot, so 15_Cohen_1-s2 becomes the grouping __key__ and 0-S0929664620300449-gr3_lrg-b.png becomes the additional key to be added to the example. The expected result was 15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b as the grouping __key__ and png as the additional key to be added to the example.
To get the expected behavior, the basenames of the files within the TARs should be fixed so that they contain only a single dot, the one separating the file extension.
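The split can be reproduced in a few lines of Python. This is a minimal sketch of the naming rule only, not the code used in datasets; the helper name split_webdataset_name is hypothetical.

```python
def split_webdataset_name(path):
    """Split a TAR member path into (key, field) at the FIRST dot of the basename.

    Hypothetical helper illustrating the naming rule; not the datasets implementation.
    """
    dirname, _, basename = path.rpartition("/")
    prefix, _, field = basename.partition(".")
    key = f"{dirname}/{prefix}" if dirname else prefix
    return key, field


# The file name from this issue contains an extra dot, so the split lands too early.
key, field = split_webdataset_name("15_Cohen_1-s2.0-S0929664620300449-gr3_lrg-b.png")
print(key)    # 15_Cohen_1-s2
print(field)  # 0-S0929664620300449-gr3_lrg-b.png
```

Renaming the file so that its only dot is the one before png would yield the full basename as the key and png as the field.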
I'm reopening this because I think we should try to give a clearer error message with a specific error code.
For now, it's hard for the user to understand where the error comes from (not everybody knows the subtleties of the webdataset filename structure).
(we can transfer it to https://github.com/huggingface/dataset-viewer if it fits better there)
same with .jpg -> https://huggingface.co/datasets/ProGamerGov/synthetic-dataset-1m-dalle3-high-quality-captions
Error code: DatasetGenerationError
Exception: DatasetGenerationError
Message: An error occurred while generating the dataset
Traceback: Traceback (most recent call last):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1748, in _prepare_split_single
    for key, record in generator:
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 818, in wrapped
    for item in generator(*args, **kwargs):
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/packaged_modules/webdataset/webdataset.py", line 109, in _generate_examples
    example[field_name] = {"path": example["__key__"] + "." + field_name, "bytes": example[field_name]}
KeyError: 'jpg'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 1316, in compute_config_parquet_and_info_response
    parquet_operations, partial = stream_convert_to_parquet(
  File "/src/services/worker/src/worker/job_runners/config/parquet_and_info.py", line 909, in stream_convert_to_parquet
    builder._prepare_split(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1627, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/src/services/worker/.venv/lib/python3.9/site-packages/datasets/builder.py", line 1784, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
More details in the spec (https://docs.google.com/document/d/18OdLjruFNX74ILmgrdiCI9J1fQZuhzzRBCHV9URWto0/edit#heading=h.hkptaq2kct2s)
The prefix of a file is all directory components of the file plus the file name component up to the first “.” in the file name. The last extension (i.e., the portion after the last “.”) in a file name determines the file type.
Example:
images17/image194.left.jpg
images17/image194.right.jpg
images17/image194.json
images17/image12.left.jpg
images17/image12.json
images17/image12.right.jpg
images3/image1459.left.jpg
…
When reading this with a WebDataset library, you would get the following two dictionaries back in sequence:
{ "__key__": "images17/image194", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
{ "__key__": "images17/image12", "left.jpg": b"...", "right.jpg": b"...", "json": b"..."}
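The grouping in the spec excerpt can be sketched as follows. This is an illustration under two assumptions: files that share a prefix are stored consecutively in the archive, and the input is a list of (path, bytes) pairs rather than a real TAR. The names split_name and group_examples are hypothetical and do not come from the webdataset or datasets libraries.

```python
from itertools import groupby


def split_name(path):
    """Return (prefix, field): split at the first dot of the file name component."""
    dirname, _, basename = path.rpartition("/")
    prefix, _, field = basename.partition(".")
    return (f"{dirname}/{prefix}" if dirname else prefix), field


def group_examples(members):
    """Group consecutive (path, data) pairs that share a prefix into one dict each."""
    for key, group in groupby(members, key=lambda member: split_name(member[0])[0]):
        example = {"__key__": key}
        for path, data in group:
            example[split_name(path)[1]] = data
        yield example


# The first six files from the spec example, with placeholder contents.
members = [
    ("images17/image194.left.jpg", b"..."),
    ("images17/image194.right.jpg", b"..."),
    ("images17/image194.json", b"..."),
    ("images17/image12.left.jpg", b"..."),
    ("images17/image12.json", b"..."),
    ("images17/image12.right.jpg", b"..."),
]
for example in group_examples(members):
    print(example["__key__"], sorted(k for k in example if k != "__key__"))
```

Both resulting dictionaries carry the same three fields (json, left.jpg, right.jpg), which is what a loader that expects a fixed set of fields per example relies on.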
OK, the issue is different in the latter case: some files are suffixed as .jpeg, and others as .jpg :)
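A mismatch like this can be spotted before loading by comparing each example's fields against the expected ones. This is a minimal sketch, assuming examples are plain dicts shaped like those in the spec excerpt and that the expected fields are known up front; find_missing_fields is a hypothetical helper, not part of datasets or webdataset.

```python
def find_missing_fields(examples, expected_fields):
    """Return (key, missing_fields) for every example lacking an expected field."""
    problems = []
    for example in examples:
        missing = sorted(set(expected_fields) - set(example))
        if missing:
            problems.append((example["__key__"], missing))
    return problems


# Two illustrative examples: the second uses .jpeg where .jpg is expected.
examples = [
    {"__key__": "a", "jpg": b"..."},
    {"__key__": "b", "jpeg": b"..."},
]
for key, missing in find_missing_fields(examples, ["jpg"]):
    print(f"example {key!r} is missing field(s): {missing}")
```

A check of this kind could also back a more specific error message than a bare KeyError.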
Is it a limitation of the webdataset format, or of the datasets library @lhoestq? And could we give a clearer error?
reported at https://huggingface.co/datasets/tbone5563/tar_images/discussions/1