huggingface / datasets

πŸ€— The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

prebuilt dataset relies on `downloads/extracted` #5457

Open stas00 opened 1 year ago

stas00 commented 1 year ago

Describe the bug

I pre-built the dataset:

python -c 'import sys; from datasets import load_dataset; ds=load_dataset(sys.argv[1])' HuggingFaceM4/general-pmd-synthetic-testing

and it can be used just fine.

Now I wipe out `downloads/extracted` and it no longer works:

rm -r ~/.cache/huggingface/datasets/downloads

That is, I can still load it:

python -c 'import sys; from datasets import load_dataset; ds=load_dataset(sys.argv[1])' HuggingFaceM4/general-pmd-synthetic-testing
No config specified, defaulting to: general-pmd-synthetic-testing/100.unique
Found cached dataset general-pmd-synthetic-testing (/home/stas/.cache/huggingface/datasets/HuggingFaceM4___general-pmd-synthetic-testing/100.unique/1.1.1/86bc445e3e48cb5ef79de109eb4e54ff85b318cd55c3835c4ee8f86eae33d9d2)

But if I try to use it, it fails:

E               stderr: Traceback (most recent call last):
E               stderr:   File "/mnt/nvme0/code/huggingface/m4-master-6/m4/training/main.py", line 116, in <module>
E               stderr:     train_loader, val_loader = get_dataloaders(
E               stderr:   File "/mnt/nvme0/code/huggingface/m4-master-6/m4/training/dataset.py", line 170, in get_dataloaders
E               stderr:     train_loader = get_dataloader_from_config(
E               stderr:   File "/mnt/nvme0/code/huggingface/m4-master-6/m4/training/dataset.py", line 443, in get_dataloader_from_config
E               stderr:     dataloader = get_dataloader(
E               stderr:   File "/mnt/nvme0/code/huggingface/m4-master-6/m4/training/dataset.py", line 264, in get_dataloader
E               stderr:     is_pmd = "meta" in hf_dataset[0] and "source" in hf_dataset[0]
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 2601, in __getitem__
E               stderr:     return self._getitem(
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/arrow_dataset.py", line 2586, in _getitem
E               stderr:     formatted_output = format_table(
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/formatting/formatting.py", line 634, in format_table
E               stderr:     return formatter(pa_table, query_type=query_type)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/formatting/formatting.py", line 406, in __call__
E               stderr:     return self.format_row(pa_table)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/formatting/formatting.py", line 442, in format_row
E               stderr:     row = self.python_features_decoder.decode_row(row)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/formatting/formatting.py", line 225, in decode_row
E               stderr:     return self.features.decode_example(row) if self.features else row
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/features.py", line 1846, in decode_example
E               stderr:     return {
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/features.py", line 1847, in <dictcomp>
E               stderr:     column_name: decode_nested_example(feature, value, token_per_repo_id=token_per_repo_id)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/features.py", line 1304, in decode_nested_example
E               stderr:     return decode_nested_example([schema.feature], obj)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/features.py", line 1296, in decode_nested_example
E               stderr:     if decode_nested_example(sub_schema, first_elmt) != first_elmt:
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/features.py", line 1309, in decode_nested_example
E               stderr:     return schema.decode_example(obj, token_per_repo_id=token_per_repo_id)
E               stderr:   File "/mnt/nvme0/code/huggingface/datasets-master/src/datasets/features/image.py", line 144, in decode_example
E               stderr:     image = PIL.Image.open(path)
E               stderr:   File "/home/stas/anaconda3/envs/py38-pt113/lib/python3.8/site-packages/PIL/Image.py", line 3092, in open
E               stderr:     fp = builtins.open(filename, "rb")
E               stderr: FileNotFoundError: [Errno 2] No such file or directory: '/mnt/nvme0/code/data/cache/huggingface/datasets/downloads/extracted/134227b9b94c4eccf19b205bf3021d4492d0227b9be6c2ddb6bf517d8d55a8cb/data/101/images_01.jpg'

Only if I wipe out the cached dataset dir and rebuild does it start working again, since `downloads/extracted` is re-created with the extracted files:

rm -r ~/.cache/huggingface/datasets/HuggingFaceM4___general-pmd-synthetic-testing
python -c 'import sys; from datasets import load_dataset; ds=load_dataset(sys.argv[1])' HuggingFaceM4/general-pmd-synthetic-testing

I think there are 2 issues here:

  1. Why does it still rely on the extracted files after the Arrow files were generated - did I do something incorrectly when creating this dataset?
  2. Why doesn't the dataset know that it has been gutted, and why does it still load just fine? If it depends on `downloads/extracted`, then `load_dataset` should check that it's there and either fail or force a rebuild. I'm sure that could be a very expensive operation, so really solving item 1 probably removes the need for this check, and this second item is probably overkill - other than perhaps an optional `check_consistency` flag to do it on demand (see the sketch after this list).
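
As a rough illustration of what such a check could look like (a minimal sketch, not an existing datasets API; the "image" column name and the "train" split are assumptions - adjust to the dataset's actual schema):

import os
from datasets import Image, load_dataset

ds = load_dataset("HuggingFaceM4/general-pmd-synthetic-testing", split="train")
# With decoding disabled, an Image column yields {"bytes": ..., "path": ...};
# a row whose "bytes" is None depends on an external file that must still exist.
ds = ds.cast_column("image", Image(decode=False))
missing = [
    ex["image"]["path"]
    for ex in ds
    if ex["image"]["bytes"] is None and not os.path.exists(ex["image"]["path"])
]
print(f"{len(missing)} referenced files are missing from disk")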

Environment info

datasets@main

mariosasko commented 1 year ago

Hi!

This issue is due to our audio/image datasets not being self-contained. This allows us to save disk space (files are written only once) but also leads to issues like this one. We plan to make all our datasets self-contained in Datasets 3.0.

In the meantime, you can run the following map to ensure your dataset is self-contained:

from datasets.table import embed_table_storage

# dset = load_dataset(...)
dset = dset.with_format("arrow")
dset = dset.map(embed_table_storage, batched=True)  # map is not in-place
dset = dset.with_format("python")
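
To make the fix persistent, a minimal follow-up sketch (repo id, split name, and output path are illustrative assumptions): after embedding, save the self-contained copy with `save_to_disk` so it keeps working even after `downloads/extracted` is wiped:

from datasets import load_dataset, load_from_disk
from datasets.table import embed_table_storage

dset = load_dataset("HuggingFaceM4/general-pmd-synthetic-testing", split="train")  # split name assumed
dset = dset.with_format("arrow")
dset = dset.map(embed_table_storage, batched=True)  # pulls external file bytes into the table
dset = dset.with_format("python")
dset.save_to_disk("general-pmd-self-contained")  # writes the now-embedded bytes
reloaded = load_from_disk("general-pmd-self-contained")  # no dependency on downloads/
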
stas00 commented 1 year ago

Understood. Thank you, Mario.

Perhaps the solution could be very simple - move the extracted files into the directory of the cached dataset? That would make it self-contained already and wouldn't require waiting for a new major release. Unless I'm missing some back-compat nuance.

But regardless: if X relies on Y, it could check whether Y is still there when loading X - not a full consistency check, just the top-level directory it relies on.

deanAirre commented 2 weeks ago

Hello,

I'm also facing a problem with a prebuilt dataset that relies on the same directory:

.cache\huggingface\datasets\downloads\extracted\b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa\ADEChallengeData2016\annotations\training\ADE_train_00000023.png

The images exist, but the training function somehow cannot reach them. Is this related to the same problem?
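
One quick check (a minimal diagnostic sketch; the path is reassembled from the error message and the tree below) is whether Python can reach the exact file:

from pathlib import Path

p = Path.home() / (
    ".cache/huggingface/datasets/downloads/extracted/"
    "b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa/"
    "ADEChallengeData2016/annotations/training/ADE_train_00000023.png"
)
print(p, "->", "exists" if p.exists() else "MISSING")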

Currently the directory tree looks like this:

> (hf-pretrain38) C:\Users\Len\.cache\huggingface>tree
> Folder PATH listing
> C:.
> β”œβ”€β”€β”€datasets
> β”‚   β”œβ”€β”€β”€downloads
> β”‚   β”‚   └───extracted
> β”‚   β”‚       β”œβ”€β”€β”€64c6a0967481dbc192dceabeac06c02b47b992a106357d49e1916dfcdc23a2ea
> β”‚   β”‚       β”‚   └───release_test
> β”‚   β”‚       β”‚       └───testing
> β”‚   β”‚       └───b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa
> β”‚   β”‚           └───ADEChallengeData2016
> β”‚   β”‚               β”œβ”€β”€β”€annotations
> β”‚   β”‚               β”‚   β”œβ”€β”€β”€training
> β”‚   β”‚               β”‚   └───validation
> β”‚   β”‚               └───images
> β”‚   β”‚                   β”œβ”€β”€β”€training
> β”‚   β”‚                   └───validation
> β”‚   β”œβ”€β”€β”€parquet
> β”‚   β”‚   └───yelp_review_full-66f1f8c8d1a2da02
> β”‚   β”‚       └───0.0.0
> β”‚   β”‚           └───14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7
> β”‚   └───scene_parse_150
> β”‚       └───scene_parsing
> β”‚           └───1.0.0
> β”‚               └───d998c54e1b5c5bad12b4d2ec7e1a5f74eee4c153bc1b089a0001677ae9b3fd75
> β”œβ”€β”€β”€evaluate
> β”‚   └───downloads
> β”œβ”€β”€β”€hub
> β”‚   β”œβ”€β”€β”€.locks
> β”‚   β”‚   β”œβ”€β”€β”€datasets--scene_parse_150
> β”‚   β”‚   β”œβ”€β”€β”€models--facebook--mask2former-swin-large-cityscapes-instance
> β”‚   β”‚   β”œβ”€β”€β”€models--facebook--mask2former-swin-large-cityscapes-panoptic
> β”‚   β”‚   β”œβ”€β”€β”€models--nvidia--mit-b0
> β”‚   β”‚   └───models--nvidia--segformer-b1-finetuned-cityscapes-1024-1024
> β”‚   β”œβ”€β”€β”€datasets--huggingface--label-files
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───9462154cba99c3c7f569d3b4f1ba26614afd558c
> β”‚   β”œβ”€β”€β”€datasets--scene_parse_150
> β”‚   β”‚   β”œβ”€β”€β”€.no_exist
> β”‚   β”‚   β”‚   └───ac1c0c0e23875e74cd77aca0fd725fd6a35c3667
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───ac1c0c0e23875e74cd77aca0fd725fd6a35c3667
> β”‚   β”œβ”€β”€β”€models--bert-base-cased
> β”‚   β”‚   β”œβ”€β”€β”€.no_exist
> β”‚   β”‚   β”‚   └───cd5ef92a9fb2f889e972770a36d4ed042daf221e
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───cd5ef92a9fb2f889e972770a36d4ed042daf221e
> β”‚   β”œβ”€β”€β”€models--bert-case-cased
> β”‚   β”œβ”€β”€β”€models--facebook--detr-resnet-50-panoptic
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───d53b52a799403a8867920f82c869e40732b47037
> β”‚   β”œβ”€β”€β”€models--facebook--mask2former-swin-base-coco-panoptic
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───8351ef9576a965d65196da91a5015dcaf6c6b5d2
> β”‚   β”œβ”€β”€β”€models--facebook--mask2former-swin-large-cityscapes-instance
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───70fed72d02a138560da931a1c6a2dcfbb56cd2ff
> β”‚   β”œβ”€β”€β”€models--facebook--mask2former-swin-large-cityscapes-panoptic
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───544d76fe93971ee046dacae19b6d4f6ecb5d9088
> β”‚   β”œβ”€β”€β”€models--google_bert--bert-base-cased
> β”‚   β”œβ”€β”€β”€models--nvidia--mit-b0
> β”‚   β”‚   β”œβ”€β”€β”€.no_exist
> β”‚   β”‚   β”‚   └───80983a413c30d36a39c20203974ae7807835e2b4
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   β”‚   └───refs
> β”‚   β”‚   β”‚       └───pr
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       β”œβ”€β”€β”€25ce79d97e6d9d509ed12e17cb2eb89b0a83a2dc
> β”‚   β”‚       └───80983a413c30d36a39c20203974ae7807835e2b4
> β”‚   β”œβ”€β”€β”€models--nvidia--segformer-b0-finetuned-cityscapes-768-768
> β”‚   β”‚   β”œβ”€β”€β”€blobs
> β”‚   β”‚   β”œβ”€β”€β”€refs
> β”‚   β”‚   └───snapshots
> β”‚   β”‚       └───d3b7801ed329668d5bff04cd33365fa37f538c3b
> β”‚   └───models--nvidia--segformer-b1-finetuned-cityscapes-1024-1024
> β”‚       β”œβ”€β”€β”€.no_exist
> β”‚       β”‚   └───ec86afeba68e656629ccf47e0c8d2902f964917b
> β”‚       β”œβ”€β”€β”€blobs
> β”‚       β”œβ”€β”€β”€refs
> β”‚       β”‚   └───refs
> β”‚       β”‚       └───pr
> β”‚       └───snapshots
> β”‚           β”œβ”€β”€β”€ad2bb0101129289844ea62577e6a22adc2752004
> β”‚           └───ec86afeba68e656629ccf47e0c8d2902f964917b
> β”œβ”€β”€β”€metrics
> β”‚   └───mean_io_u
> β”‚       └───default
> └───modules
>     β”œβ”€β”€β”€datasets_modules
>     β”‚   β”œβ”€β”€β”€datasets
>     β”‚   β”‚   β”œβ”€β”€β”€scene_parse_150
>     β”‚   β”‚   β”‚   β”œβ”€β”€β”€d998c54e1b5c5bad12b4d2ec7e1a5f74eee4c153bc1b089a0001677ae9b3fd75
>     β”‚   β”‚   β”‚   β”‚   └───__pycache__
>     β”‚   β”‚   β”‚   └───__pycache__
>     β”‚   β”‚   └───__pycache__
>     β”‚   └───__pycache__
>     └───evaluate_modules
>         β”œβ”€β”€β”€metrics
>         β”‚   β”œβ”€β”€β”€evaluate-metric--mean_iou
>         β”‚   β”‚   β”œβ”€β”€β”€9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0
>         β”‚   β”‚   β”‚   └───__pycache__
>         β”‚   β”‚   └───__pycache__
>         β”‚   └───__pycache__
>         └───__pycache__

I would appreciate some help and am happy to provide further details. Thanks in advance!