stas00 opened this issue 1 year ago
Hi!
This issue is due to our audio/image datasets not being self-contained. This allows us to save disk space (files are written only once) but also leads to issues like this one. We plan to make all our datasets self-contained in Datasets 3.0.
In the meantime, you can run the following map to ensure your dataset is self-contained:
from datasets.table import embed_table_storage
# load_dataset ...
dset = dset.with_format("arrow")
# embed the bytes of the external image/audio files into the arrow data, then switch back
dset = dset.map(embed_table_storage, batched=True)
dset = dset.with_format("python")
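For a fuller picture, here is a minimal end-to-end sketch of the same workaround (the dataset name and output path are placeholders, not something from this thread):

```python
from datasets import load_dataset
from datasets.table import embed_table_storage

# placeholder dataset name - substitute the dataset you actually built
dset = load_dataset("scene_parse_150", split="train")

# switch to arrow format so map() passes pyarrow tables to the function,
# embed the external image/audio bytes into the table, then switch back
dset = dset.with_format("arrow")
dset = dset.map(embed_table_storage, batched=True)
dset = dset.with_format("python")

# persist the now self-contained copy so it no longer depends on downloads/extracted
dset.save_to_disk("scene_parse_150_self_contained")
```

The copy written by `save_to_disk` should then carry the image bytes inside its arrow files, so deleting `downloads/extracted` no longer breaks it.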
Understood. Thank you, Mario.
Perhaps the solution could be very simple - move the extracted files into the directory of the cached dataset? That would make it self-contained already and wouldn't require waiting for a new major release. Unless I'm missing some back-compat nuance.
But regardless, if X relies on Y, it could check whether Y is still there when loading X - so not checking full consistency, just the top-level directory it relies on.
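As a rough illustration of the kind of check such a flag could perform, here is a hypothetical user-side helper (not an existing `datasets` API; the column name and sample count are assumptions):

```python
import os
from datasets import Image

def check_image_paths(dset, column="image", num_samples=10):
    """Spot-check that the external files an Image column points to still exist."""
    # decode=False makes examples return {"bytes": ..., "path": ...} instead of a PIL image
    undecoded = dset.cast_column(column, Image(decode=False))
    for example in undecoded.select(range(min(num_samples, len(undecoded)))):
        path = example[column]["path"]
        if path is not None and not os.path.exists(path):
            raise FileNotFoundError(f"{column!r} references a missing file: {path}")
```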
Hello,
I also face a similar problem with a prebuilt dataset that relies on the same directory:
.cache\huggingface\datasets\downloads\extracted\b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa\ADEChallengeData2016\annotations\training\ADE_train_00000023.png
The images exist but the training function somehow cannot reach them. Is this also related to the same problem?
Currently the directory tree looks like this:
> (hf-pretrain38) C:\Users\Len\.cache\huggingface>tree
> Folder PATH listing
> C:.
> ├───datasets
> │   ├───downloads
> │   │   └───extracted
> │   │       ├───64c6a0967481dbc192dceabeac06c02b47b992a106357d49e1916dfcdc23a2ea
> │   │       │   └───release_test
> │   │       │       └───testing
> │   │       └───b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa
> │   │           └───ADEChallengeData2016
> │   │               ├───annotations
> │   │               │   ├───training
> │   │               │   └───validation
> │   │               └───images
> │   │                   ├───training
> │   │                   └───validation
> │   ├───parquet
> │   │   └───yelp_review_full-66f1f8c8d1a2da02
> │   │       └───0.0.0
> │   │           └───14a00e99c0d15a23649d0db8944380ac81082d4b021f398733dd84f3a6c569a7
> │   └───scene_parse_150
> │       └───scene_parsing
> │           └───1.0.0
> │               └───d998c54e1b5c5bad12b4d2ec7e1a5f74eee4c153bc1b089a0001677ae9b3fd75
> ├───evaluate
> │   └───downloads
> ├───hub
> │   ├───.locks
> │   │   ├───datasets--scene_parse_150
> │   │   ├───models--facebook--mask2former-swin-large-cityscapes-instance
> │   │   ├───models--facebook--mask2former-swin-large-cityscapes-panoptic
> │   │   ├───models--nvidia--mit-b0
> │   │   └───models--nvidia--segformer-b1-finetuned-cityscapes-1024-1024
> │   ├───datasets--huggingface--label-files
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───9462154cba99c3c7f569d3b4f1ba26614afd558c
> │   ├───datasets--scene_parse_150
> │   │   ├───.no_exist
> │   │   │   └───ac1c0c0e23875e74cd77aca0fd725fd6a35c3667
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───ac1c0c0e23875e74cd77aca0fd725fd6a35c3667
> │   ├───models--bert-base-cased
> │   │   ├───.no_exist
> │   │   │   └───cd5ef92a9fb2f889e972770a36d4ed042daf221e
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───cd5ef92a9fb2f889e972770a36d4ed042daf221e
> │   ├───models--bert-case-cased
> │   ├───models--facebook--detr-resnet-50-panoptic
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───d53b52a799403a8867920f82c869e40732b47037
> │   ├───models--facebook--mask2former-swin-base-coco-panoptic
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───8351ef9576a965d65196da91a5015dcaf6c6b5d2
> │   ├───models--facebook--mask2former-swin-large-cityscapes-instance
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───70fed72d02a138560da931a1c6a2dcfbb56cd2ff
> │   ├───models--facebook--mask2former-swin-large-cityscapes-panoptic
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───544d76fe93971ee046dacae19b6d4f6ecb5d9088
> │   ├───models--google_bert--bert-base-cased
> │   ├───models--nvidia--mit-b0
> │   │   ├───.no_exist
> │   │   │   └───80983a413c30d36a39c20203974ae7807835e2b4
> │   │   ├───blobs
> │   │   ├───refs
> │   │   │   └───refs
> │   │   │       └───pr
> │   │   └───snapshots
> │   │       ├───25ce79d97e6d9d509ed12e17cb2eb89b0a83a2dc
> │   │       └───80983a413c30d36a39c20203974ae7807835e2b4
> │   ├───models--nvidia--segformer-b0-finetuned-cityscapes-768-768
> │   │   ├───blobs
> │   │   ├───refs
> │   │   └───snapshots
> │   │       └───d3b7801ed329668d5bff04cd33365fa37f538c3b
> │   └───models--nvidia--segformer-b1-finetuned-cityscapes-1024-1024
> │       ├───.no_exist
> │       │   └───ec86afeba68e656629ccf47e0c8d2902f964917b
> │       ├───blobs
> │       ├───refs
> │       │   └───refs
> │       │       └───pr
> │       └───snapshots
> │           ├───ad2bb0101129289844ea62577e6a22adc2752004
> │           └───ec86afeba68e656629ccf47e0c8d2902f964917b
> ├───metrics
> │   └───mean_io_u
> │       └───default
> └───modules
>     ├───datasets_modules
>     │   ├───datasets
>     │   │   ├───scene_parse_150
>     │   │   │   ├───d998c54e1b5c5bad12b4d2ec7e1a5f74eee4c153bc1b089a0001677ae9b3fd75
>     │   │   │   │   └───__pycache__
>     │   │   │   └───__pycache__
>     │   │   └───__pycache__
>     │   └───__pycache__
>     ├───evaluate_modules
>     │   ├───metrics
>     │   │   ├───evaluate-metric--mean_iou
>     │   │   │   └───9e450724f21f05592bfb0255fe2fa576df8171fa060d11121d8aecfff0db80d0
>     │   │   │       └───__pycache__
>     │   │   └───__pycache__
>     │   └───__pycache__
>     └───__pycache__
I would appreciate some help and can provide further details. Thanks in advance!
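As a quick sanity check, here is a minimal sketch (using the path from the error above, with the `C:\Users\Len` prefix taken from the `tree` output) to confirm the file is actually present and readable outside of the training code:

```python
import os
from PIL import Image

# path copied from the error above; raw string so the backslashes stay as-is
path = r"C:\Users\Len\.cache\huggingface\datasets\downloads\extracted\b557ce52f22c65030869d849d199d7b3fd5af18b335143729c717d29f6221baa\ADEChallengeData2016\annotations\training\ADE_train_00000023.png"

print(os.path.exists(path))      # False would mean downloads/extracted was cleaned up
if os.path.exists(path):
    Image.open(path).verify()    # raises if the file exists but is unreadable or corrupted
```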
Describe the bug
I pre-built the dataset and it can be used just fine.
Now I wipe out downloads/extracted and it no longer works. That is, I can still load it, but if I try to use it, it fails because the examples still point to files under downloads/extracted that are no longer there.
Only if I wipe out the cached dir and rebuild does it start working, as downloads/extracted is back again with the extracted files.
I think there are 2 issues here:
1. Why does the dataset still rely on downloads/extracted after the arrow files were written - did I do something incorrectly when creating this dataset?
2. If the dataset relies on downloads/extracted, then load_dataset should check that it's there and fail or force rebuilding. I am sure this could be a very expensive operation, so probably really solving item 1 will remove the need for this check, and this second item is probably overkill - other than perhaps an optional check_consistency flag to do that.
Environment info
datasets@main
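Putting the description above into a concrete shape, a minimal sketch of the failure mode (the dataset name and default cache location are assumptions, not the exact commands used here):

```python
import os
import shutil
from datasets import load_dataset

# build the dataset once: this writes the arrow cache plus downloads/extracted
ds = load_dataset("scene_parse_150", split="train")

# wipe only the extracted files, leaving the arrow cache in place
extracted = os.path.expanduser("~/.cache/huggingface/datasets/downloads/extracted")
shutil.rmtree(extracted)

# loading from the arrow cache still succeeds ...
ds = load_dataset("scene_parse_150", split="train")

# ... but decoding an example fails, because the Image feature stores a path
# into downloads/extracted rather than the image bytes themselves
ds[0]["image"]
```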