wanghaoyucn opened this issue 4 months ago
When you pass `streaming=True`, the cache is ignored. The remote data URL is used instead and the data is streamed from the remote server.
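For example (with a placeholder dataset name):

```python
from datasets import load_dataset

# streaming=True returns an IterableDataset: nothing is written to the local
# Arrow cache, and examples are read on the fly from the remote files.
ids = load_dataset("some-org/some-dataset", split="train", streaming=True)
print(next(iter(ids)))
```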
Thanks for your reply! So is there any solution to get my expected behavior besides cloning the whole repo? Or could I adjust my script to load the downloaded arrow files and generate the dataset in a streaming way?
Describe the bug
We built a dataset which contains several hdf5 files and wrote a script using `h5py` to generate the dataset. The hdf5 files are large, and the processed dataset cache takes even more disk space, so we hoped to try a streaming iterable dataset. Unfortunately, `h5py` can't convert a remote URL into an hdf5 file descriptor, so we use `fsspec` as an interface, like below:

Steps to reproduce the bug
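Our real script is longer, but a minimal sketch of the idea looks like this (the URL, features, hdf5 keys, and script path below are placeholders, not the real ones):

```python
import fsspec
import h5py
import datasets

# Placeholder URL and hdf5 dataset keys; the real script loops over several files.
_URL = "https://example.com/data/train.h5"


class MyH5Dataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "data": datasets.Sequence(datasets.Value("float32")),
                    "label": datasets.Value("int64"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        return [
            datasets.SplitGenerator(name=datasets.Split.TRAIN, gen_kwargs={"url": _URL})
        ]

    def _generate_examples(self, url):
        # h5py cannot open a URL directly, so open it through fsspec first
        # and hand the resulting file-like object to h5py.File.
        with fsspec.open(url, "rb") as f:
            with h5py.File(f, "r") as h5:
                for idx in range(len(h5["label"])):
                    yield idx, {
                        "data": h5["data"][idx],
                        "label": int(h5["label"][idx]),
                    }
```

and we iterate over it in streaming mode like this:

```python
from itertools import islice
from datasets import load_dataset

ids = load_dataset("path/to/my_h5_dataset.py", split="train", streaming=True)
for example in islice(ids, 10):
    print(example)  # takes 10+ minutes before the first examples appear
```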
The `fsspec` interface works, but it takes 10+ minutes to print the first 10 examples, which is even longer than the download time. I'm not sure whether it just caches the whole hdf5 file first and then generates the examples.

Expected behavior
So does the following make sense so far?

1. download the files
2. load the iterable dataset faster (using the raw file cache at `.cache/huggingface/datasets/downloads`)

I made some tests (roughly along the lines of the sketch below), but I couldn't get the expected result. I'm not sure if this is supported. I also found issue #6327. It seemed similar to mine, but I couldn't find a solution.
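A sketch of the kind of test I mean (same placeholder names as above):

```python
from datasets import DownloadManager, load_dataset

# 1. download the raw hdf5 file into the downloads cache
#    (.cache/huggingface/datasets/downloads)
dl_manager = DownloadManager()
local_path = dl_manager.download("https://example.com/data/train.h5")
print(local_path)

# 2. the hope was that the streaming dataset would pick up that cached file,
#    but iterating still streams from the remote URL through fsspec.
ids = load_dataset("path/to/my_h5_dataset.py", split="train", streaming=True)
print(next(iter(ids)))
```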
Environment info
- `datasets` = 2.18.0
- `h5py` = 3.10.0
- `fsspec` = 2023.10.0