huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load `streaming=True` dataset with downloaded cache #7040

Open wanghaoyucn opened 4 months ago

wanghaoyucn commented 4 months ago

Describe the bug

We built a dataset from several HDF5 files and wrote a loading script that uses h5py to generate the examples. The HDF5 files are large, and the processed dataset cache takes up additional disk space, so we would like to use a streaming iterable dataset instead. Unfortunately, h5py can't turn a remote URL into an HDF5 file object, so we use fsspec as the interface, as shown below:

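import fsspec
import h5py

def _generate_examples(self, filepath, split):
    for file in filepath:
        # Open each (possibly remote) file via fsspec so h5py receives a file-like object
        with fsspec.open(file, "rb") as fs:
            with h5py.File(fs, "r") as fp:
                # for event_id in sorted(list(fp.keys())):
                event_ids = list(fp.keys())
                ......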

Steps to reproduce the bug

The fsspec approach works, but it takes more than 10 minutes to print the first 10 examples, which is even longer than downloading the files takes. I'm not sure whether it simply caches the whole HDF5 file before generating the examples.
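For reference, a minimal sketch of how the time to the first 10 examples could be measured. `time_first_examples` is a hypothetical helper, not part of `datasets`, and the `range` iterator is only a stand-in for the streaming dataset; any iterable can be passed instead.

```python
import itertools
import time


def time_first_examples(iterable, n=10):
    """Return the first n items of iterable and the seconds spent fetching them."""
    start = time.perf_counter()
    # islice pulls only n items, so a lazy/streaming source is not fully consumed
    first = list(itertools.islice(iterable, n))
    elapsed = time.perf_counter() - start
    return first, elapsed


# Stand-in for the streaming dataset (assumption: the real object is iterable).
examples, seconds = time_first_examples(iter(range(100)), n=10)
print(f"fetched {len(examples)} examples in {seconds:.3f}s")
```

Running the same helper on the streaming dataset and on the non-streaming one would show whether the delay occurs before the first example is yielded.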

Expected behavior

Does the following workflow make sense?

  1. Download the files:

    dataset = datasets.load_dataset('path/to/myscripts', split="train", name="event", trust_remote_code=True)

  2. Load the iterable dataset faster, reusing the raw file cache at .cache/huggingface/datasets/downloads:

    dataset = datasets.load_dataset('path/to/myscripts', split="train", name="event", trust_remote_code=True, streaming=True)

I ran some tests, but the code above does not give the expected result. I'm not sure whether this is supported. I also found issue #6327, which seems similar to mine, but I couldn't find a solution there.

Environment info

albertvillanova commented 4 months ago

When you pass streaming=True, the cache is ignored. The remote data URL is used instead and the data is streamed from the remote server.

wanghaoyucn commented 4 months ago

Thanks for your reply! Is there any way to get the behavior I expected other than cloning the whole repo? Or could I adjust my script to load the downloaded Arrow files and generate the dataset in a streaming fashion?