huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

`load_from_disk` vs `load_dataset` performance. #5609

Open davidgilbertson opened 1 year ago

davidgilbertson commented 1 year ago

Describe the bug

I have downloaded openwebtext (~12GB) and filtered out a small amount of junk (it's still huge). Now, I would like to use this filtered version for future work. It seems I have two choices:

  1. Use `load_dataset` each time, relying on the cache mechanism, and re-run my filtering.
  2. `save_to_disk` and then use `load_from_disk` to load the filtered version (both options are sketched just below).
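
A minimal sketch of the two approaches (the filter and the local path here are placeholders for what I actually run):

```python
from datasets import load_dataset, load_from_disk

# Option 1: rely on load_dataset's cache and re-apply the filter on every run.
dataset = load_dataset("openwebtext", split="train")
dataset = dataset.filter(lambda example: len(example["text"]) > 0)  # placeholder filter

# Option 2: save the filtered dataset once, then reload it directly in future runs.
dataset.save_to_disk("openwebtext_filtered")      # one-off
dataset = load_from_disk("openwebtext_filtered")  # subsequent runs
```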

The performance of these two approaches is wildly different.

I don't know if you'd call this a bug, but it seems like there shouldn't be two methods for loading from disk, or that they shouldn't take such wildly different amounts of time, or that one of them shouldn't crash. Or, at the very least, the docs could offer some guidance on when to pick which method, why two methods exist, and how most people handle this.

Something I couldn't work out from the docs: can I modify a dataset from the Hub, save it locally, and use `load_dataset` to load it? This post seemed to suggest that the answer is no.

Steps to reproduce the bug

See above

Expected behavior

Load times should be about the same.

Environment info

mariosasko commented 1 year ago

Hi! We've recently made some improvements to `save_to_disk`/`load_from_disk` (100x faster in some scenarios), so it would help if you could install datasets directly from main (`pip install git+https://github.com/huggingface/datasets.git`) and re-run the "benchmark".
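
Something along these lines is enough for a rough comparison (the dataset name and local path are placeholders for whatever you saved):

```python
# Rough timing of the two loading paths.
import time

from datasets import load_dataset, load_from_disk

start = time.perf_counter()
ds = load_dataset("openwebtext", split="train")  # served from the cache after the first download
print(f"load_dataset: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
ds = load_from_disk("openwebtext_filtered")
print(f"load_from_disk: {time.perf_counter() - start:.1f}s")
```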

davidgilbertson commented 1 year ago

Great to hear! I'll give it a try when I've got a moment.

mjamroz commented 1 year ago

@mariosasko has that fix been released to pip in the meantime? Asking because I'm still facing the same issue (regarding loading images from local paths):

```python
from datasets import Image, load_dataset

# Build the dataset from a CSV of local image paths and labels.
dataset = load_dataset("csv", cache_dir="cache", data_files=["/STORAGE/DATA/mijam/vit/code/list_filtered.csv"], num_proc=16, split="train").cast_column("image", Image())
dataset = dataset.class_encode_column("label")
```

This runs quite fast.

Then I call `save_to_disk()` and, some time later:

```python
from datasets import load_from_disk

dataset = load_from_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big')
```

This is really slow. In theory it should be quicker, since it only loads Arrow files, with no conversions and so on.
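
For completeness, the save step in between is just the plain call (same path as in the `load_from_disk` line above):

```python
# The intermediate save step, with default arguments.
dataset.save_to_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big')
```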

mariosasko commented 1 year ago

@mjamroz I assume your CSV file stores image file paths. This means `save_to_disk` needs to embed the image bytes, resulting in a much bigger Arrow file than the initial one. Maybe specifying `num_shards` to make the individual Arrow files smaller can help (large Arrow files take a long time to load on some systems).
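
For example, something along these lines (the shard count is only an illustrative value, tune it to your dataset size):

```python
# Save into many smaller Arrow files instead of a few large ones.
dataset.save_to_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big', num_shards=64)
```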