huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

`load_from_disk` vs `load_dataset` performance. #5609

Open davidgilbertson opened 1 year ago

davidgilbertson commented 1 year ago

Describe the bug

I have downloaded openwebtext (~12GB) and filtered out a small amount of junk (it's still huge). Now, I would like to use this filtered version for future work. It seems I have two choices:

  1. Use `load_dataset` each time, relying on the cache mechanism, and re-run my filtering.
  2. `save_to_disk` and then use `load_from_disk` to load the filtered version (both options are sketched just below).
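
A minimal sketch of the two approaches (the filter and the local path here are placeholders for what I actually run):

```python
from datasets import load_dataset, load_from_disk

# Option 1: rely on load_dataset's cache and re-apply the filter on every run.
dataset = load_dataset("openwebtext", split="train")
dataset = dataset.filter(lambda example: len(example["text"]) > 0)  # placeholder filter

# Option 2: save the filtered dataset once, then reload it directly in future runs.
dataset.save_to_disk("openwebtext_filtered")      # one-off
dataset = load_from_disk("openwebtext_filtered")  # subsequent runs
```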

The performance of these two approaches is wildly different.

I don't know if you'd call this a bug, but it seems like there shouldn't be two methods for loading from disk, or that they shouldn't take such wildly different amounts of time, or that one of them shouldn't crash. Or, at the very least, the docs could offer some guidance on when to pick which method, why two methods exist, and how most people handle this.

Something I couldn't work out from the docs: can I modify a dataset from the Hub, save it locally, and use `load_dataset` to load it? This post seemed to suggest that the answer is no.

Steps to reproduce the bug

See above

Expected behavior

Load times should be about the same.

Environment info

mariosasko commented 1 year ago

Hi! We've recently made some improvements to `save_to_disk`/`load_from_disk` (100x faster in some scenarios), so it would help if you could install datasets directly from main (`pip install git+https://github.com/huggingface/datasets.git`) and re-run the "benchmark".
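
Something along these lines is enough for a rough comparison (the dataset name and local path are placeholders for whatever you saved):

```python
# Rough timing of the two loading paths.
import time

from datasets import load_dataset, load_from_disk

start = time.perf_counter()
ds = load_dataset("openwebtext", split="train")  # served from the cache after the first download
print(f"load_dataset: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
ds = load_from_disk("openwebtext_filtered")
print(f"load_from_disk: {time.perf_counter() - start:.1f}s")
```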

davidgilbertson commented 1 year ago

Great to hear! I'll give it a try when I've got a moment.

mjamroz commented 1 year ago

@mariosasko has that fix been released to pip in the meantime? Asking because I'm still facing the same issue (regarding loading images from local paths):

```python
from datasets import Image, load_dataset

# Build the dataset from a CSV of local image paths and labels.
dataset = load_dataset("csv", cache_dir="cache", data_files=["/STORAGE/DATA/mijam/vit/code/list_filtered.csv"], num_proc=16, split="train").cast_column("image", Image())
dataset = dataset.class_encode_column("label")
```

This runs quite fast.

Then I call `save_to_disk()` and, some time later:

```python
from datasets import load_from_disk

dataset = load_from_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big')
```

This is really slow. In theory it should be quicker, since it only loads Arrow files, with no conversions and so on.
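
For completeness, the save step in between is just the plain call (same path as in the `load_from_disk` line above):

```python
# The intermediate save step, with default arguments.
dataset.save_to_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big')
```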

mariosasko commented 1 year ago

@mjamroz I assume your CSV file stores image file paths. This means `save_to_disk` needs to embed the image bytes, resulting in a much bigger Arrow file than the initial one. Maybe specifying `num_shards` to make the individual Arrow files smaller can help (large Arrow files take a long time to load on some systems).
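
For example, something along these lines (the shard count is only an illustrative value, tune it to your dataset size):

```python
# Save into many smaller Arrow files instead of a few large ones.
dataset.save_to_disk('/STORAGE/DATA/mijam/accel/saved_arrow_big', num_shards=64)
```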