davidgilbertson opened this issue 1 year ago
Hi! We've recently made some improvements to `save_to_disk`/`load_from_disk` (100x faster in some scenarios), so it would help if you could install `datasets` directly from `main` (`pip install git+https://github.com/huggingface/datasets.git`) and re-run the "benchmark".
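For anyone re-timing this, a minimal sketch of the measurement (the saved-dataset path is a placeholder):

```python
import time

from datasets import load_from_disk

start = time.perf_counter()
ds = load_from_disk("path/to/saved_dataset")  # placeholder: a dataset written by save_to_disk
print(f"load_from_disk took {time.perf_counter() - start:.1f}s for {len(ds):,} rows")
```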
Great to hear! I'll give it a try when I've got a moment.
@mariosasko is that fix released to pip in the meantime? Asking because I'm still facing the same issue (regarding loading images from local paths):

```python
from datasets import Image, load_dataset

dataset = load_dataset(
    "csv",
    cache_dir="cache",
    data_files=["/STORAGE/DATA/mijam/vit/code/list_filtered.csv"],
    num_proc=16,
    split="train",
).cast_column("image", Image())
dataset = dataset.class_encode_column("label")
```

That part is quite fast. Then I call `save_to_disk()`, and some time later:

```python
from datasets import load_from_disk

dataset = load_from_disk("/STORAGE/DATA/mijam/accel/saved_arrow_big")
```

This is really slow. In theory it should be quicker, since it only loads Arrow files, with no conversions and so on.
@mjamroz I assume your CSV file stores image file paths. This means `save_to_disk` needs to embed the image bytes, resulting in a much bigger Arrow file than the initial one. Maybe specifying `num_shards` to make the Arrow files smaller can help (large Arrow files on some systems take a long time to load).
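For example, a minimal sketch reusing the `dataset` and path from the snippet above (the shard count is an arbitrary illustration, not a tuned value):

```python
# Write more, smaller Arrow shards instead of a few large ones; very large
# Arrow files can take a long time to load on some systems.
dataset.save_to_disk("/STORAGE/DATA/mijam/accel/saved_arrow_big", num_shards=128)
```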
Describe the bug
I have downloaded openwebtext (~12GB) and filtered out a small amount of junk (it's still huge). Now, I would like to use this filtered version for future work. It seems I have two choices:

1. `load_dataset` each time, relying on the cache mechanism, and re-run my filtering.
2. `save_to_disk` and then use `load_from_disk` to load the filtered version.

The performance of these two approaches is wildly different (sketched in code below):

- `load_dataset` takes about 20 seconds to load the dataset, and a few seconds to re-filter (thanks to the brilliant filter/map caching).
- `load_from_disk` takes 14 minutes! And the second time I tried, the session just crashed (on a machine with 32GB of RAM).

I don't know if you'd call this a bug, but it seems like there shouldn't need to be two methods to load from disk, or that they should not take such wildly different amounts of time, or that one should not crash. Or maybe the docs could offer some guidance about when to pick which method and why two methods exist, or just how most people do it?
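A sketch of the two approaches, using a hypothetical `is_junk` predicate in place of the actual filtering logic:

```python
from datasets import load_dataset, load_from_disk

def is_junk(text: str) -> bool:
    # Hypothetical stand-in for the actual filtering logic.
    return len(text) < 100

# Approach 1: load_dataset each time and re-filter; the filter result is
# cached, so subsequent runs only take a few seconds.
ds = load_dataset("openwebtext", split="train")
ds = ds.filter(lambda example: not is_junk(example["text"]))

# Approach 2: save the filtered dataset once, then load it back directly.
ds.save_to_disk("openwebtext_filtered")
ds = load_from_disk("openwebtext_filtered")
```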
Something I couldn't work out from reading the docs was this: can I modify a dataset from the Hub, save it (locally) and use `load_dataset` to load it? This post seemed to suggest that the answer is no.

Steps to reproduce the bug
See above
Expected behavior
Load times should be about the same.
Environment info
`datasets` version: 2.9.0