stas00 opened this issue 2 years ago
I agree the situation is not ideal and it would be awesome to use `load_dataset` to reload a dataset saved locally!

For context:

- `load_dataset` works in three steps: download the dataset, prepare it as an Arrow dataset, and finally return a memory-mapped Arrow dataset. In particular, it creates a cache directory to store the Arrow data and the subsequent cache files for `map`.
- `load_from_disk` directly returns a memory-mapped dataset from the Arrow file (similar to `Dataset.from_file`). It doesn't create a cache directory; instead, all subsequent `map` calls write to the same directory as the original data.
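For illustration, a minimal sketch of how the two entry points behave today (dataset name and paths are placeholders):

```python
from datasets import load_dataset, load_from_disk

# load_dataset: download -> prepare as Arrow -> memory-map, with a cache
# directory under ~/.cache/huggingface/datasets for the Arrow data and map() results
ds = load_dataset("imdb", split="train")

# save_to_disk / load_from_disk: memory-map the saved Arrow files directly;
# no cache directory is created, and later map() results are written next to the data
ds.save_to_disk("./imdb-train")
reloaded = load_from_disk("./imdb-train")
```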
If we want to keep the `download_and_prepare` step for consistency, it would unnecessarily copy the Arrow data into the datasets cache. On the other hand, if we don't do this step, the cache directory doesn't exist, which is inconsistent.
I'm curious, what would you expect to happen in this situation?
Thank you for the detailed breakdown, @lhoestq

> I'm curious, what would you expect to happen in this situation?
The simplest solution is to add a flag to the dataset saved by `save_to_disk` and have `load_dataset` check that flag; if it's set, simply switch control to `load_from_disk` behind the scenes. So `load_dataset` detects that it's a local filesystem path, looks inside to see whether it's something it can cache or whether it should use it directly as is, and continues accordingly with one of the two dataset-type-specific APIs.
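A rough sketch of that dispatch (the marker file name and the wrapper are hypothetical, not an existing `datasets` API):

```python
import os

from datasets import load_dataset, load_from_disk

MARKER = "saved_with_save_to_disk.flag"  # hypothetical flag written by save_to_disk


def smart_load(dataset_name_or_path, **kwargs):
    """Route to load_from_disk when the path was produced by save_to_disk."""
    if os.path.isdir(dataset_name_or_path) and os.path.isfile(
        os.path.join(dataset_name_or_path, MARKER)
    ):
        return load_from_disk(dataset_name_or_path)
    # otherwise fall through to the normal load_dataset code path
    return load_dataset(dataset_name_or_path, **kwargs)
```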
The more involved solution is to treat a dataset produced by `save_to_disk` as a remote resource, like the Hub. So the first time `load_dataset` sees it, it takes a fingerprint and creates a normal cached dataset. On subsequent uses it again discovers it as a remote resource, validates that it has it cached via the fingerprint, and serves it as a normal dataset.
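A minimal sketch of the fingerprinting part of that idea (the helper name and hashing scheme are assumptions):

```python
import hashlib
import os


def directory_fingerprint(path: str) -> str:
    """Fingerprint a save_to_disk directory from file names, sizes and mtimes,
    so load_dataset could detect whether its cached copy is still up to date."""
    h = hashlib.sha256()
    for root, _, files in sorted(os.walk(path)):
        for name in sorted(files):
            stat = os.stat(os.path.join(root, name))
            h.update(f"{name}:{stat.st_size}:{stat.st_mtime_ns}".encode())
    return h.hexdigest()
```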
As you said, the con of approach 2 is that if the dataset is huge it'll make two copies on the same machine. So it's possible that both approaches can be integrated: say, if `save_to_disk(do_not_cache=True)` is passed, use solution 1, otherwise solution 2. Or we could even symlink the huge Arrow files into the cache instead. Or perhaps it's more intuitive to use `load_dataset(do_not_cache=True)` instead, so that one can choose whether or not to make a cached copy of the locally saved dataset, i.e. simple user control at the point of use.

Surely there are other ways to handle it; this is just one possibility.
I think the simplest is to always memory-map the local file without a copy, but still have a cache directory under `~/.cache/huggingface` instead of saving `map` results next to the original data.
In practice we can even use symlinks if it makes the implementation simpler.
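A minimal sketch of the symlink variant (the cache layout and helper are assumptions, not the actual implementation):

```python
import os


def link_arrow_into_cache(arrow_file: str, cache_dir: str) -> str:
    """Symlink an existing Arrow file into the datasets cache so that map()
    results can live under ~/.cache/huggingface without copying the payload."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(arrow_file))
    if not os.path.exists(target):
        os.symlink(os.path.abspath(arrow_file), target)
    return target
```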
Yes, so that you always have the cached entry for any dataset, but the "payload" doesn't have to be physically in the cache if it's already on the local filesystem. As you said a symlink will do.
Any updates?
We haven't had the bandwidth to implement this so far. Let me know if you'd be interested in contributing this feature :)
@lhoestq I can jump into that. What I don't like is having functions that take many input parameters. Even though they are optional, it's always harder to reason about and test such cases. If there are more features worth working on, feel free to ping me. It's a lot of fun to help :smile:
Thanks a lot for your help @mariusz-jachimowicz-83 :)
I think as a first step we could implement an Arrow dataset builder to be able to load and stream Arrow datasets locally or from Hugging Face. Maybe something similar to the Parquet builder at `src/datasets/packaged_modules/parquet/parquet.py`?
And we can deal with the disk space optimization as a second step. What do you think?
(this issue is also related to https://github.com/huggingface/datasets/issues/3035)
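To make the suggestion concrete, here is a rough sketch of what such a packaged Arrow builder could look like, loosely modeled on the Parquet builder; the class, config, and the IPC-stream assumption are mine, not the merged implementation:

```python
import pyarrow as pa

import datasets


class ArrowConfig(datasets.BuilderConfig):
    """BuilderConfig for local/remote Arrow files (hypothetical)."""


class Arrow(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = ArrowConfig

    def _info(self):
        return datasets.DatasetInfo()

    def _split_generators(self, dl_manager):
        # data_files maps split names to lists of .arrow files (local or remote)
        data_files = dl_manager.download_and_extract(self.config.data_files)
        return [
            datasets.SplitGenerator(name=split, gen_kwargs={"files": files})
            for split, files in data_files.items()
        ]

    def _generate_tables(self, files):
        for file_idx, file in enumerate(files):
            with open(file, "rb") as f:
                # save_to_disk writes Arrow IPC stream files, so read them as a stream
                reader = pa.ipc.open_stream(f)
                for batch_idx, batch in enumerate(reader):
                    yield f"{file_idx}_{batch_idx}", pa.Table.from_batches([batch])
```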
@lhoestq I made a PR based on that suggestion: https://github.com/huggingface/datasets/pull/5944. Could you please review it?
@lhoestq Let me know if you have further recommendations, or anything that you would like to add but don't have bandwidth for.
Any update on this issue? It makes existing scripts and examples fall flat when provided with a customized/preprocessed dataset saved to disk.
This would be really useful in terms of user experience.
**Is your feature request related to a problem? Please describe.**

Is it possible to make `load_dataset` more universal, similar to `from_pretrained` in `transformers`, so that it can handle the Hub and local-path datasets of all supported types? Currently one has to choose a different loader depending on how the dataset has been created.

e.g. this won't work:
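A minimal reproduction along these lines (directory names are placeholders):

```python
from datasets import load_dataset

# first load: a local dataset directory (e.g. a cloned Hub repo with data files)
ds = load_dataset("./test-dataset", split="train")   # works

ds.save_to_disk("./my_dataset")

ds = load_dataset("./my_dataset")                    # fails: load_dataset does not
# recognize the save_to_disk layout; load_from_disk is required instead
```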
Both times the dataset is being loaded from disk. Why does it fail the second time?

Why can't `save_to_disk` generate a dataset that can be immediately loaded by `load_dataset`?

e.g. the simplest hack would be to have `save_to_disk` add some flag to the saved dataset that tells `load_dataset` to internally call `load_from_disk`, like having `save_to_disk` create a `load_me_with_load_from_disk.txt` file ;) and `load_dataset` would support that feature for datasets saved with new `datasets` versions. The old ones will still need to use `load_from_disk` explicitly. Unless the flag is not needed and one can immediately tell by looking at the saved dataset that it was saved via `save_to_disk`, and thus use `load_from_disk` internally.

The use case is defining a simple API where the user only ever needs to pass a `dataset_name_or_path` and it will always just work. Currently one needs to manually add additional switches telling the system whether to use one loading method or the other, which works but it's not smooth.

Thank you!