huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

integrate `load_from_disk` into `load_dataset` #5044

Open stas00 opened 1 year ago

stas00 commented 1 year ago

Is your feature request related to a problem? Please describe.

Is it possible to make load_dataset more universal, similar to from_pretrained in transformers, so that it can handle the Hub and local path datasets of all supported types?

Currently one has to choose a different loader depending on how the dataset has been created.

e.g. this won't work:

```bash
$ git clone https://huggingface.co/datasets/severo/test-parquet
$ python -c 'from datasets import load_dataset; ds=load_dataset("test-parquet"); \
ds.save_to_disk("my_dataset"); load_dataset("my_dataset")'
```

```
[...]

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/load.py", line 1746, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 704, in download_and_prepare
    self._download_and_prepare(
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/builder.py", line 1277, in _prepare_split
    writer.write_table(table)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/arrow_writer.py", line 524, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 2005, in table_cast
    return cast_table_to_schema(table, schema)
  File "/home/stas/anaconda3/envs/py38-pt112/lib/python3.8/site-packages/datasets/table.py", line 1968, in cast_table_to_schema
    raise ValueError(f"Couldn't cast\n{table.schema}\nto\n{features}\nbecause column names don't match")
ValueError: Couldn't cast
_data_files: list<item: struct<filename: string>>
  child 0, item: struct<filename: string>
      child 0, filename: string
```

Both times the dataset is being loaded from disk. Why does it fail the second time?

Why can't save_to_disk generate a dataset that can be immediately loaded by load_dataset?

e.g. the simplest hack would be to have save_to_disk add some flag to the saved dataset that tells load_dataset to internally call load_from_disk - like having save_to_disk create a load_me_with_load_from_disk.txt file ;) - and load_dataset would support that feature for datasets saved with newer datasets versions. The old ones will still need to use load_from_disk explicitly. Unless the flag isn't needed at all and one can already tell, just by looking at the saved dataset, that it was saved via save_to_disk and thus use load_from_disk internally.
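
e.g. on the user side such a dispatch could look roughly like this (the helper name is made up, and checking for save_to_disk's metadata files is just one possible way to detect such a dataset):

```python
import os

from datasets import load_dataset, load_from_disk


def load_dataset_or_from_disk(dataset_name_or_path, **kwargs):
    # save_to_disk writes a state.json (single Dataset) or a dataset_dict.json
    # (DatasetDict) next to the Arrow files - treat their presence as the "flag".
    is_saved_to_disk = os.path.isdir(dataset_name_or_path) and (
        os.path.isfile(os.path.join(dataset_name_or_path, "state.json"))
        or os.path.isfile(os.path.join(dataset_name_or_path, "dataset_dict.json"))
    )
    if is_saved_to_disk:
        return load_from_disk(dataset_name_or_path)
    return load_dataset(dataset_name_or_path, **kwargs)
```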

The use case is defining a simple API where the user only ever needs to pass a dataset_name_or_path and it will always just work. Currently one needs to manually add additional switches telling the system whether to use one loading method or the other, which works but isn't smooth.

Thank you!

lhoestq commented 1 year ago

I agree the situation is not ideal and it would be awesome to use load_dataset to reload a dataset saved locally!

For context:

If we want to keep the download_and_prepare step for consistency, it would unnecessarily copy the Arrow data into the datasets cache. On the other hand, if we skip this step, the cache directory doesn't exist, which is inconsistent.

I'm curious, what would you expect to happen in this situation?

stas00 commented 1 year ago

Thank you for the detailed breakdown, @lhoestq

> I'm curious, what would you expect to happen in this situation?

  1. the simplest solution is to add a flag to the dataset saved by save_to_disk and have load_dataset check that flag - if it's set, simply switch control to load_from_disk behind the scenes. So load_dataset detects it's a local filesystem path, looks inside to see whether it's something it should cache or use directly as is, and continues accordingly with one of the two dataset-type-specific APIs.

  2. the more evolved solution is to treat a dataset produced by save_to_disk as a remote resource, like the Hub. So the first time load_dataset sees it, it'll take a fingerprint and create a normal cached dataset. On subsequent uses it'll again discover it as a remote resource, validate via the fingerprint that it has it cached, and serve it as a normal dataset.

As you said, the con of approach 2 is that if the dataset is huge it'll make two copies on the same machine. So it's possible that both approaches can be integrated: say, if save_to_disk(do_not_cache=True) is passed it'll use solution 1, otherwise solution 2. Or it could even symlink the huge Arrow files into the cache instead? Or perhaps it's more intuitive to use load_dataset(do_not_cache=True) instead, so that one can choose whether to make a cached copy of the locally saved dataset - i.e. simple user-side control at the point of use.
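
The symlink variant could be as simple as something like this (the paths and cache layout are purely illustrative, not the actual cache structure):

```python
import os

# Illustrative only: link the Arrow/metadata files written by save_to_disk into
# the datasets cache, so a cache entry exists without duplicating the data.
src_dir = os.path.abspath("my_dataset")
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets/my_dataset")
os.makedirs(cache_dir, exist_ok=True)

for name in os.listdir(src_dir):
    if name.endswith((".arrow", ".json")):
        link = os.path.join(cache_dir, name)
        if not os.path.exists(link):
            os.symlink(os.path.join(src_dir, name), link)
```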

Surely there are other ways to handle it, this is just one possibility.

lhoestq commented 1 year ago

I think the simplest is to always memory-map the local file without copying, but still have a directory in the cache at ~/.cache/huggingface instead of saving map results next to the original data.
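
The memory-mapping part already works at the Dataset level; a rough sketch of the idea (the Arrow file name and the cache layout below are only illustrative):

```python
import os

from datasets import Dataset

# Memory-map an Arrow file produced by save_to_disk without copying it
# (the exact file name depends on the datasets version).
ds = Dataset.from_file("my_dataset/data-00000-of-00001.arrow")

# Keep derived files (e.g. map results) in the central cache rather than
# next to the original data, by pointing cache_file_name at it.
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets/my_dataset")
os.makedirs(cache_dir, exist_ok=True)
ds = ds.map(
    lambda example: example,  # no-op transform, just to show where the cache file goes
    cache_file_name=os.path.join(cache_dir, "mapped.arrow"),
)
```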

In practice we can even use symlinks if it makes the implementation simpler

stas00 commented 1 year ago

Yes, so that you always have the cached entry for any dataset, but the "payload" doesn't have to be physically in the cache if it's already on the local filesystem. As you said a symlink will do.

Mikkolehtimaki commented 1 year ago

Any updates?

lhoestq commented 1 year ago

We haven't had the bandwidth to implement this so far. Let me know if you'd be interested in contributing this feature :)

mariusz-jachimowicz-83 commented 1 year ago

@lhoestq I can jump into that. What I don't like is having functions that take many input parameters. Even though they are optional, it's always harder to reason about and test such cases. If there are more features worth working on, feel free to ping me. It's a lot of fun to help :smile:

lhoestq commented 1 year ago

Thanks a lot for your help @mariusz-jachimowicz-83 :)

I think as a first step we could implement an Arrow dataset builder to be able to load and stream Arrow datasets locally or from Hugging Face. Maybe something similar to the Parquet builder at src/datasets/packaged_modules/parquet/parquet.py?

And we can deal with the disk space optimization as a second step. What do you think?

(this issue is also related to https://github.com/huggingface/datasets/issues/3035)
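
For concreteness, a rough sketch of such an Arrow builder, loosely modeled on the Parquet packaged module (class names and details here are assumptions, not the final implementation):

```python
from dataclasses import dataclass
from typing import Optional

import pyarrow as pa

import datasets


@dataclass
class ArrowConfig(datasets.BuilderConfig):
    """BuilderConfig for Arrow files."""

    features: Optional[datasets.Features] = None


class Arrow(datasets.ArrowBasedBuilder):
    BUILDER_CONFIG_CLASS = ArrowConfig

    def _info(self):
        return datasets.DatasetInfo(features=self.config.features)

    def _split_generators(self, dl_manager):
        # self.config.data_files maps split names to lists of .arrow files
        data_files = dl_manager.download_and_extract(self.config.data_files)
        return [
            datasets.SplitGenerator(name=split, gen_kwargs={"files": files})
            for split, files in data_files.items()
        ]

    def _generate_tables(self, files):
        # Stream record batches so the data can also be iterated lazily.
        for file_idx, file in enumerate(files):
            with open(file, "rb") as f:
                reader = pa.ipc.open_stream(f)
                for batch_idx, batch in enumerate(reader):
                    yield f"{file_idx}_{batch_idx}", pa.Table.from_batches([batch])
```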

mariusz-jachimowicz-83 commented 1 year ago

@lhoestq I made a PR based on this suggestion: https://github.com/huggingface/datasets/pull/5944. Could you please review it?

mariusz-jachimowicz-83 commented 1 year ago

@lhoestq Let me know if you have further recommendations or anything you would like to add but don't have the bandwidth for.

RaphaelChevasson commented 7 months ago

Any update on this issue? It makes existing scripts and examples fall flat when provided with a customized/preprocessed dataset saved to disk.