huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Support cloud storage in load_dataset #5281

Open lhoestq opened 1 year ago

lhoestq commented 1 year ago

Would be nice to be able to do

data_files=["s3://..."]  # or gs:// or any cloud storage path
storage_options = {...}
load_dataset(..., data_files=data_files, storage_options=storage_options)

The idea would be to use fsspec as in download_and_prepare and save_to_disk.

This has been requested several times already. Some users want to use data from their private cloud storage to train models.

related:

- https://github.com/huggingface/datasets/issues/3490
- https://github.com/huggingface/datasets/issues/5244
- forum

alexjc commented 1 year ago

Or for example an archive on GitHub releases! Before I added support for JXL (locally only, PR still pending) I was considering hosting my files on GitHub instead...

iceboundflame commented 1 year ago

+1 to this. I would like to use 'audiofolder' with a data_dir that's on S3, for example. I don't want to upload my dataset to the Hub, but I would find all the fingerprinting/caching features useful.
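
For illustration, the kind of call being described might look something like this (the bucket path and credentials are placeholders; a cloud data_dir was not supported as of this comment):

from datasets import load_dataset

# Hypothetical usage sketch: point the audiofolder builder at an S3 prefix.
ds = load_dataset(
    "audiofolder",
    data_dir="s3://my-bucket/audio/",
    storage_options={"key": "<access key>", "secret": "<secret key>"},
)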

Dref360 commented 1 year ago

Adding to the conversation, Dask also uses fsspec for this feature.

Dask: How to connect to remote data

Happy to help on this feature :D

eballesteros commented 1 year ago

+1 to this feature request since I think it also tackles my use-case. I am collaborating with a team, working with a loading script which takes some time to generate the dataset artifacts. It would be very handy to use this as a cloud cache to avoid duplicating the effort.

Currently we can use builder.download_and_prepare(path_to_cloud_storage, storage_options, ...) to cache the artifacts to cloud storage, but builder.as_dataset() then raises NotImplementedError: Loading a dataset cached in SomeCloudFileSystem is not supported.
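
For context, a minimal sketch of this workflow (the loading script name, bucket path, and credentials are placeholders):

from datasets import load_dataset_builder

builder = load_dataset_builder("my_loading_script.py")

# Caching the prepared artifacts to cloud storage works...
builder.download_and_prepare(
    "s3://my-bucket/prepared_cache",
    storage_options={"key": "<access key>", "secret": "<secret key>"},
    file_format="parquet",
)

# ...but reading them back through the builder raises the NotImplementedError above.
ds = builder.as_dataset()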

lhoestq commented 1 year ago

Makes sense! If you want to locally load a dataset that you download_and_prepare'd on a cloud storage, you would indeed use load_dataset(path_to_cloud_storage). It would download the data from the cloud storage, cache it locally, and return a Dataset.
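
As a sketch, the intended round trip would then look something like this (paths and credentials are placeholders):

from datasets import load_dataset

storage_options = {"key": "<access key>", "secret": "<secret key>"}

# Download the prepared files from cloud storage, cache them locally, and return a Dataset.
ds = load_dataset("s3://my-bucket/prepared_cache", storage_options=storage_options)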

kyamagu commented 1 year ago

It seems that currently the cached_path function handles all URLs via get_from_cache, which only supports FTP and HTTP(S), here: https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/utils/file_utils.py#L181

I guess one can add another condition that handles s3:// or gs:// URLs via fsspec here.
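
A rough sketch of the kind of fsspec branch being suggested (the helper names here are hypothetical, not the actual datasets internals):

from urllib.parse import urlparse

import fsspec

def is_fsspec_url(url_or_filename):
    # Anything with a scheme other than http(s)/ftp (e.g. s3://, gs://) would go through fsspec.
    scheme = urlparse(url_or_filename).scheme
    return scheme not in ("", "file", "http", "https", "ftp")

def fetch_with_fsspec(url, local_path, storage_options=None):
    # Resolve the filesystem from the URL and copy the remote file into the local cache.
    fs, path = fsspec.core.url_to_fs(url, **(storage_options or {}))
    fs.get(path, local_path)
    return local_path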

dwyatte commented 1 year ago

I could use this functionality, so I put together a PR using @kyamagu's suggestion to use fsspec in datasets.utils.file_utils

https://github.com/huggingface/datasets/pull/5580

lhoestq commented 1 year ago

Thanks @dwyatte for adding support for fsspec urls

Let me just reopen this since the original issue is not resolved

janmaltel commented 1 year ago

I don't yet understand how to use https://github.com/huggingface/datasets/pull/5580 in order to call load_dataset(data_files="s3://..."). Any help/example would be much appreciated :) thanks!

lhoestq commented 1 year ago

It's still not officially supported x) But you can try updating request_etag in file_utils.py to use fsspec_head instead of http_head. It is responsible for getting the ETags of the remote files for caching. This change may do the trick for S3 URLs.

ssabatier commented 1 year ago

Thanks for your help on this and for merging #5580. I manually pulled the changes into my local datasets package (datasets/utils/file_utils.py), since that seemed to be the only file changed in the PR, and I'm getting the error InvalidSchema: No connection adapters were found for 's3://bucket/folder/'. I'm calling load_dataset using the S3 URI. When I use the S3 URL instead, I get HTTPError: 403 Client Error. Am I not supposed to use the S3 URI? How do I pull in the changes from this merge? I'm running datasets 2.10.1.

dwyatte commented 1 year ago

The current implementation depends on gcsfs/s3fs being able to authenticate through some other means, e.g., environment variables. For AWS, it looks like you can set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_SESSION_TOKEN.
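
For example, a minimal sketch of authenticating s3fs through environment variables before calling load_dataset (all values and paths are placeholders):

import os

from datasets import load_dataset

# Credentials picked up by s3fs; the session token is only needed for temporary credentials.
os.environ["AWS_ACCESS_KEY_ID"] = "<access key>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret key>"
os.environ["AWS_SESSION_TOKEN"] = "<session token>"

ds = load_dataset("csv", data_files="s3://my-bucket/data/train.csv")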

Note that while testing this just now, I did notice a discrepancy between gcsfs and s3fs that we might want to address: gcsfs passes the timeout from storage_options here down into the aiohttp.ClientSession.request, but s3fs does not handle this (it tries to pass it to the aiobotocore.session.AioSession constructor, raising TypeError: __init__() got an unexpected keyword argument 'requests_timeout').

It seems like a fair amount of work to unify kwargs across different fsspec implementations, so if the plan is to pass down storage_options, I wonder if we should just let users control the timeout (and other kwargs) through that and, if not specified, use the default?

dwyatte commented 1 year ago

@lhoestq here's a small PR for this: https://github.com/huggingface/datasets/pull/5673

mayorblock commented 1 year ago

@lhoestq sorry for being a little dense here, but I am very keen to use fsspec / adlfs for a larger image dataset I have for object detection. I have to keep it on Azure storage and would also like to avoid a full download or zipping (so use load_dataset(..., streaming=True)). So this development is a godsend :) only... I am unable to make it work.

Would you expect the setup to work for Azure Blob Storage (abfs:// paths via adlfs)? I appreciate that you mostly focus on S3, but it seems that, like the rest of the cloud storage functionality, it should also work for Azure Blob Storage.

I would imagine that something like (Streaming true or false):

d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")

would work with

# new_dataset.py
...
_URL = "abfs://container/image_folder"

archive_path = dl_manager.download(_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
    datasets.SplitGenerator(
        name=datasets.Split.TRAIN,
        gen_kwargs={
            "annotation_file_path": split_metadata_paths["train"],
            "files": dl_manager.iter_files(archive_path),
        },
    ),
...

but I get

Traceback (most recent call last):
...        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 963, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/.cache/huggingface/modules/datasets_modules/datasets/new_dataset/dd26a081eab90074f41fa2c821b458424fde393cc73d3d8241aca956d1fb3aa0/new_dataset_script.py", line 56, in _split_generators
    archive_path = dl_manager.download(_URL)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
                               ^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
           ^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 206, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse abfs://container/image_folder as a URL or as a local path

lhoestq commented 1 year ago

What version of datasets are you using?

hjq133 commented 1 year ago

@lhoestq hello, I still have a problem loading JSON from S3:

storage_options = { "key": xxxx, "secret": xxx, "endpoint_url": xxxx } path = 's3://xxx/xxxxxxx.json' dataset = load_dataset("json", data_files=path, storage_options=storage_options)

and it throws an error: TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'. I am using the latest 2.14.4_dev0 version.

mayorblock commented 1 year ago

Hi @lhoestq, thanks for getting back to me :) you have been busy over the summer, I see... I was on 2.12.0 and have now updated to 2.14.4.

Now d = load_dataset("new_dataset.py", storage_options=storage_options, split="train", streaming=True) works for Azure blob storage (with a local data loader script) when I explicitly list all blobs (I am struggling to make fs.ls(<path>) work in the script to make the list available to the download manager).

Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that dl_manager.download(_URL) always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).

Let me know if there is anything I can do to help.

Thanks,

lhoestq commented 1 year ago

Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that dl_manager.download(_URL) always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).

@mayorblock This is not supported right now, you have to use archives or implement a way to get the list by yourself
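
For what it's worth, a rough, untested sketch of the "get the list yourself" option using fsspec/adlfs (the container and folder names are placeholders, and storage_options must contain valid Azure credentials):

import fsspec

def list_image_files(storage_options):
    # adlfs must be installed for the "abfs" protocol to be available.
    fs = fsspec.filesystem("abfs", **storage_options)
    # fs.ls returns paths without the protocol prefix, so add it back before
    # handing the list to the download manager.
    return ["abfs://" + path for path in fs.ls("container/image_folder")]

The resulting list could then be passed to the download manager instead of a single folder URL.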

TypeError: AioSession.init() got an unexpected keyword argument 'hf'

@hjq133 Can you update fsspec and try again?

pip install -U fsspec

hjq133 commented 1 year ago

Thanks for your suggestion, it works now!

mariokostelac commented 11 months ago

I'm seeing the same problem as @hjq133 with the following versions:

datasets==2.15.0
(venv) ➜  finetuning-llama2 git:(main) ✗ pip freeze | grep s3fs   
s3fs==2023.10.0
(venv) ➜  finetuning-llama2 git:(main) ✗ pip freeze | grep fsspec
fsspec==2023.10.0

aarbelle commented 11 months ago

@lhoestq hello, i still have problem with loading json from S3:

storage_options = { "key": xxxx, "secret": xxx, "endpoint_url": xxxx } path = 's3://xxx/xxxxxxx.json' dataset = load_dataset("json", data_files=path, storage_options=storage_options)

and it throws an error: TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'. I am using the latest 2.14.4_dev0 version.

I am trying to do the same thing, but the loading just hangs without any error. @lhoestq, is there any documentation on how to load from private S3 buckets?

lhoestq commented 10 months ago

Hi! S3 support is still experimental. It seems like there is an extra hf field passed to the s3fs storage_options that causes this error. I just checked the source code of _prepare_single_hop_path_and_storage_options, and I think you can try explicitly passing your own storage_options={"s3": {...}}. Also note that it's generally better to load datasets from the HF Hub (we run extensive tests and benchmarks for speed and robustness).
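
Concretely, the suggested workaround would look something like this (credentials and paths are placeholders):

from datasets import load_dataset

# Nest the s3fs options under an "s3" key, as suggested above.
storage_options = {
    "s3": {
        "key": "<access key>",
        "secret": "<secret key>",
        "endpoint_url": "<endpoint url>",
    }
}
dataset = load_dataset(
    "json",
    data_files="s3://bucket/path/data.json",
    storage_options=storage_options,
)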

aarbelle commented 10 months ago

That worked, thanks! It seems though that data_dir=... doesn't work on S3, only data_files.

csanadpoda commented 10 months ago

@lhoestq Would this work with either an Azure Blob Storage container or its respective Azure Machine Learning datastore? If yes, what would that look like in code? I've tried a couple of combinations but had no success so far, on the latest version of datasets. I need to migrate a dataset to Azure; load_dataset("path_to_data") worked perfectly while the files were local. Thank you!

@mayorblock would you mind sharing how you got it to work? What did you pass as storage_options? Would it maybe work without a custom data loader script?

Sritharan-racap commented 10 months ago

This ticket would be of so much help.

dwyatte commented 9 months ago

@lhoestq I've been using this feature for the last year on GCS without problems, but I think we need to fix an issue with S3 and then document the supported calling patterns to reduce confusion.

It looks like datasets uses a default DownloadConfig, which is where some potentially unintended storage options are getting passed to fsspec:

DownloadConfig(
    cache_dir=None, 
    force_download=False, 
    resume_download=False, 
    local_files_only=False, 
    proxies=None, 
    user_agent=None, 
    extract_compressed_file=False, 
    force_extract=False, 
    delete_extracted=False, 
    use_etag=True, 
    num_proc=None, 
    max_retries=1, 
    token=None, 
    ignore_url_params=False, 
    storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}, 
    download_desc=None
)

(specifically the storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}} part)

gcsfs is robust to the extra key in storage_options for whatever reason, but s3fs is not (I haven't dug into why). I'm unable to test adlfs, but it looks like people here got it working.

Is this an issue that needs to be fixed in s3fs? Or can we avoid passing these default storage options in some cases?

Update: I think https://github.com/huggingface/datasets/pull/6127 is probably where these default storage options were introduced.

lhoestq commented 9 months ago

Hmm, not sure. Maybe it has to do with _prepare_single_hop_path_and_storage_options returning the "hf" storage options when it shouldn't.

Crashkurs commented 7 months ago

Also running into this issue when downloading a Parquet dataset from S3 (upload worked fine using the current main branch). dataset = Dataset.from_parquet('s3://path-to-file') raises TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'. I found that the issue was introduced in #6028.

When commenting out the __post_init__ part that sets 'hf', I am able to download the dataset.