lhoestq opened this issue 1 year ago
Or for example an archive on GitHub releases! Before I added support for JXL (locally only, PR still pending) I was considering hosting my files on GitHub instead...
+1 to this. I would like to use 'audiofolder' with a data_dir that's on S3, for example. I don't want to upload my dataset to the Hub, but I would find all the fingerprinting/caching features useful.
Adding to the conversation: Dask also uses `fsspec` for this feature. See the Dask docs: "How to connect to remote data".
Happy to help on this feature :D
+1 to this feature request since I think it also tackles my use-case. I am collaborating with a team, working with a loading script which takes some time to generate the dataset artifacts. It would be very handy to use this as a cloud cache to avoid duplicating the effort.
Currently we could use `builder.download_and_prepare(path_to_cloud_storage, storage_options, ...)` to cache the artifacts to cloud storage, but then `builder.as_dataset()` yields `NotImplementedError: Loading a dataset cached in SomeCloudFileSystem is not supported`.
Makes sense! If you want to locally load a dataset that you `download_and_prepare`d on a cloud storage, you would use `load_dataset(path_to_cloud_storage)` indeed. It would download the data from the cloud storage, cache it locally, and return a `Dataset`.
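A rough sketch of that two-step flow, wrapped in a function for clarity. The bucket URI, dataset name, and credentials are placeholders, and the final `load_dataset` call is the behavior this issue asks for, not something guaranteed to work today:

```python
def prepare_and_reload(output_dir, storage_options):
    """Sketch of the flow described above: prepare a dataset's artifacts into
    cloud storage once, then load them back with load_dataset.
    `output_dir` (e.g. "s3://my-bucket/my-dataset") and the credentials in
    `storage_options` are placeholders, not values from this thread."""
    from datasets import load_dataset, load_dataset_builder

    builder = load_dataset_builder("imdb")  # any builder works here
    builder.download_and_prepare(output_dir, storage_options=storage_options)
    # The requested behavior: point load_dataset at the prepared location
    # so it downloads from cloud storage, caches locally, returns a Dataset.
    return load_dataset(output_dir, storage_options=storage_options)
```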
It seems that currently the `cached_path` function handles all URLs via `get_from_cache`, which only supports `ftp` and `http(s)`, here:
https://github.com/huggingface/datasets/blob/b5672a956d5de864e6f5550e493527d962d6ae55/src/datasets/utils/file_utils.py#L181
I guess one could add another condition there that handles `s3://` or `gs://` URLs via `fsspec`.
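One possible shape for that extra condition, as a hypothetical helper (not the actual patch): detect non-HTTP/FTP schemes with the standard library so those URLs can be routed to `fsspec`:

```python
from urllib.parse import urlparse

# Schemes that file_utils already handles with its own HTTP/FTP code paths.
NATIVE_SCHEMES = {"http", "https", "ftp"}

def is_fsspec_url(url):
    """Return True for URLs like s3://... or gs://... that should be
    delegated to fsspec rather than the built-in HTTP downloader."""
    scheme = urlparse(url).scheme
    return "://" in url and scheme != "" and scheme not in NATIVE_SCHEMES
```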
I could use this functionality, so I put together a PR using @kyamagu's suggestion to use `fsspec` in `datasets.utils.file_utils`.
Thanks @dwyatte for adding support for `fsspec` URLs!
Let me just reopen this since the original issue is not resolved.
I'm not yet understanding how to use https://github.com/huggingface/datasets/pull/5580 in order to use `load_dataset(data_files="s3://...")`. Any help/example would be much appreciated :) thanks!
It's still not officially supported x) But you can try updating `request_etag` in `file_utils.py` to use `fsspec_head` instead of `http_head`. It is responsible for getting the ETags of the remote files for caching. This change may do the trick for S3 URLs.
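For illustration only, a cache fingerprint for non-HTTP files could be derived from `fsspec` metadata. This is a hypothetical stand-in sketched from the suggestion above, not the actual `fsspec_head` implementation:

```python
def fsspec_fingerprint(url, storage_options=None):
    """Hypothetical replacement for the HTTP ETag used in caching: derive a
    fingerprint for a remote file from fsspec metadata instead of a HEAD
    request. Assumes the relevant fsspec backend (s3fs, gcsfs, ...) is
    installed and can authenticate."""
    import fsspec

    fs, _, paths = fsspec.get_fs_token_paths(url, storage_options=storage_options or {})
    info = fs.info(paths[0])
    # S3 objects expose an ETag in their metadata; otherwise fall back to
    # the filesystem's generic checksum.
    return str(info.get("ETag") or fs.checksum(paths[0]))
```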
Thank you both for your help on this and for merging #5580. I manually pulled the changes into my local `datasets` package (`datasets/utils/file_utils.py`), since only this file seemed to be changed in the PR, and I'm getting the error `InvalidSchema: No connection adapters were found for 's3://bucket/folder/'`. I'm calling `load_dataset` using the S3 URI. When I use the S3 URL instead, I get `HTTPError: 403 Client Error`. Am I not supposed to use the S3 URI? How do I pull in the changes from this merge? I'm running `datasets` 2.10.1.
The current implementation depends on `gcsfs`/`s3fs` being able to authenticate through some other means, e.g., environment variables. For AWS, it looks like you can set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`.
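For example, the credentials could be exported before launching the process, or set programmatically; the values below are obviously placeholders:

```python
import os

# Placeholder credentials: s3fs/botocore pick these up from the environment
# when no explicit key/secret is passed via storage_options.
os.environ["AWS_ACCESS_KEY_ID"] = "AKIAEXAMPLE"
os.environ["AWS_SECRET_ACCESS_KEY"] = "example-secret"
os.environ["AWS_SESSION_TOKEN"] = "example-token"  # only needed for temporary credentials
```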
Note that while testing this just now, I did note a discrepancy between `gcsfs` and `s3fs` that we might want to address: `gcsfs` passes the timeout from `storage_options` down into `aiohttp.ClientSession.request`, but `s3fs` does not handle this (it tries to pass it to the `aiobotocore.session.AioSession` constructor, raising `TypeError: __init__() got an unexpected keyword argument 'requests_timeout'`).
It seems like some work to unify kwargs across different fsspec implementations, so if the plan is to pass down `storage_options`, I wonder if we should just let users control the timeout (and other kwargs) using that, and if not specified, use the default?
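The "let `storage_options` win, else use the default" idea could look like this hypothetical helper. The option name and default value are illustrative only (`gcsfs` and `s3fs` spell their timeout kwargs differently, which is the discrepancy noted above):

```python
DEFAULT_TIMEOUT = 300  # seconds; illustrative default, not datasets' actual value

def resolve_request_timeout(storage_options):
    """Take the user-provided timeout out of storage_options when present,
    otherwise fall back to the default; return the timeout plus the
    remaining options so backends never see a kwarg they don't accept."""
    options = dict(storage_options or {})
    timeout = options.pop("requests_timeout", DEFAULT_TIMEOUT)
    return timeout, options
```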
> It seems like some work to unify kwargs across different fsspec implementations, so if the plan is to pass down `storage_options`, I wonder if we should just let users control the timeout (and other kwargs) and if not specified, use the default?
@lhoestq here's a small PR for this: https://github.com/huggingface/datasets/pull/5673
@lhoestq sorry for being a little dense here, but I am very keen to use fsspec/adlfs for a larger image dataset I have for object detection. I have to keep it on Azure storage and would also like to avoid a full download or zipping (so use `load_dataset(..., streaming=True)`). So this development is a godsend :) only... I am unable to make it work.
Would you expect the setup to work for `json`, `parquet`, ...? I appreciate that you mostly focus on S3, but it seems that, similar to the remaining cloud storage functionality, it should also work for Azure blob storage.
I would imagine that something like (streaming true or false):

```python
d = load_dataset("new_dataset.py", storage_options=storage_options, split="train")
```

would work with

```python
# new_dataset.py
...
_URL = "abfs://container/image_folder"
archive_path = dl_manager.download(_URL)
split_metadata_paths = dl_manager.download(_METADATA_URLS)
return [
    datasets.SplitGenerator(
        name=datasets.Split.TRAIN,
        gen_kwargs={
            "annotation_file_path": split_metadata_paths["train"],
            "files": dl_manager.iter_files(archive_path),
        },
    ),
    ...
]
```
but I get

```
Traceback (most recent call last):
  ...
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/load.py", line 1797, in load_dataset
    builder_instance.download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 890, in download_and_prepare
    self._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/builder.py", line 963, in _download_and_prepare
    split_generators = self._split_generators(dl_manager, **split_generators_kwargs)
  File "~/.cache/huggingface/modules/datasets_modules/datasets/new_dataset/dd26a081eab90074f41fa2c821b458424fde393cc73d3d8241aca956d1fb3aa0/new_dataset_script.py", line 56, in _split_generators
    archive_path = dl_manager.download(_URL)
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 427, in download
    downloaded_path_or_paths = map_nested(
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/py_utils.py", line 435, in map_nested
    return function(data_struct)
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/download/download_manager.py", line 453, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "~/miniconda3/envs/hf/lib/python3.11/site-packages/datasets/utils/file_utils.py", line 206, in cached_path
    raise ValueError(f"unable to parse {url_or_filename} as a URL or as a local path")
ValueError: unable to parse abfs://container/image_folder as a URL or as a local path
```
What version of `datasets` are you using?
@lhoestq hello, I still have a problem with loading JSON from S3:

```python
storage_options = {"key": xxxx, "secret": xxx, "endpoint_url": xxxx}
path = 's3://xxx/xxxxxxx.json'
dataset = load_dataset("json", data_files=path, storage_options=storage_options)
```

and it throws an error: `TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'`. I use the latest 2.14.4_dev0 version.
Hi @lhoestq, thanks for getting back to me :) you have been busy over the summer I see... I was on 2.12.0. I have updated to 2.14.4.
Now `d = load_dataset("new_dataset.py", storage_options=storage_options, split="train", streaming=True)` works for Azure blob storage (with a local data loader script) when I explicitly list all blobs (I am struggling to make `fs.ls(<path>)` work in the script to make the list available to the download manager).
Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that `dl_manager.download(_URL)` always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).
Let me know if there is anything I can do to help.
Thanks,
> Any chance that it could work out-of-the-box by supplying just the image folder, not the full list of image filenames? It seems that `dl_manager.download(_URL)` always wants one or more (possibly archived) files. In my situation, where I don't want to archive or download, it would be great to just supply the folder (seems reasonably doable with fsspec).
@mayorblock This is not supported right now; you have to use archives or implement a way to get the list yourself.
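One hypothetical way to "get the list yourself" inside a loading script is to glob the remote folder with `fsspec` and hand the resulting URLs to the download manager. The `abfs` path and the helper itself are illustrative, not part of `datasets`:

```python
def list_remote_images(folder_url, storage_options=None):
    """Hypothetical helper for a loading script: expand a remote folder like
    "abfs://container/image_folder" into a flat list of file URLs that
    dl_manager.download / dl_manager.iter_files can consume. Assumes the
    matching fsspec backend (adlfs, s3fs, ...) is installed."""
    import fsspec

    fs, _, paths = fsspec.get_fs_token_paths(
        folder_url.rstrip("/") + "/**", storage_options=storage_options or {}
    )
    # fs.protocol may be a string or a tuple of aliases, e.g. ("abfs", "az").
    protocol = fs.protocol if isinstance(fs.protocol, str) else fs.protocol[0]
    return [f"{protocol}://{p}" for p in paths if not fs.isdir(p)]
```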
> `TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'`
@hjq133 Can you update `fsspec` and try again?
`pip install -U fsspec`
thanks for your suggestion, it works now!
I'm seeing the same problem as @hjq133 with the following versions:

```
datasets==2.15.0
s3fs==2023.10.0
fsspec==2023.10.0
```
> @lhoestq hello, I still have a problem with loading JSON from S3: [...] it throws an error: `TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'`
I am trying to do the same thing, but the loading is just hanging, without any error. @lhoestq is there any documentation how to load from private s3 buckets?
Hi! S3 support is still experimental. It seems like there is an extra `hf` field passed to the `s3fs` storage_options that causes this error. I just checked the source code of `_prepare_single_hop_path_and_storage_options`, and I think you can try passing your own `storage_options={"s3": {...}}` explicitly. Also note that it's generally better to load datasets from HF (we run extensive tests and benchmarks for speed and robustness).
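Concretely, the suggested call shape might look like the following sketch. The credentials, endpoint, and bucket path are placeholders; nesting the options under the `"s3"` protocol key is what routes them to `s3fs` explicitly:

```python
# Placeholder credentials; nest the s3fs options under the "s3" protocol key.
storage_options = {
    "s3": {
        "key": "AKIAEXAMPLE",
        "secret": "example-secret",
        "endpoint_url": "https://s3.example.com",
    }
}

def load_json_from_s3():
    """Sketch only: actually running this requires a reachable bucket
    and valid credentials."""
    from datasets import load_dataset
    return load_dataset(
        "json",
        data_files="s3://my-bucket/data.json",  # placeholder path
        storage_options=storage_options,
    )
```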
That worked! Thanks
It seems though that `data_dir=...` doesn't work on S3, only `data_files`.
@lhoestq Would this work either with an Azure Blob Storage container or its respective Azure Machine Learning datastore? If yes, what would that look like in code? I've tried a couple of combinations but no success so far, on the latest version of `datasets`. I need to migrate a dataset to the Azure cloud; `load_dataset("path_to_data")` worked perfectly while the files were local only. Thank you!
@mayorblock would you mind sharing how you got it to work? What did you pass as `storage_options`? Would it maybe work without a custom data loader script?
This ticket would be of so much help.
@lhoestq I've been using this feature for the last year on GCS without problems, but I think we need to fix an issue with S3 and then document the supported calling patterns to reduce confusion.
It looks like `datasets` uses a default `DownloadConfig`, which is where some potentially unintended storage options are getting passed to `fsspec`:
```python
DownloadConfig(
    cache_dir=None,
    force_download=False,
    resume_download=False,
    local_files_only=False,
    proxies=None,
    user_agent=None,
    extract_compressed_file=False,
    force_extract=False,
    delete_extracted=False,
    use_etag=True,
    num_proc=None,
    max_retries=1,
    token=None,
    ignore_url_params=False,
    storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}},
    download_desc=None,
)
```
(specifically the `storage_options={'hf': {'token': None, 'endpoint': 'https://huggingface.co'}}` part)
`gcsfs` is robust to the extra key in storage options for whatever reason, but `s3fs` is not (I haven't dug into why). I'm unable to test `adlfs`, but it looks like people here got it working.
Is this an issue that needs to be fixed in s3fs? Or can we avoid passing these default storage options in some cases?
Update: I think probably https://github.com/huggingface/datasets/pull/6127 is where these default storage options were introduced
Hmm, not sure. Maybe it has to do with `_prepare_single_hop_path_and_storage_options` returning the "hf" storage options when it shouldn't.
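Assuming the extra key really is the culprit, a minimal user-side workaround is to drop the `hf` entry before the options reach `s3fs`. This is a hypothetical helper sketched from the discussion, not part of `datasets`:

```python
def strip_hf_storage_options(storage_options):
    """Remove the "hf" entry that s3fs rejects, keeping everything else.
    Hypothetical workaround for the AioSession TypeError discussed above."""
    return {k: v for k, v in (storage_options or {}).items() if k != "hf"}
```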
Also running into this issue downloading a parquet dataset from S3 (upload worked fine using the current main branch).

```python
dataset = Dataset.from_parquet('s3://path-to-file')
```

raises

```
TypeError: AioSession.__init__() got an unexpected keyword argument 'hf'
```
Found that the issue was introduced in #6028. When commenting out the `__post_init__` part that sets 'hf', I am able to download the dataset.
Would be nice to be able to do this. The idea would be to use `fsspec` as in `download_and_prepare` and `save_to_disk`. This has been requested several times already: some users want to use their data from private cloud storage to train models.

Related:
https://github.com/huggingface/datasets/issues/3490
https://github.com/huggingface/datasets/issues/5244
forum