huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Directly reading parquet files in a s3 bucket from the load_dataset method #5566

Open shamanez opened 1 year ago

shamanez commented 1 year ago

Feature request

Right now, we have to download the parquet file to local storage before we can read it. Being able to read it directly from a given bucket address would be beneficial.
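To make the current workaround concrete, here is a minimal sketch of the "copy to local storage first" pattern. It assumes `s3fs` is installed and AWS credentials are available in the environment. The helper names (`local_path_for`, `load_parquet_via_local_copy`) and the `parquet_cache` directory layout are hypothetical and are not part of `datasets`.

```python
from pathlib import Path
from urllib.parse import urlparse


def local_path_for(s3_uri: str, cache_dir: str = "parquet_cache") -> Path:
    """Map s3://bucket/key to cache_dir/bucket/key (hypothetical layout)."""
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"expected s3://bucket/key, got {s3_uri!r}")
    return Path(cache_dir) / parsed.netloc / parsed.path.lstrip("/")


def load_parquet_via_local_copy(s3_uri: str, cache_dir: str = "parquet_cache"):
    """Download one parquet file from S3, then load the local copy."""
    # Optional dependencies are imported lazily so that the path helper
    # above can be used without them.
    import s3fs
    from datasets import load_dataset

    local_path = local_path_for(s3_uri, cache_dir)
    local_path.parent.mkdir(parents=True, exist_ok=True)

    # Assumption: credentials are picked up from the environment.
    fs = s3fs.S3FileSystem()
    fs.get(s3_uri, str(local_path))  # this copy is the step we would like to avoid

    return load_dataset("parquet", data_files=str(local_path), split="train")
```

The request is essentially to drop the `fs.get` step, so that the bucket address can be passed to `load_dataset` directly.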

Motivation

In a production setup, this feature would help us a lot, since we would not need to move training data files between storage systems.

Your contribution

I am willing to help if there is any way I can.

lhoestq commented 1 year ago

Hi! I think this is in the scope of this other issue: https://github.com/huggingface/datasets/issues/5281