huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

load_from_disk #7268

Open ghaith-mq opened 3 weeks ago

ghaith-mq commented 3 weeks ago

Describe the bug

I have data saved with save_to_disk. The dataset is large (700 GB). When I try to load it, the only option is load_from_disk, and this function copies the data to a tmp directory, causing me to run out of disk space. Is there an alternative solution?

Steps to reproduce the bug

Load a dataset with load_from_disk after it was saved with save_to_disk, as sketched below.
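A minimal sketch of the failing pattern; the path /data/my_dataset is hypothetical, and in the real case the saved dataset is ~700 GB:

from datasets import Dataset, load_from_disk

# Build and save a toy dataset (stands in for the 700 GB one).
ds = Dataset.from_dict({"text": ["a", "b", "c"]})
ds.save_to_disk("/data/my_dataset")

# Reload it. This is where the reporter observes the data being copied
# to a temporary directory, which can exhaust free disk space.
ds = load_from_disk("/data/my_dataset")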

Expected behavior

The dataset should load without being duplicated to a temporary directory; instead, the process runs out of disk space.

Environment info

latest version

Jourdelune commented 3 weeks ago

Hello, this is an interesting issue. I have the same problem: I have a local dataset and I want to push it to the Hub, but huggingface makes a copy of it first.

from datasets import load_dataset

dataset = load_dataset("webdataset", data_files="/media/works/data/*.tar")  # the local data is copied here
dataset.push_to_hub("WaveGenAI/audios2")

Edit: I can use HfApi for my use case.
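For reference, a sketch of that HfApi route: it uploads the raw files directly, so no intermediate copy of the dataset is materialized. The folder path and repo id are taken from the snippet above.

from huggingface_hub import HfApi

api = HfApi()
api.create_repo("WaveGenAI/audios2", repo_type="dataset", exist_ok=True)

# Upload the raw .tar shards as-is, without loading them through `datasets`.
api.upload_folder(
    folder_path="/media/works/data",
    repo_id="WaveGenAI/audios2",
    repo_type="dataset",
    allow_patterns="*.tar",
)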