Open ghaith-mq opened 3 weeks ago
Hello, this is an interesting issue. I have the same problem: I have a local dataset and want to push it to the Hub, but Hugging Face makes a copy of it first.
```python
from datasets import load_dataset

dataset = load_dataset("webdataset", data_files="/media/works/data/*.tar")  # copy here
dataset.push_to_hub("WaveGenAI/audios2")
```
Edit: I can use HfApi for my use case (see the sketch below).
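A minimal sketch of that HfApi route, assuming the same folder and repo id as in my snippet above. It uses `huggingface_hub.HfApi.upload_folder` to push the raw `.tar` shards directly, without building a `datasets` object first:

```python
from huggingface_hub import HfApi

api = HfApi()

# Create the dataset repo if it does not exist yet.
api.create_repo("WaveGenAI/audios2", repo_type="dataset", exist_ok=True)

# Upload the raw .tar shards as-is, skipping the local copy that
# load_dataset(...) + push_to_hub(...) would otherwise create.
api.upload_folder(
    folder_path="/media/works/data",  # local folder holding the shards
    repo_id="WaveGenAI/audios2",
    repo_type="dataset",
    allow_patterns="*.tar",           # only upload the tar files
)
```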
Describe the bug
I have data saved with save_to_disk. The data is large (about 700 GB). When I try to load it, the only option is load_from_disk, and this function copies the data to a temporary directory, which causes me to run out of disk space. Is there an alternative?
Steps to reproduce the bug
Try to load the data with load_from_disk after it was saved with save_to_disk.
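A minimal sketch of the scenario (the path is hypothetical; the report does not include one):

```python
from datasets import load_from_disk

# Hypothetical path to a dataset previously written with Dataset.save_to_disk,
# roughly 700 GB on disk.
ds = load_from_disk("/path/to/saved_dataset")
# Reported behavior: loading copies the data into a temporary directory,
# which exhausts the remaining disk space.
```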
Expected behavior
The dataset should load without copying the data to a temporary directory; instead, I run out of disk space.
Environment info
latest version of datasets