huggingface / datasets


save_to_disk() freezes when saving on s3 bucket with multiprocessing #6936

Open ycattan opened 5 months ago

ycattan commented 5 months ago

Describe the bug

I'm trying to save a Dataset using the save_to_disk() function with num_proc > 1 and an s3 bucket path as the destination.

The hf progress bar shows up, but the saving never seems to start. With a single process (num_proc=1), everything works fine. Saving to local disk (as opposed to an s3 bucket) with num_proc > 1 also works fine.

Thank you for your help! :)

Steps to reproduce the bug

I tried without any storage options:

from datasets import load_dataset

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
)

and with the specific s3fs storage options:

from datasets import load_dataset
from s3fs import S3FileSystem

def get_s3fs():
    return S3FileSystem()

sandbox_ds = load_dataset("openai_humaneval")
sandbox_ds["test"].save_to_disk(
    "s3://bucket-name/test_multiprocessing_saving/",
    num_proc=4,
    storage_options=get_s3fs().storage_options, # also tried: storage_options=S3FileSystem().storage_options
)

I suspect I may be using the storage_options parameter incorrectly, but I didn't find anything online that made it work.

NB: Behavior is the same when trying to save the whole DatasetDict.

Expected behavior

Progress bar fills in and saving is carried out.

Environment info

datasets==2.18.0

sfc-gh-ywei commented 3 months ago

I'm running into the same issue. Are there any updates on this?
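
Since the report above notes that saving to local disk with num_proc > 1 works, one possible stopgap is to save locally first and then copy the resulting directory to the bucket. The sketch below is only an illustration of that idea, not a fix for the underlying freeze. The helper name save_locally_then_upload is hypothetical, and it assumes fs is an fsspec-style filesystem (for example s3fs.S3FileSystem()) whose put() accepts recursive=True. It has not been verified against the setup in this issue, so check the resulting bucket layout before relying on it.

```python
import tempfile


def save_locally_then_upload(dataset, fs, remote_path, num_proc=4):
    """Save `dataset` to a local temp directory in parallel, then upload it.

    Hypothetical helper: `dataset` must expose save_to_disk(path, num_proc=...)
    and `fs` an fsspec-style put(lpath, rpath, recursive=True).
    """
    with tempfile.TemporaryDirectory() as local_dir:
        # Local multi-process saving is the combination reported to work.
        dataset.save_to_disk(local_dir, num_proc=num_proc)
        # The trailing slash asks fsspec to copy the directory's contents
        # rather than nesting the temp directory under remote_path.
        fs.put(local_dir + "/", remote_path, recursive=True)
    # The temporary directory is deleted once the upload call has returned.
```

With the objects from the reproduction steps, this could be called as save_locally_then_upload(sandbox_ds["test"], S3FileSystem(), "s3://bucket-name/test_multiprocessing_saving/"). The upload itself runs in a single process, so only the local write benefits from num_proc.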