huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Streaming dataset + interleave + DataLoader hangs with multiple workers #3993

Open jpilaul opened 2 years ago

jpilaul commented 2 years ago

Describe the bug

Interleaving multiple iterable datasets loaded with load_dataset in streaming mode hangs when the result is passed to torch.utils.data.DataLoader with multiple workers.

Steps to reproduce the bug

from datasets import interleave_datasets, load_dataset
from torch.utils.data import DataLoader

en_dataset = load_dataset('oscar', "unshuffled_deduplicated_en", split='train', streaming=True)
fr_dataset = load_dataset('oscar', "unshuffled_deduplicated_fr", split='train', streaming=True)
it_dataset = load_dataset('oscar', "unshuffled_deduplicated_it", split='train', streaming=True)
de_dataset = load_dataset('oscar', "unshuffled_deduplicated_de", split='train', streaming=True)
multilingual_dataset = interleave_datasets([en_dataset, fr_dataset, de_dataset, it_dataset])
multilingual_dataset = multilingual_dataset.with_format('torch')

next(iter(multilingual_dataset))  # works fairly fast

dataloader = DataLoader(multilingual_dataset, batch_size=8, num_workers=4)
for batch in dataloader:
    print(len(batch))  # prints nothing after 30 min of waiting

dataloader = DataLoader(multilingual_dataset, batch_size=8, num_workers=0)
for batch in dataloader:
    print(len(batch))  # prints right away

Expected results

The DataLoader should be able to iterate over the dataset with multiple workers.

Actual results

Results are printed with next(iter(multilingual_dataset)) and with num_workers=0, but nothing is printed with num_workers=4 or any other number above 0.
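When reproducing a hang like this, it can help to put a time limit on fetching the first item instead of waiting 30 minutes. The following is a minimal sketch using only the standard library; the helper name probe_first_item is hypothetical and is not part of datasets or PyTorch. It assumes the iterable can be consumed from a background thread, and it does not clean up any worker processes that a hung DataLoader may have started.

```python
import threading
from typing import Any, Iterable, Optional, Tuple


def probe_first_item(iterable: Iterable[Any], timeout_s: float) -> Tuple[str, Optional[Any]]:
    """Hypothetical helper: try to fetch the first item within `timeout_s` seconds.

    Returns one of:
      ("ok", item)       the first item arrived in time
      ("empty", None)    the iterable produced no items
      ("error", exc)     fetching raised an exception
      ("timeout", None)  nothing arrived before the deadline
    """
    result = {}

    def fetch() -> None:
        try:
            result["payload"] = next(iter(iterable))
            result["status"] = "ok"
        except StopIteration:
            result["payload"] = None
            result["status"] = "empty"
        except Exception as exc:  # report the failure instead of losing it in the thread
            result["payload"] = exc
            result["status"] = "error"

    # Daemon thread: a stuck fetch will not keep the interpreter alive at exit.
    thread = threading.Thread(target=fetch, daemon=True)
    thread.start()
    thread.join(timeout_s)

    if thread.is_alive():
        return "timeout", None
    return result["status"], result["payload"]


# Usage with a plain list; a DataLoader could be passed in the same way.
status, item = probe_first_item([10, 20, 30], timeout_s=1.0)
print(status, item)
```

A "timeout" status with num_workers=4 next to an "ok" status with num_workers=0 would match the behavior reported here. DataLoader also accepts a timeout argument for collecting a batch from workers, which may be a simpler option when only the DataLoader path is being tested.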

Environment info

jpilaul commented 2 years ago

The same thing occurs when streaming files loaded from disk.

lhoestq commented 2 years ago

Hi! Thanks for reporting. Could this be related to https://github.com/huggingface/datasets/issues/3950?

Currently, streaming datasets only work in a single process, but we're working on having them work in distributed setups as well :) (EDIT: done)

jpilaul commented 2 years ago

Hi, thanks for your reply. It seems related :)

Mohammed20201991 commented 1 year ago

+1

lhoestq commented 1 year ago

Please update datasets if you're having this issue. What version are you using?
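To answer the version question, the installed versions can be read from package metadata. This is a small sketch using the standard library's importlib.metadata; the helper name installed_version is made up for illustration, and the package names are assumed to be the usual distribution names, datasets and torch.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages involved in this issue.
for name in ("datasets", "torch"):
    print(name, installed_version(name) or "not installed")
```

Including this output in a report fills in the Environment info section, which was left empty above.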