huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Distributed data parallel training for streaming datasets #4694

Open cyk1337 opened 2 years ago

cyk1337 commented 2 years ago

Feature request

Is there any documentation for using load_dataset(streaming=True) with (multi-node, multi-GPU) DDP training?

Motivation

Given a bunch of data files, they are expected to be split across different GPUs. Is there a guide or documentation for this?

Your contribution

Does it require manually splitting the data files for each worker in DatasetBuilder._split_generator()? What is IterableDatasetShard expected to do?

lhoestq commented 2 years ago

Hi! According to https://huggingface.co/docs/datasets/use_with_pytorch#stream-data you can use the PyTorch DataLoader with num_workers>0 to distribute the shards across your workers: it uses torch.utils.data.get_worker_info() to get the worker ID and select the right subset of shards for each worker.

EDIT: here is a code example

from torch.utils.data import DataLoader

ds = ds.with_format("torch")
dataloader = DataLoader(ds, num_workers=num_workers)

EDIT: with_format("torch") is not required, now you can just do

dataloader = DataLoader(ds, num_workers=num_workers)
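To make the shard-distribution idea above concrete, here is a minimal pure-Python illustration (an assumption about the mechanism, not the library's actual code): each DataLoader worker takes a strided slice of the shard list, so no two workers stream the same file.

```python
def shards_for_worker(shards, worker_id, num_workers):
    # Illustrative sketch: worker i processes every num_workers-th shard,
    # starting at index i, so the workers cover disjoint subsets.
    return shards[worker_id::num_workers]

# Hypothetical shard names, for illustration only.
all_shards = [f"data-{i:02d}.jsonl" for i in range(6)]
assert shards_for_worker(all_shards, 0, 2) == ["data-00.jsonl", "data-02.jsonl", "data-04.jsonl"]
assert shards_for_worker(all_shards, 1, 2) == ["data-01.jsonl", "data-03.jsonl", "data-05.jsonl"]
```

In the real library, the worker ID and worker count come from torch.utils.data.get_worker_info() inside each DataLoader worker process.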

jackfeinmann5 commented 1 year ago

@cyk1337 does streaming datasets with multi-GPU work for you? I am testing on one node with multiple GPUs, but it is freezing: https://github.com/huggingface/datasets/issues/5123. In case you could make this work, could you share your data-loading code with me? Thank you.

Mohammed20201991 commented 1 year ago

+1

lhoestq commented 1 year ago

This has been implemented in datasets 2.8:

from datasets.distributed import split_dataset_by_node

ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)

docs: https://huggingface.co/docs/datasets/use_with_pytorch#distributed
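As I understand the linked docs, when the number of shards divides evenly by world_size each node streams a disjoint subset of shards; otherwise each node keeps one example out of every world_size, skipping the rest. A minimal pure-Python illustration of that skipping behavior (an assumed sketch, not the library's exact implementation):

```python
def examples_for_node(examples, rank, world_size):
    # Illustrative sketch: node `rank` keeps every world_size-th example,
    # so the nodes jointly cover the stream without overlap.
    return [ex for i, ex in enumerate(examples) if i % world_size == rank]

data = list(range(10))
assert examples_for_node(data, 0, 2) == [0, 2, 4, 6, 8]
assert examples_for_node(data, 1, 2) == [1, 3, 5, 7, 9]
```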

wconnell commented 1 year ago

I'm having hanging issues with this when using DDP and allocating the datasets with split_dataset_by_node 🤔


Edit:

I don't want to pollute this thread, but for the sake of following up: I observed hanging close to the final iteration of the dataloader. I think this was happening on the final shard. First, I removed the final shard and things worked. Then (including all shards) I reordered the list of shards, load_dataset('json', data_files=reordered, streaming=True), and there was no hang.

I won't open an issue yet because I am not quite sure about this observation.
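For anyone wanting to try the same workaround, here is a hypothetical sketch (the file names are invented for illustration): rotate the shard list so the original final shard is no longer last before streaming.

```python
# Hypothetical data files, for illustration only.
files = ["part-00.jsonl", "part-01.jsonl", "part-02.jsonl"]

# Rotate the list so the original last shard moves away from the end.
reordered = files[1:] + files[:1]
assert reordered == ["part-01.jsonl", "part-02.jsonl", "part-00.jsonl"]

# Then stream as described above:
# ds = load_dataset("json", data_files=reordered, streaming=True)
```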

albertvillanova commented 1 year ago

@wconnell would you mind opening a different bug issue and giving more details? https://github.com/huggingface/datasets/issues/new?assignees=&labels=&template=bug-report.yml

Thanks.