gongel opened 10 months ago
Can you please share more details?
The NVIDIA Megatron team proposed "Tensor Parallelism". When training with tensor parallelism, every rank in the same tensor-parallel group must receive the same data.
Paper: https://arxiv.org/pdf/2205.05198.pdf
Repo: https://github.com/NVIDIA/Megatron-LM
But Streaming only supports DDP/FSDP.
Is there any plan to add this?
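For context, here is a minimal sketch of the rank bookkeeping this implies, assuming Megatron's default grouping where consecutive global ranks form one tensor-parallel group; the function name and signature are illustrative, not Streaming's API:

```python
def parallel_ranks(global_rank: int, world_size: int, tp_size: int):
    """Map a global rank to (data-parallel rank, tensor-parallel rank).

    Assumes consecutive ranks form one tensor-parallel group, as in
    Megatron's default initialization.
    """
    dp_rank = global_rank // tp_size   # ranks in the same TP group share this
    tp_rank = global_rank % tp_size
    dp_world_size = world_size // tp_size
    return dp_rank, tp_rank, dp_world_size

# Example: with world_size=8 and tp_size=2, ranks 0 and 1 both get
# dp_rank=0, so a streaming dataset should feed them identical samples.
```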
One easy workaround (which does not seem to work) could be:

```python
import os

os.environ["WORLD_SIZE"] = str(int(os.environ["WORLD_SIZE"]) // model_parallel_size)
os.environ["RANK"] = str(int(os.environ["RANK"]) // model_parallel_size)
```
I tried it, but the code seems to get stuck after calling something like:

```python
batch = next(batch_iterator)
```

where batch_iterator is a dataloader.
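One hedged alternative sketch, not Streaming's API: let only the first rank of each tensor-parallel group own the dataloader, and broadcast each batch to its peers, similar in spirit to Megatron's broadcast_data. The helper below is hypothetical and assumes every rank already knows the batch shape and dtype:

```python
import torch
import torch.distributed as dist

def next_tp_batch(batch_iterator, tp_group, tp_src_rank, shape, dtype, device):
    """Fetch a batch on the group's source rank and broadcast it to peers.

    tp_group / tp_src_rank would come from your Megatron initialization,
    e.g. parallel_state.get_tensor_model_parallel_group(). A real version
    would broadcast shape/dtype metadata first, as Megatron does.
    """
    if dist.get_rank() == tp_src_rank:
        batch = next(batch_iterator).to(device)
    else:
        batch = torch.empty(shape, dtype=dtype, device=device)
    dist.broadcast(batch, src=tp_src_rank, group=tp_group)
    return batch
```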
cc: @karan6181
@snarayan21 Looks like this is being addressed. Is that right?
I would like to know if there is any example of Megatron integration.
Is tensor parallelism / pipeline parallelism currently supported?