CUDA Graphs + ZeRO-3 / TP+PP

microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

https://www.deepspeed.ai/

Apache License 2.0


CUDA Graphs + ZeRO-3 / TP+PP #6552

Open jeromeku opened 2 weeks ago

jeromeku commented 2 weeks ago

Are there examples of using CUDA Graphs with a) ZeRO-3 / ZeRO-3++ or b) MP / PP?

I'm not sure whether the first case would be beneficial, or even possible, given the device syncs that parameter sharding introduces.

tohtana commented 6 days ago

Hi @jeromeku, CUDA Graphs require every replay to use the same memory regions that were used at capture time. ZeRO-3 repeatedly gathers sharded parameters and then frees their memory, so it cannot run with CUDA Graphs. PP might work with CUDA Graphs in theory, but our PP engine relies heavily on PyTorch's dynamic memory allocation, so I don't think enabling CUDA Graphs there is realistic.
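To illustrate the "same memory region" constraint, here is a minimal sketch in plain PyTorch. It is not DeepSpeed code, and the names `capture_and_replay`, `static_in`, and `weight` are made up for this example. It assumes a CUDA-capable GPU and returns `None` without one. The graph records kernels that read and write specific buffers, so new data has to be copied into those buffers in place before each replay.

```python
import torch


def capture_and_replay():
    """Return True if a graph replay matches eager execution, None without CUDA."""
    if not torch.cuda.is_available():
        return None  # CUDA Graphs need a GPU

    device = torch.device("cuda")
    # Hypothetical "parameter": it must stay allocated at the same address
    # for as long as the captured graph is replayed.
    weight = torch.randn(64, device=device)
    # Fixed input buffer that every replay will read from.
    static_in = torch.zeros(8, 64, device=device)

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        for _ in range(3):
            static_in * weight
    torch.cuda.current_stream().wait_stream(side)

    # Capture: the kernels are recorded together with the addresses of
    # static_in, weight, and the output tensor allocated during capture.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = static_in * weight

    # Replay with new data: refill the captured buffer in place.
    # Rebinding static_in to a fresh tensor would not affect the graph.
    new_batch = torch.randn(8, 64, device=device)
    static_in.copy_(new_batch)
    graph.replay()
    torch.cuda.synchronize()

    return torch.allclose(static_out, new_batch * weight)


if __name__ == "__main__":
    print(capture_and_replay())
```

In this sketch `weight` lives for the whole lifetime of the graph. Under ZeRO-3 the gathered parameter is released after use and gathered again later, so the buffer the graph captured is not guaranteed to still hold the parameter at replay time, which is the conflict described above.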