Open jeromeku opened 2 weeks ago
Hi @jeromeku, CUDA Graphs require the same memory regions to be used on every replay. Because ZeRO-3 repeatedly gathers and releases the memory for sharded parameters, it cannot run with CUDA Graphs. Running PP with CUDA Graphs might be possible in theory, but we rely heavily on PyTorch's dynamic memory allocation, so I don't think it is realistic to enable CUDA Graphs with our PP engine.
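To illustrate the fixed-memory constraint described above, here is a minimal sketch using plain PyTorch (`torch.cuda.CUDAGraph` and `torch.cuda.graph`), not DeepSpeed code. The function name `replay_with_static_buffer` and the tensor names are made up for this example, and it assumes a CUDA-capable GPU; without one it simply returns `None`.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def replay_with_static_buffer():
    """Capture `out = inp * 2` once and replay it with different inputs."""
    if torch is None or not torch.cuda.is_available():
        return None  # assumption: the demo needs a CUDA device

    static_in = torch.ones(4, device="cuda")

    # Warm up on a side stream before capture, as the PyTorch docs recommend.
    side = torch.cuda.Stream()
    side.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side):
        static_out = static_in * 2
    torch.cuda.current_stream().wait_stream(side)

    # Capture: the graph records the addresses of static_in and static_out.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph):
        static_out = static_in * 2

    graph.replay()
    first = static_out.tolist()

    # New data must be copied INTO the captured buffer to be seen on replay.
    static_in.copy_(torch.full((4,), 3.0, device="cuda"))
    graph.replay()
    second = static_out.tolist()

    # A freshly allocated tensor (roughly what a re-gathered ZeRO-3
    # parameter would be) lives at another address; the graph ignores it.
    fresh_in = torch.full((4,), 10.0, device="cuda")
    graph.replay()
    third = static_out.tolist()  # still computed from static_in

    return first, second, third


if __name__ == "__main__":
    print(replay_with_static_buffer())
```

The third replay is the relevant case: allocating a new tensor does not change what the captured graph reads, so any scheme that frees and re-allocates parameter memory between steps would need to copy into fixed buffers or re-capture the graph.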
Are there examples of using CUDA Graphs with a) ZeRO-3 / ZeRO-3++ or b) MP / PP?
Not sure if the first case would be beneficial or even possible given device syncs from parameter sharding?