Open SumanthRH opened 1 month ago
+1
@tjruwase would be great if someone from the team can take a look at this!
+1
Hi all, thank you for reporting the issue. It seems only the forward time has changed with the newer PyTorch and CUDA, which suggests that allgather is slower.
Can you try this to narrow down the cause of the issue? Set `stage3_param_persistence_threshold` large enough to include all the parameters. If the difference between the two environments disappears with that setting, the slowdown is from allgather.
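For concreteness, a minimal sketch of that suggestion in the ZeRO-3 JSON config (the threshold value here is an arbitrary large number, an assumption on my part; pick any value that exceeds your model's total parameter count). Parameters below the threshold are kept persistent on each GPU instead of being partitioned, so setting it very high effectively disables the allgather path:

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_param_persistence_threshold": 1e9
  }
}
```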
Describe the bug For ZeRO-3, I'm noticing an increase in training times on g5.48xlarge nodes with torch >= 2.3.1 and CUDA 12.1. I can reproduce this with small and large models, and in some cases it is a 1.5x slowdown (observed with Llama-2-13b at an 8192 context length).
I've had some difficulty reproducing this with a smaller number of devices (for example, 2 GPUs vs 8). I also can't reproduce it on 4xA100 nodes, so I suspect some hardware dependence.
To Reproduce I have a basic script with accelerate + deepspeed. I've basically patched some profiling code into one of the example scripts from HuggingFace. The script below is for a 500M Llama model, but you can repro with 7B or 13B as well.
My accelerate config is:
My Zero-3 config is:
Command to run
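Since the full script isn't inlined above, here is a minimal sketch of the kind of per-phase timing it adds (the class and phase names are illustrative, not from the actual script, which uses the HuggingFace example plus the torch profiler):

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock time per training phase (e.g. forward, backward)."""
    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # For real GPU measurements you would call torch.cuda.synchronize()
        # before reading the clock; omitted to keep the sketch dependency-free.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

timer = PhaseTimer()
for _ in range(3):
    with timer.phase("forward"):
        time.sleep(0.005)   # stand-in for model(batch)
    with timer.phase("backward"):
        time.sleep(0.005)   # stand-in for loss.backward()
```

Comparing `timer.totals` across the two environments is what surfaces the forward-pass regression.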
Expected behavior Training time is expected to stay the same or improve with the upgrade.
ds_report output The two environments I've tested with:
The newer environment: torch 2.3.1 + CUDA 12.1:
The older environment: torch 2.0.1 + CUDA 11.8:
Screenshots
I can add more screenshots if needed, but the training times look as follows for the 500M model used in the script:
With profiling:
With profiling:
Note that these training times are small because of the small model and context length used. You can increase the context length, increase the model size, or switch to full-parameter training and you should still be able to reproduce this (and then the slowdown becomes less tolerable). Please let me know if you can't.
System info (please complete the following information):
Launcher context HuggingFace Accelerate
Additional context I can repro with the latest DeepSpeed version, 0.14.4, as well. I've provided a minimal repro with the profiler so that it's easy to dive into the specifics of what's happening.