Open Vadim2S opened 1 year ago
@Vadim2S, DeepSpeed has many configurable parameters and comes in various versions for different optimization needs, so some further investigation may help identify the issue you're reporting here. 1. You can use a profiler to find the performance bottlenecks and get a detailed performance report for the 1-GPU and 2-GPU runs. To this end, you can follow the instructions here: https://www.deepspeed.ai/tutorials/pytorch-profiler/ 2. The DeepSpeed autotuning tool might help you find the best configuration. Instructions on how to use it are here: https://www.deepspeed.ai/tutorials/autotuning/ Thanks.
Closing, as there has been no further input from users. Will reopen if requested.
Hi, I'm experiencing the same problem. May I ask how you solved it?
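Before setting up the full PyTorch profiler from the tutorial above, a coarse per-phase timer can already show whether the time goes to data loading, forward/backward, or the optimizer step. The sketch below is not the PyTorch profiler and not a DeepSpeed API: `PhaseTimer` and the phase names are hypothetical, and the `time.sleep` calls only stand in for real training work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class PhaseTimer:
    """Accumulates wall-clock time per named phase of a training loop (hypothetical helper)."""

    def __init__(self, sync=None):
        # `sync` is an optional callable run before each clock read.
        # On GPU, pass something like torch.cuda.synchronize so that
        # asynchronous kernels are finished before the time is taken.
        self.sync = sync
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def phase(self, name):
        if self.sync is not None:
            self.sync()
        start = time.perf_counter()
        try:
            yield
        finally:
            if self.sync is not None:
                self.sync()
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def shares(self):
        """Return each phase's fraction of the total measured time."""
        total = sum(self.totals.values()) or 1.0
        return {name: secs / total for name, secs in self.totals.items()}


# Stand-in training loop: the sleeps replace real work.
timer = PhaseTimer()
for _ in range(3):
    with timer.phase("data"):
        time.sleep(0.002)   # e.g. next(batch_iterator)
    with timer.phase("forward_backward"):
        time.sleep(0.004)   # e.g. loss = model(batch); backward
    with timer.phase("optimizer_step"):
        time.sleep(0.001)   # e.g. optimizer / engine step

for name, share in sorted(timer.shares().items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {timer.totals[name]:.4f}s  {share:6.1%}")
```

Running the same loop on 1 GPU and on 2 GPUs and comparing the shares indicates which phase fails to shrink, which narrows down what to inspect in the detailed profiler trace.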
+1
+1
@Mars2018, @slchenchn, @gray311 can you please provide repro steps? Thanks!
I am trying DeepSpeed. I read the docs and modified one project to use it.
And I get a strange result:
1) Original code without any speedup. 1 Docker container. 1 GPU. 10 epochs. Time: 5 min 50 sec. One epoch time: 30 sec. GPU0: 11378 MB, GPU1: 0 MB.
2) DeepSpeed code. 1 Docker container. 1 GPU. 10 epochs. Time: 6 min 29 sec. One epoch time: 34 sec. GPU0: 11490 MB, GPU1: 0 MB.
3) DeepSpeed code. 1 Docker container. 2 GPUs. 10 epochs. Time: 6 min 25 sec. One epoch time: 33 sec. GPU0: 11386 MB, GPU1: 11386 MB.
4) DeepSpeed code. 2 Docker containers (on the same computer). 1 GPU per container. 10 epochs. Time: 6 min 28 sec. One epoch time: 33 sec. GPU0: 11384 MB, GPU1: 11384 MB.
5) PyTorch.DDP code. 1 Docker container. 2 GPUs. 10 epochs. Time: 3 min 28 sec. One epoch time: 18 sec. GPU0: 12203 MB, GPU1: 11422 MB.
6) PyTorch.DDP code. 2 Docker containers (on the same computer). 1 GPU per container. 10 epochs. Time: 3 min 29 sec. One epoch time: 18 sec. GPU0: 11422 MB, GPU1: 11422 MB.
As you can see, DeepSpeed uses both GPUs but is slower than single-GPU training. PyTorch.DDP works as expected.
My model is simple and I do not use the Transformer framework. I do not expect a very large speedup from DeepSpeed in my case. I am just wondering: am I missing something? A preference, property, config option, line of code, etc.? Some mistake?
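To put the six runs on one scale, the total times can be converted to a speedup relative to run 1. This is only arithmetic on the timings listed in this post; the run labels are shorthand introduced here.

```python
# Total wall-clock time for 10 epochs, in seconds, taken from runs 1-6 in this post.
total_seconds = {
    "original_1c_1gpu": 5 * 60 + 50,   # run 1 (baseline)
    "deepspeed_1c_1gpu": 6 * 60 + 29,  # run 2
    "deepspeed_1c_2gpu": 6 * 60 + 25,  # run 3
    "deepspeed_2c_1gpu": 6 * 60 + 28,  # run 4
    "ddp_1c_2gpu": 3 * 60 + 28,        # run 5
    "ddp_2c_1gpu": 3 * 60 + 29,        # run 6
}

baseline = total_seconds["original_1c_1gpu"]

# Speedup > 1 means faster than the single-GPU original; < 1 means slower.
speedups = {name: baseline / secs for name, secs in total_seconds.items()}

for name, s in speedups.items():
    print(f"{name:20s} {total_seconds[name]:4d}s  speedup {s:.2f}x")
```

Both DDP runs come out at roughly 1.7x, while all three DeepSpeed runs stay below 1x regardless of whether one or two GPUs are used.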
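One thing worth ruling out, as a guess only since the script is not shown here: an epoch time that does not change between 1 and 2 GPUs is what one would see if every rank iterates over the full dataset instead of its own shard (for example, a plain `DataLoader` without a `DistributedSampler`). The sketch below is pure arithmetic with hypothetical numbers (`dataset_size`, `micro_batch`); it does not call DeepSpeed or PyTorch.

```python
import math


def samples_per_rank(dataset_size, world_size, sharded):
    """Samples each rank processes per epoch.

    sharded=True mimics a distributed sampler that splits the dataset
    evenly across ranks (rounding up); sharded=False means every rank
    sees the whole dataset.
    """
    if sharded:
        return math.ceil(dataset_size / world_size)
    return dataset_size


def steps_per_epoch(dataset_size, world_size, micro_batch, sharded):
    """Number of iterations each rank runs per epoch."""
    return math.ceil(samples_per_rank(dataset_size, world_size, sharded) / micro_batch)


# Hypothetical values, chosen only for illustration.
dataset_size = 10_000
micro_batch = 32

single_gpu = steps_per_epoch(dataset_size, 1, micro_batch, sharded=True)
two_gpu_sharded = steps_per_epoch(dataset_size, 2, micro_batch, sharded=True)
two_gpu_unsharded = steps_per_epoch(dataset_size, 2, micro_batch, sharded=False)

print("1 GPU:", single_gpu, "steps per epoch")
print("2 GPUs, sharded:", two_gpu_sharded, "steps per epoch per rank")
print("2 GPUs, not sharded:", two_gpu_unsharded, "steps per epoch per rank")
```

A quick way to check this in a real run is to print the length of the training data loader on each rank for the 1-GPU and 2-GPU cases: if the numbers are the same, the data is not being split across ranks.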
Here is my script:
Here is the code (slightly stripped for readability):
Here is my DeepSpeed config:
Here is my Rank 0 node log:
Here is my Rank 1 node log: