Open suchig opened 2 years ago
@suchig, thanks for this report. Can you please profile and share the latency of forward, backward, and optimizer passes? You can add "wall_clock_breakdown": true
to your json config to profile the DeepSpeed run.
Here are a few of the log entries after setting wall_clock_breakdown:

[INFO|trainer.py:1013] 2021-10-08 20:03:14,331 >> Running training
[INFO|trainer.py:1014] 2021-10-08 20:03:14,332 >> Num examples = 150
[INFO|trainer.py:1015] 2021-10-08 20:03:14,332 >> Num Epochs = 53
[INFO|trainer.py:1016] 2021-10-08 20:03:14,332 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1017] 2021-10-08 20:03:14,332 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1018] 2021-10-08 20:03:14,332 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1019] 2021-10-08 20:03:14,332 >> Total optimization steps = 1000
0/1000 [00:00<?, ?it/s][2021-10-08 20:03:15,649] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 435.69 | backward_microstep: 744.13 | backward_inner_microstep: 513.94 | backward_allreduce_microstep: 230.14 | step_microstep: 1.00
[2021-10-08 20:03:15,649] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 435.67 | backward: 744.11 | backward_inner: 513.92 | backward_allreduce: 230.11 | step: 0.98
7/1000 [00:08<19:57, 1.21s/it][2021-10-08 20:03:23,987] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.37 | backward_microstep: 744.88 | backward_inner_microstep: 514.43 | backward_allreduce_microstep: 230.40 | step_microstep: 1.21
[2021-10-08 20:03:23,987] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.34 | backward: 744.86 | backward_inner: 514.41 | backward_allreduce: 230.38 | step: 1.19
8/1000 [00:09<19:49, 1.20s/it][2021-10-08 20:03:25,194] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.65 | backward_microstep: 760.41 | backward_inner_microstep: 514.66 | backward_allreduce_microstep: 245.70 | step_microstep: 1.14
501/1000 [15:20<28:26, 3.42s/it][2021-10-08 20:18:37,138] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 79.70 | optimizer_step: 572.27
[2021-10-08 20:18:37,139] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.37 | backward_microstep: 752.95 | backward_inner_microstep: 514.47 | backward_allreduce_microstep: 238.44 | step_microstep: 660.55
[2021-10-08 20:18:37,139] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.34 | backward: 752.93 | backward_inner: 514.44 | backward_allreduce: 238.42 | step: 660.52
502/1000 [15:22<24:29, 2.95s/it][2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 79.11 | optimizer_step: 577.47
[2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.16 | backward_microstep: 766.42 | backward_inner_microstep: 514.57 | backward_allreduce_microstep: 251.80 | step_microstep: 665.47
[2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.14 | backward: 766.40 | backward_inner: 514.55 | backward_allreduce: 251.78 | step: 665.45
503/1000 [15:24<21:45, 2.63s/it][2021-10-08 20:18:40,855] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.87 | optimizer_gradients: 76.92 | optimizer_step: 576.46
[2021-10-08 20:18:40,856] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.88 | backward_microstep: 746.24 | backward_inner_microstep: 514.67 | backward_allreduce_microstep: 231.52 | step_microstep: 662.61
991/1000 [30:26<00:16, 1.83s/it][2021-10-08 20:33:42,880] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 75.44 | optimizer_step: 565.35
[2021-10-08 20:33:42,881] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.61 | backward_microstep: 746.57 | backward_inner_microstep: 516.68 | backward_allreduce_microstep: 229.84 | step_microstep: 649.29
[2021-10-08 20:33:42,881] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.59 | backward: 746.55 | backward_inner: 516.65 | backward_allreduce: 229.82 | step: 649.27
992/1000 [30:28<00:14, 1.83s/it][2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 82.71 | optimizer_step: 589.82
[2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.75 | backward_microstep: 743.88 | backward_inner_microstep: 514.41 | backward_allreduce_microstep: 229.43 | step_microstep: 681.02
[2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.72 | backward: 743.87 | backward_inner: 514.39 | backward_allreduce: 229.41 | step: 681.00
993/1000 [30:30<00:12, 1.84s/it][2021-10-08 20:33:46,590] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 75.56 | optimizer_step: 583.36
[2021-10-08 20:33:46,591] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.50 | backward_microstep: 746.20 | backward_inner_microstep: 515.28 | backward_allreduce_microstep: 230.87 | step_microstep: 667.48
[2021-10-08 20:33:54,064] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.54 | backward: 746.87 | backward_inner: 514.92 | backward_allreduce: 231.88 | step: 678.25
998/1000 [30:39<00:03, 1.86s/it][2021-10-08 20:33:55,903] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.85 | optimizer_gradients: 78.16 | optimizer_step: 574.00
[2021-10-08 20:33:55,904] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.30 | backward_microstep: 740.36 | backward_inner_microstep: 514.58 | backward_allreduce_microstep: 225.73 | step_microstep: 661.18
[2021-10-08 20:33:55,904] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.27 | backward: 740.34 | backward_inner: 514.56 | backward_allreduce: 225.71 | step: 661.16
999/1000 [30:41<00:01, 1.86s/it][2021-10-08 20:33:57,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 4.67 | optimizer_gradients: 75.40 | optimizer_step: 568.76
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=1000, skipped=17, lr=[9.444444444444445e-07, 9.444444444444445e-07], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-10-08 20:33:57,735] [INFO] [timer.py:160:stop] 0/1000, SamplesPerSec=4.6612122376961915
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.21 | backward_microstep: 739.09 | backward_inner_microstep: 514.53 | backward_allreduce_microstep: 224.51 | step_microstep: 654.23
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.18 | backward: 739.07 | backward_inner: 514.51 | backward_allreduce: 224.49 | step: 654.21
{'loss': 0.1628, 'learning_rate': 9.444444444444445e-07, 'epoch': 52.63}
@suchig, thanks. Can you please share a similar profiling log for the non-DeepSpeed case so we can compare the numbers?
Thanks. Is it fine if I run it with Python's cProfile? That gives a bunch of stats, and I am unsure how to filter out the backward, forward, and optimizer times from that.
I looked at the PyTorch profiler. Unfortunately, the code is from the UniLM examples, and I am unclear where exactly the model is called, since it in turn uses the Trainer from Transformers.
@suchig, no worries. A thought just crossed my mind: since you can run this model without DeepSpeed, you probably don't need the ZeRO optimization. Can you remove the zero_optimization section from your ds config?
I am using the below config. If I remove zero_optimization, I am not sure what will remain.
{ "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "reduce_scatter": true, "reduce_bucket_size": 2e8, "overlap_comm": true, "contiguous_gradients": true } }
It is okay to just leave "wall_clock_breakdown": true in the config.
@suchig, ZeRO optimization is just one of the features in the DeepSpeed library, and it is possible to use DeepSpeed without enabling ZeRO. In particular, ZeRO is designed for training extremely large models with hundreds of billions of parameters; for smaller models, its memory optimization techniques can hurt throughput. This is why I suggested disabling ZeRO, since the model fits without DeepSpeed. Running with a config like the one below will be useful for seeing the forward/backward/optimizer latencies without ZeRO:
{
"wall_clock_breakdown": true
}
Thanks for the response. Below is the final step with statistics.

[2021-10-10 20:51:02,182] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.13 | backward: 525.67 | backward_inner: 506.66 | backward_allreduce: 18.94 | step: 353.33
{'loss': 0.1575, 'learning_rate': 5.555555555555556e-08, 'epoch': 52.63}
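Summing those phase latencies gives roughly 316 + 526 + 353 ≈ 1.2 s per step without ZeRO, versus roughly 316 + 744 + 660 ≈ 1.7 s per step in the earlier ZeRO-offload run; almost the entire gap is in the optimizer step and the allreduce portion of the backward pass.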
Observation:
Thanks for sharing these results. I will address your observations with a view to helping you determine whether DeepSpeed is the best option for you. But first, let me restate that ZeRO is just one of the many features of the DeepSpeed library. You can see a list of DeepSpeed features here.
I am interested in investigating further. I have given the output without ZeRO offload. What else should I try to get to the root of this?
This model is a trial for us. We are interested in expanding to other model architectures and datasets, and we could scale this to our real-world dataset or other use cases if we see a benefit.
The next step would be to profile the non-DeepSpeed run to obtain the forward/backward/optimizer latencies and compare them against the non-ZeRO DeepSpeed runs. Do you know how to obtain these latencies?
I was thinking of using torch.autograd.profiler during training, or maybe cProfile, which will give a ton of output that I would have to visualize in some fashion.
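For example, a minimal sketch of profiling one training step this way (model, batch, and optimizer here stand in for the actual training objects; use_cuda=True makes the profiler record GPU kernel time):

import torch
from torch.autograd import profiler

# profile one full training step: forward, backward, optimizer
with profiler.profile(use_cuda=True) as prof:
    loss = model(**batch).loss  # forward
    loss.backward()             # backward
    optimizer.step()            # optimizer
    optimizer.zero_grad()

# aggregate by op name and sort by total GPU time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))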
Sounds good.
Forward timer: I could add timers before and after compute_loss in the training step.
Backward timer: I could add timers before and after loss.backward in the training step.
Optimizer timer: I am not sure.
Please let me know if this would work.
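A rough sketch of what I mean, assuming the training_step(model, inputs) signature of the Transformers Trainer (this simplified override ignores fp16 and gradient accumulation; torch.cuda.synchronize() keeps the GPU work inside the measured window):

import time
import torch
from transformers import Trainer

class TimedTrainer(Trainer):
    def training_step(self, model, inputs):
        model.train()
        inputs = self._prepare_inputs(inputs)

        torch.cuda.synchronize()
        t0 = time.time()
        loss = self.compute_loss(model, inputs)  # forward
        torch.cuda.synchronize()
        t1 = time.time()
        loss.backward()                          # backward
        torch.cuda.synchronize()
        t2 = time.time()

        print(f"forward: {(t1 - t0) * 1e3:.2f} ms | backward: {(t2 - t1) * 1e3:.2f} ms")
        return loss.detach()

For the optimizer, Trainer calls optimizer.step() in its outer training loop, so one option would be to wrap the optimizer's step method with a similar timer before training starts.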
Hi,
We are trying DeepSpeed with UniLM using the FUNSD example provided here - https://github.com/microsoft/unilm/tree/master/layoutlmv2
Training finishes in about 18 minutes without DeepSpeed on a single 16GB GPU machine. When I use DeepSpeed with the configuration below, it takes 30+ minutes on the same machine. I tried adding an optimizer and scheduler with different learning rates, etc., and I tried different bucket sizes as well, but that did not make any difference.
Can anyone help with this?
Config: { "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "reduce_scatter": true, "reduce_bucket_size": 2e8, "overlap_comm": true, "contiguous_gradients": true } }
Versions:
PyTorch: 1.8.0+cu111
DeepSpeed: 5.3
Transformers: 4.5.3 (version that UniLM was bundled with)