microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Performance degradation after using DeepSpeed (UniLM/layoutlmv2) #1436

Open suchig opened 2 years ago

suchig commented 2 years ago

Hi,

We are trying DeepSpeed with UniLM, using the FUNSD example provided here: https://github.com/microsoft/unilm/tree/master/layoutlmv2

It takes about 18 minutes to finish training without DeepSpeed on a single-GPU (16 GB) machine. When I use DeepSpeed with the configuration below, it took 30+ minutes on the same machine. I tried adding an optimizer and scheduler with different learning rates, etc., and I tried different bucket sizes as well, but that did not make any difference.

Can anyone help with this?

Config:

{
  "zero_optimization": {
    "stage": 2,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "allgather_partitions": true,
    "allgather_bucket_size": 2e8,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "overlap_comm": true,
    "contiguous_gradients": true
  }
}

Versions:

- PyTorch: 1.8.0+cu111
- DeepSpeed: 5.3
- Transformers: 4.5.3 (the version UniLM is bundled with)

tjruwase commented 2 years ago

@suchig, thanks for this report. Can you please profile and share the latency of forward, backward, and optimizer passes? You can add "wall_clock_breakdown": true to your json config to profile the DeepSpeed run.

suchig commented 2 years ago

> @suchig, thanks for this report. Can you please profile and share the latency of forward, backward, and optimizer passes? You can add "wall_clock_breakdown": true to your json config to profile the DeepSpeed run.

Here are a few of the log entries after setting wall_clock_breakdown:

[INFO|trainer.py:1013] 2021-10-08 20:03:14,331 >> Running training
[INFO|trainer.py:1014] 2021-10-08 20:03:14,332 >> Num examples = 150
[INFO|trainer.py:1015] 2021-10-08 20:03:14,332 >> Num Epochs = 53
[INFO|trainer.py:1016] 2021-10-08 20:03:14,332 >> Instantaneous batch size per device = 8
[INFO|trainer.py:1017] 2021-10-08 20:03:14,332 >> Total train batch size (w. parallel, distributed & accumulation) = 8
[INFO|trainer.py:1018] 2021-10-08 20:03:14,332 >> Gradient Accumulation steps = 1
[INFO|trainer.py:1019] 2021-10-08 20:03:14,332 >> Total optimization steps = 1000
0/1000 [00:00<?, ?it/s][2021-10-08 20:03:15,649] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 435.69 | backward_microstep: 744.13 | backward_inner_microstep: 513.94 | backward_allreduce_microstep: 230.14 | step_microstep: 1.00
[2021-10-08 20:03:15,649] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 435.67 | backward: 744.11 | backward_inner: 513.92 | backward_allreduce: 230.11 | step: 0.98
7/1000 [00:08<19:57, 1.21s/it][2021-10-08 20:03:23,987] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.37 | backward_microstep: 744.88 | backward_inner_microstep: 514.43 | backward_allreduce_microstep: 230.40 | step_microstep: 1.21
[2021-10-08 20:03:23,987] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.34 | backward: 744.86 | backward_inner: 514.41 | backward_allreduce: 230.38 | step: 1.19
8/1000 [00:09<19:49, 1.20s/it][2021-10-08 20:03:25,194] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.65 | backward_microstep: 760.41 | backward_inner_microstep: 514.66 | backward_allreduce_microstep: 245.70 | step_microstep: 1.14

501/1000 [15:20<28:26, 3.42s/it][2021-10-08 20:18:37,138] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 79.70 | optimizer_step: 572.27
[2021-10-08 20:18:37,139] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.37 | backward_microstep: 752.95 | backward_inner_microstep: 514.47 | backward_allreduce_microstep: 238.44 | step_microstep: 660.55
[2021-10-08 20:18:37,139] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.34 | backward: 752.93 | backward_inner: 514.44 | backward_allreduce: 238.42 | step: 660.52
502/1000 [15:22<24:29, 2.95s/it][2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.27 | optimizer_gradients: 79.11 | optimizer_step: 577.47
[2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.16 | backward_microstep: 766.42 | backward_inner_microstep: 514.57 | backward_allreduce_microstep: 251.80 | step_microstep: 665.47
[2021-10-08 20:18:39,008] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.14 | backward: 766.40 | backward_inner: 514.55 | backward_allreduce: 251.78 | step: 665.45
503/1000 [15:24<21:45, 2.63s/it][2021-10-08 20:18:40,855] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.87 | optimizer_gradients: 76.92 | optimizer_step: 576.46
[2021-10-08 20:18:40,856] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.88 | backward_microstep: 746.24 | backward_inner_microstep: 514.67 | backward_allreduce_microstep: 231.52 | step_microstep: 662.61

991/1000 [30:26<00:16, 1.83s/it][2021-10-08 20:33:42,880] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.33 | optimizer_gradients: 75.44 | optimizer_step: 565.35
[2021-10-08 20:33:42,881] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.61 | backward_microstep: 746.57 | backward_inner_microstep: 516.68 | backward_allreduce_microstep: 229.84 | step_microstep: 649.29
[2021-10-08 20:33:42,881] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.59 | backward: 746.55 | backward_inner: 516.65 | backward_allreduce: 229.82 | step: 649.27
992/1000 [30:28<00:14, 1.83s/it][2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 82.71 | optimizer_step: 589.82
[2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.75 | backward_microstep: 743.88 | backward_inner_microstep: 514.41 | backward_allreduce_microstep: 229.43 | step_microstep: 681.02
[2021-10-08 20:33:44,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.72 | backward: 743.87 | backward_inner: 514.39 | backward_allreduce: 229.41 | step: 681.00
993/1000 [30:30<00:12, 1.84s/it][2021-10-08 20:33:46,590] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.29 | optimizer_gradients: 75.56 | optimizer_step: 583.36
[2021-10-08 20:33:46,591] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.50 | backward_microstep: 746.20 | backward_inner_microstep: 515.28 | backward_allreduce_microstep: 230.87 | step_microstep: 667.48
[2021-10-08 20:33:54,064] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.54 | backward: 746.87 | backward_inner: 514.92 | backward_allreduce: 231.88 | step: 678.25
998/1000 [30:39<00:03, 1.86s/it][2021-10-08 20:33:55,903] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 3.85 | optimizer_gradients: 78.16 | optimizer_step: 574.00
[2021-10-08 20:33:55,904] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.30 | backward_microstep: 740.36 | backward_inner_microstep: 514.58 | backward_allreduce_microstep: 225.73 | step_microstep: 661.18
[2021-10-08 20:33:55,904] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.27 | backward: 740.34 | backward_inner: 514.56 | backward_allreduce: 225.71 | step: 661.16
999/1000 [30:41<00:01, 1.86s/it][2021-10-08 20:33:57,734] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | optimizer_allgather: 4.67 | optimizer_gradients: 75.40 | optimizer_step: 568.76
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] step=1000, skipped=17, lr=[9.444444444444445e-07, 9.444444444444445e-07], mom=[(0.9, 0.999), (0.9, 0.999)]
[2021-10-08 20:33:57,735] [INFO] [timer.py:160:stop] 0/1000, SamplesPerSec=4.6612122376961915
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward_microstep: 316.21 | backward_microstep: 739.09 | backward_inner_microstep: 514.53 | backward_allreduce_microstep: 224.51 | step_microstep: 654.23
[2021-10-08 20:33:57,735] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.18 | backward: 739.07 | backward_inner: 514.51 | backward_allreduce: 224.49 | step: 654.21
{'loss': 0.1628, 'learning_rate': 9.444444444444445e-07, 'epoch': 52.63}

tjruwase commented 2 years ago

@suchig, thanks. Can you please share a similar profiling log for the non-DeepSpeed case so we can compare the numbers?

suchig commented 2 years ago

Thanks. Is it fine if I run it with Python's cProfile? That gives a bunch of stats, and I am unsure how to filter out the backward, forward, and optimizer timings from it.

I looked at the PyTorch profiler. Unfortunately, the code comes from the UniLM examples, and I am unclear where exactly the model is called, since it in turn uses the Trainer from Transformers.

tjruwase commented 2 years ago

@suchig, no worries. A thought just crossed my mind: since you can run this model without DeepSpeed, you probably don't need ZeRO optimization. Can you remove the zero_optimization section from your ds config?

suchig commented 2 years ago

I am using the config below. If I remove zero_optimization, I am not sure what will remain.

{ "zero_optimization": { "stage": 2, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "allgather_partitions": true, "allgather_bucket_size": 2e8, "reduce_scatter": true, "reduce_bucket_size": 2e8, "overlap_comm": true, "contiguous_gradients": true } }

tjruwase commented 2 years ago

It is okay to just leave "wall_clock_breakdown": true in the config.

tjruwase commented 2 years ago

@suchig, ZeRO optimization is just one of the features in the DeepSpeed library, and it is possible to use DeepSpeed without enabling ZeRO. In particular, ZeRO is designed for training extremely large models with hundreds of billions of parameters; for smaller models, its memory optimization techniques can hurt throughput. This is why I suggested disabling ZeRO, since the model already fits without DeepSpeed. Running with a config like the one below will show the forward/backward/optimizer latencies without ZeRO:

{
  "wall_clock_breakdown": true
}
suchig commented 2 years ago

Thanks for the response. Below is the final step with statistics:

[2021-10-10 20:51:02,182] [INFO] [logging.py:68:log_dist] [Rank 0] rank=0 time (ms) | forward: 316.13 | backward: 525.67 | backward_inner: 506.66 | backward_allreduce: 18.94 | step: 353.33
{'loss': 0.1575, 'learning_rate': 5.555555555555556e-08, 'epoch': 52.63}

Observation:

  1. backward_allreduce was consistently much lower than in the prior attempts.
  2. It took 21 minutes (more than the non-DeepSpeed run at 18 minutes, but less than the DeepSpeed run with ZeRO optimization at 30+ minutes).
  3. I am still unclear why this was slower than the non-DeepSpeed run. Is it because the architecture is not complex enough to benefit from DeepSpeed? Is it related to versions? Or is the dataset too simple?
  4. My understanding (and it could be wrong) is that 3D parallelism brings pipeline, data, and model parallelism through ZeRO optimization. So, if ZeRO optimization is removed, how does DeepSpeed help?
tjruwase commented 2 years ago

Thanks for sharing these results. I will address your observations with the goal of helping you determine whether DeepSpeed is the best option for you. But first, let me restate that ZeRO is just one of the many features of the DeepSpeed library. You can see a list of DeepSpeed features here.

  1. ZeRO is best used when the model size (e.g., multi-billion parameters) or batch size will not fit in the available GPU memory. The techniques that ZeRO uses to reduce memory consumption do make backward() slower, so it is expected that you are seeing a slower backward_allreduce with ZeRO enabled. Do you know the size of the model? (A quick way to check is sketched after this list.) Your results demonstrate that ZeRO is not appropriate for this model.
  2. DeepSpeed without ZeRO being 16% slower than non-DeepSpeed is a bit surprising. We can investigate further if you like.
  3. Similar to 2, it will require some investigation to figure out why non-DeepSpeed is faster in this case. We are certainly curious about it. Do keep in mind that our expectation is that DeepSpeed without any of the extra features should be roughly as fast as non-DeepSpeed. So I think the real question is whether you have other reasons for using DeepSpeed besides it being as fast as non-DeepSpeed, such as: do you plan to scale your model or batch size significantly later?
  4. No, 3D parallelism simply means combining pipeline, model, and data parallelism. 3D is orthogonal to ZeRO, since ZeRO is just a form of data parallelism that is compatible with 3D. ZeRO is a type of data parallelism that uses data partitioning (instead of data duplication) to avoid the memory overheads inherent in traditional data parallelism approaches.
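
If it helps, here is a minimal sketch for checking the model size. It assumes `model` is the LayoutLMv2 instance that the FUNSD script constructs before handing it to the Trainer; the helper name is just for illustration:

import torch.nn as nn

def count_parameters(model: nn.Module):
    # Gauge model size: total and trainable parameter counts.
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable

# Example (assumes `model` is already built by the FUNSD script):
# total, trainable = count_parameters(model)
# print(f"total={total / 1e6:.1f}M, trainable={trainable / 1e6:.1f}M")
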
suchig commented 2 years ago

I am interested in investigating further. I have shared the output from the run without ZeRO above. What else should I try to get to the root of this?

This model is a trial for us. We are interested in expanding to other model architectures and datasets, and we could scale this to our real-world dataset or other use cases if we see a benefit.

tjruwase commented 2 years ago

The next step would be to profile the non-DeepSpeed run to obtain the forward/backward/optimizer latencies and compare them against the non-ZeRO DeepSpeed runs. Do you know how to obtain these latencies?

suchig commented 2 years ago

I was thinking of using torch.autograd.profiler during training, or maybe cProfile, which gives a ton of output that I would have to visualize in some fashion.

tjruwase commented 2 years ago

@suchig, I think a profiler is overkill for this case, since we are interested in just the latencies of forward/backward/optimizer. I recommend a lighter-weight approach using either the Megatron or DeepSpeed timers.
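
To illustrate the idea, here is a minimal stand-in sketch, not the actual Megatron/DeepSpeed timer API: it synchronizes the GPU so pending kernels finish and accumulates wall-clock time per named region, which is essentially what those timers do.

import time
import torch

class WallClockTimer:
    # Minimal stand-in for the Megatron/DeepSpeed timers: synchronize the GPU
    # so pending kernels finish, then accumulate wall-clock time per name.
    def __init__(self):
        self.starts = {}
        self.totals = {}

    def start(self, name):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        self.starts[name] = time.perf_counter()

    def stop(self, name):
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - self.starts.pop(name)
        self.totals[name] = self.totals.get(name, 0.0) + elapsed

    def log(self):
        # Print accumulated times in milliseconds, mirroring the
        # "time (ms) | name: value" style of the DeepSpeed logs above.
        for name, total in self.totals.items():
            print(f"{name}: {total * 1000:.2f} ms")
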

suchig commented 2 years ago

Sounds good.

Forward timer: I could include timers before and after compute_loss in training_step.
Backward timer: I could include timers before and after loss.backward in training_step.
Optimizer timer: I am not sure.
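
A rough sketch of what I have in mind for the non-DeepSpeed run, assuming a hypothetical TimedTrainer subclass of the Transformers Trainer and the WallClockTimer stand-in sketched above (the exact compute_loss/training_step signatures can differ between Transformers versions):

from transformers import Trainer

class TimedTrainer(Trainer):
    # Hypothetical subclass that times forward, forward+backward, and
    # optimizer step using the WallClockTimer stand-in sketched earlier.
    def __init__(self, *args, timers=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.timers = timers

    def compute_loss(self, model, inputs, return_outputs=False):
        # Forward pass only.
        self.timers.start("forward")
        result = super().compute_loss(model, inputs, return_outputs=return_outputs)
        self.timers.stop("forward")
        return result

    def training_step(self, model, inputs):
        # training_step runs compute_loss plus loss.backward(), so the
        # backward time is roughly forward_backward minus forward.
        self.timers.start("forward_backward")
        loss = super().training_step(model, inputs)
        self.timers.stop("forward_backward")
        return loss

    def create_optimizer_and_scheduler(self, num_training_steps):
        # optimizer.step() happens inside Trainer.train(), so wrap it after
        # the optimizer is created instead of copying the training loop.
        super().create_optimizer_and_scheduler(num_training_steps)
        original_step = self.optimizer.step

        def timed_step(*args, **kwargs):
            self.timers.start("optimizer_step")
            out = original_step(*args, **kwargs)
            self.timers.stop("optimizer_step")
            return out

        self.optimizer.step = timed_step

The FUNSD run script would then construct TimedTrainer(..., timers=WallClockTimer()) instead of Trainer(...) and call timers.log() at the end of training.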

Please let me know if this would work.