microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

Why does ZeRO-2 use more CUDA memory than ZeRO-1? #1302

Open dancingpipi opened 3 years ago

dancingpipi commented 3 years ago

Following the bing_bert tutorial, my deepspeed_config is:

{
  "train_batch_size": 4096,
  "train_micro_batch_size_per_gpu": 32,
  "steps_per_print": 1000,
  "prescale_gradients": false,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 6e-3,
      "betas": [
        0.9,
        0.99
      ],
      "eps": 1e-8,
      "weight_decay": 0.01
    }
  },

  "zero_optimization": {
    "stage": 1,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "grad_hooks": true,
    "round_robin_gradients": false
  },

  "scheduler": {
    "type": "WarmupLR",
    "params": {
        "warmup_min_lr": 1e-8,
        "warmup_max_lr": 6e-3
    }
  },
  "gradient_clipping": 1.0,

  "wall_clock_breakdown": false,

  "fp16": {
    "enabled": true,
    "loss_scale": 0
  },
  "sparse_attention": {
    "mode": "fixed",
    "block": 16,
    "different_layout_per_head": true,
    "num_local_blocks": 4,
    "num_global_blocks": 1,
    "attention": "bidirectional",
    "horizontal_global_attention": false,
    "num_different_global_patterns": 4
  }
}

The CUDA memory usage for stage 1 is 8900 MB per GPU; for stage 2 it is 9600 MB per GPU.

And the ZeRO-2 is much slower than ZeRO-1 in training speed.

Any help will be appreciated~

dancingpipi commented 3 years ago

Update: experiments with bert-large on 4x V100 (16GB):

Batch Size = 64          NVIDIA-BERT   ZeRO-0   ZeRO-1   ZeRO-2   ZeRO-3
CUDA Memory (MB)         OOM           15853    13509    13499    14237
Forward time (ms)        /             98.19    98.30    96.88    317.15
Backward time (ms)       /             186.42   185.42   900.62   600.45
Total time (ms)          /             284.63   283.78   997.53   917.63
Throughput (samples/s)   /             899.41   902.12   256.63   278.98

PS: backward = backward_inner + backward_allreduce:

         backward_inner (ms)   backward_allreduce (ms)
ZeRO-1   184.97                0.02
ZeRO-2   183.62                718.28
ZeRO-3   391.50                234.34
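
For reference, these forward / backward_inner / backward_allreduce timings come from DeepSpeed's built-in wall-clock timers, enabled via the "wall_clock_breakdown" flag (set to false in the config above). A minimal, illustrative sketch of turning them on, with a toy stand-in model rather than the bing_bert setup:

import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # stand-in model for illustration

ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "zero_optimization": {"stage": 2},
    "fp16": {"enabled": True},
    "wall_clock_breakdown": True,  # prints forward/backward_inner/backward_allreduce times
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)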

My questions:

  1. Why aren't ZeRO-2 and ZeRO-3 superior to ZeRO-1 in memory usage?
  2. Why is the ZeRO-2 backward pass slower than ZeRO-3's? To my knowledge, ZeRO-2 does not require additional communication.

thecooltechguy commented 3 years ago

I'm also seeing that Zero2 uses more memory than Zero1

dancingpipi commented 3 years ago

@thecooltechguy Have you also run into the problem of ZeRO-2 being slower than ZeRO-3?

tjruwase commented 3 years ago

@dancingpipi, thanks for the questions.

ZeRO is designed for very large models, > 1B parameters, that would not otherwise fit in available GPU memory. Similarly, the higher stages of ZeRO are meant for models that are too large for the lower stages. In summary, ZeRO's memory savings come at the cost of extra communication time and a (configurable) memory overhead for communication buffers. The answers to your specific questions are:

  1. All ZeRO stages have comparable memory usage because Bert-Large (~340M params) is smaller than 1B params, the communication buffers are GBs by default, and the data-parallelism degree (4) is quite small. Bert-Large is not a model that needs ZeRO.
  2. ZeRO-2 backward is slower because gradient partitioning occurs during the backward pass and that requires all-reduce communication.

Please see #467 for a discussion on tuning ZeRO memory consumption.
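
To make that tradeoff concrete, here is a back-of-the-envelope sketch of per-rank model-state memory based on the formulas in the ZeRO paper (mixed-precision Adam, K = 12 bytes of optimizer state per parameter). This is an illustrative helper, not a DeepSpeed API, and it ignores activations and communication buffers:

def zero_model_state_gb(params, dp_degree, stage):
    # Per-rank model-state memory (GB) under ZeRO, following the ZeRO paper.
    K = 12                                # fp32 master weights + momentum + variance
    p16, g16 = 2 * params, 2 * params     # fp16 parameters and gradients (bytes)
    opt = K * params                      # optimizer state (bytes)
    if stage == 0:
        total = p16 + g16 + opt
    elif stage == 1:
        total = p16 + g16 + opt / dp_degree
    elif stage == 2:
        total = p16 + (g16 + opt) / dp_degree
    else:                                 # stage 3
        total = (p16 + g16 + opt) / dp_degree
    return total / 1024**3

# Bert-Large (~340M params) on 4 GPUs: ~5.1, ~2.2, ~1.7, ~1.3 GB for stages 0-3.
# Stages 1-3 differ by only ~0.5 GB each, easily swallowed by GB-sized default buffers.
for stage in range(4):
    print(stage, round(zero_model_state_gb(340e6, 4, stage), 2))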

dancingpipi commented 3 years ago

@tjruwase Thanks for your answer! I now understand question 1, but for question 2: to my knowledge, ZeRO-3 also needs gradient partitioning and all-reduce during the backward pass. In addition, ZeRO-3 needs a parameter all-gather (or possibly broadcast) during the backward pass. So why ZeRO-3 is faster than ZeRO-2 still confuses me~

syorami commented 1 year ago

@tjruwase Hi, I'm using the vit_gigantic_patch14_224 model with around 1.8B parameters, but I still observe that ZeRO-3 doesn't outperform ZeRO-1 and ZeRO-2 in memory reduction. I'm wondering if this is related to some default settings used by DeepSpeed, because when using fairscale's ZeRO implementation, the reduced memory matches the calculations in the original paper.

tjruwase commented 1 year ago

@dancingpipi, I am not familiar with this model. If you want to analyze this together, can you please share the following:

  1. Observed memory usage for ZeRO stages 1, 2, and 3
  2. Training batch size
  3. The optimizer used for this model

syorami commented 1 year ago

@tjruwase Not sure if you are actually referring to me. vit_gigantic_patch14_224 is just a scaled-up ViT model for image classification with a standard transformer architecture. Here are the observed results:

                Memory (MB)
Baseline        60963
FP16            42251
ZeRO-1 + FP16   20201
ZeRO-2 + FP16   21152
ZeRO-3 + FP16   20208

I'm using the Adam optimizer with a total training batch size of 128 on 8 GPUs (16/GPU). With FP16 training, the fp16 gradients should occupy around 3.3 GB (1.8 × 10^9 params × 2 bytes / 1024^3). So ZeRO-2 should yield a ~2.9 GB memory reduction compared with ZeRO-1. However, memory increased instead. We also observe no memory reduction with ZeRO-3.
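
As a quick check of that arithmetic (illustrative Python, values approximate):

params = 1.8e9                            # vit_gigantic_patch14_224, ~1.8B parameters
fp16_grads_gb = params * 2 / 1024**3      # 2 bytes per fp16 gradient -> ~3.35 GB
world_size = 8
zero2_saving_gb = fp16_grads_gb * (world_size - 1) / world_size   # ~2.9 GB vs ZeRO-1
print(round(fp16_grads_gb, 2), round(zero2_saving_gb, 2))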

tjruwase commented 1 year ago

@syorami, apologies for my typo, I was referring to you :). Thanks for sharing these results.

tjruwase commented 1 year ago

@syorami, I suspect that the memory overhead of the intermediate buffers used for gradient and parameter partitioning is exceeding the savings. Specifically, compared to zero stage 1:

  1. ZeRO-2 should reduce fp16 gradient memory per rank to 0.4GB (3.3/8), which is ~2.9GB savings
  2. ZeRO-3 should reduce fp16 param + gradient memory per rank to 0.8GB (2*3.3/8), which is ~5.8GB savings

    The relevant intermediate buffers are configured by the knobs shown here.

Specifically, the primary knobs for ZeRO-2 are reduce_bucket_size and allgather_bucket_size, and for ZeRO-3 they are (additionally) stage3_max_live_parameters and stage3_max_reuse_distance. Please read the above document for more details.

In terms of next steps, could you please share the current values of these configuration knobs and then try reducing them by an order of magnitude. Please share any performance impact you notice as you modify these values. Thanks!
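
For example, the "order of magnitude" suggestion might look like the following (a hypothetical sketch expressed as Python config dicts; the exact values are starting points for a sweep, not recommendations):

zero2_config = {
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e7,       # down from the 5e8 default
        "allgather_bucket_size": 5e7,    # down from the 5e8 default
    }
}

zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "reduce_bucket_size": 1e6,             # shrink the reduction buffer tenfold
        "stage3_prefetch_bucket_size": 1e6,    # shrink the prefetch buffer tenfold
        "stage3_max_live_parameters": 1e8,     # keep fewer gathered params resident
        "stage3_max_reuse_distance": 1e8,      # release gathered params sooner
    }
}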

syorami commented 1 year ago

@tjruwase Thanks for the information. I'll give those knobs a try. I'm using the configs suggested in the docs. Here are the configs:

ZeRO-1:

{
    "zero_optimization": {
        "stage": 1,
        "reduce_bucket_size": 5e8
    }
}

ZeRO-2:

{
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": true,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8
    }
}

ZeRO-3 (offload disabled to better compare ZeRO's performance):

{
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": true,
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 1e7,
        "stage3_param_persistence_threshold": 1e5,
        "reduce_bucket_size": 1e7,
        "sub_group_size": 1e9
    }
}

tjruwase commented 1 year ago

@syorami, thanks for sharing. I think those config values explain your observations.

You might find the following analysis useful: https://github.com/microsoft/DeepSpeed/issues/467

I look forward to seeing your observations from playing with those config knobs. Thanks!

tjruwase commented 1 year ago

@syorami, below is a really nice memory requirements estimator that might be useful:

https://deepspeed.readthedocs.io/en/latest/memory.html
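
In recent DeepSpeed versions these estimators are also importable helpers, so something like the following should work for the model discussed here (import paths may vary by version; using timm for the model is an assumption):

import timm
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

model = timm.create_model("vit_gigantic_patch14_224")
# Prints estimated per-GPU / per-node memory for model states only (no activations).
estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=8, num_nodes=1)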

syorami commented 1 year ago

@tjruwase hi, here are my test results of ZeRO-2:

index   contiguous_gradients   overlap_comm   reduce_scatter   bucket_size   Memory (MB)
0       True                   True           True             5e8           21152
1       True                   True           False            5e8           21152
2       True                   False          True             5e8           20198
3       True                   True           True             1e8           19625
4       True                   True           True             1e7           19284
5       False                  True           True             5e8           19247
6       False                  False          True             1e7           19247

Index 0 is my previous ZeRO-2 baseline, and I vary several knobs individually in experiments 1-5. Some arguments, such as contiguous_gradients and bucket_size, do affect the memory saved. However, combining them (exp 6) did not lead to further savings. Maybe that's related to implementation details.

After playing with the knobs, the memory usage is much closer to the theoretical calculations.

tjruwase commented 1 year ago

@syorami, thanks for sharing your findings. It is good that the knobs helped a bit, but I agree that we should see ~2.9GB savings with ZeRO-2 compared to ZeRO-1. Do you want to investigate further? I am happy to suggest more profiling and investigative directions. Thanks!

syorami commented 1 year ago

@tjruwase Sure! I'd also like to share my results with fairscale's ZeRO implementation. It uses the same model and optimizer, although other settings differ slightly.

               Memory (GB)
AMP            41.63
ZeRO-1 + AMP   29.32
ZeRO-2 + AMP   16.97

For ZeRO-1, the optimizer states are sharded, and the theoretical reduction is around 1.8 × 4 × 2 × 7/8 = 12.6 GB; the actual reduction is 12.31 GB. For ZeRO-2, the gradients are additionally sharded, and the theoretical reduction is 1.8 × 4 × 7/8 = 6.3 GB. When replacing torch's native DDP with fairscale's ShardedDDP, the gradient buckets are no longer used (link), and that reduction equals the fp32 model size, 7.2 GB. The total theoretical reduction would be 6.3 + 7.2 = 13.5 GB, and the actual reduction of 12.35 GB still roughly matches the calculation.
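
The same arithmetic in Python (decimal GB, to match the numbers above):

params, n = 1.8e9, 8                          # parameter count, data-parallel degree
adam_states_gb = params * 4 * 2 / 1e9         # fp32 momentum + variance: 14.4 GB
zero1_saving_gb = adam_states_gb * (n - 1) / n        # 12.6 GB (observed: 12.31 GB)

fp32_grads_gb = params * 4 / 1e9              # 7.2 GB
zero2_grad_saving_gb = fp32_grads_gb * (n - 1) / n    # 6.3 GB
bucket_saving_gb = fp32_grads_gb              # DDP gradient buckets dropped by ShardedDDP
total_zero2_saving_gb = zero2_grad_saving_gb + bucket_saving_gb  # 13.5 GB (observed: 12.35 GB)
print(round(zero1_saving_gb, 1), round(total_zero2_saving_gb, 1))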

I think the features offered by DeepSpeed are entangled with each other, so we cannot observe the theoretical reductions.

kisseternity commented 1 year ago

Hello, when testing the llama 13B model on 8 A100 GPUs, I also find that ZeRO-3 is a bit faster than ZeRO-2, even at the same batch size. Any update on this issue? I think we should use profiling tools (such as NVIDIA Nsight Systems) for a deeper analysis. From a theoretical standpoint, ZeRO-3 should have additional communication compared to ZeRO-2, though that extra communication may be hidden by the prefetch optimization. But that still can't explain why ZeRO-3 is faster.
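
As a starting point, here is a hedged sketch of profiling a few steps with torch.profiler to compare NCCL kernel time (all-gather / reduce-scatter) between ZeRO-2 and ZeRO-3 runs; model_engine and data_loader are placeholders for an initialized DeepSpeed engine and loader, and the model is assumed to return a loss:

import torch
from torch.profiler import ProfilerActivity, profile

def profile_steps(model_engine, data_loader, num_steps=5):
    # Capture CPU + CUDA activity for a few training steps.
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for step, batch in enumerate(data_loader):
            if step >= num_steps:
                break
            loss = model_engine(batch)    # forward (assumes model returns the loss)
            model_engine.backward(loss)   # ZeRO communication happens in backward
            model_engine.step()
    # Communication kernels (nccl all_gather / reduce_scatter) show up near the top.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))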