dancingpipi opened this issue 3 years ago
Update: experiment for BERT-Large on 4x V100 (16GB)
Batch Size = 64 | NVIDIA-BERT | ZeRO-0 | ZeRO-1 | ZeRO-2 | ZeRO-3 |
---|---|---|---|---|---|
CUDA Memory (MB) | OOM | 15853 | 13509 | 13499 | 14237 |
Forward time (ms) | / | 98.19 | 98.3 | 96.88 | 317.15 |
Backward time (ms) | / | 186.42 | 185.42 | 900.62 | 600.45 |
Total time (ms) | / | 284.63 | 283.78 | 997.53 | 917.63 |
Throughput (samples/s) | / | 899.41 | 902.12 | 256.63 | 278.98 |
PS: backward = backward_inner + backward_allreduce
Stage | backward_inner (ms) | backward_allreduce (ms) |
---|---|---|
ZeRO-1 | 184.97 | 0.02 |
ZeRO-2 | 183.62 | 718.28 |
ZeRO-3 | 391.50 | 234.34 |
My questions:
- I'm also seeing that ZeRO-2 uses more memory than ZeRO-1.
- Have you met this problem: ZeRO-2 is slower than ZeRO-3?
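(For reference on how per-step numbers like those in the update above can be collected: the sketch below times forward and backward with CUDA events around a DeepSpeed engine. The `model_engine`, `batch`, and `global_batch_size` names are placeholders for whatever the training script already has; DeepSpeed's `wall_clock_breakdown` config option reports a similar forward/backward/allreduce breakdown in its logs.)

```python
import torch

def time_one_step(model_engine, batch, global_batch_size):
    # CUDA events give asynchronous GPU timings in milliseconds.
    fwd_start = torch.cuda.Event(enable_timing=True)
    fwd_end = torch.cuda.Event(enable_timing=True)
    bwd_end = torch.cuda.Event(enable_timing=True)

    fwd_start.record()
    loss = model_engine(batch)      # forward pass; assumed here to return a scalar loss
    fwd_end.record()
    model_engine.backward(loss)     # backward pass, incl. gradient reduction
    bwd_end.record()
    model_engine.step()             # optimizer step (not included in the timings below)

    torch.cuda.synchronize()
    fwd_ms = fwd_start.elapsed_time(fwd_end)
    bwd_ms = fwd_end.elapsed_time(bwd_end)
    samples_per_s = global_batch_size / ((fwd_ms + bwd_ms) / 1000.0)
    return fwd_ms, bwd_ms, samples_per_s
```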
@dancingpipi, thanks for the questions.
ZeRO is designed for very large models, > 1B parameters, that would not otherwise fit in available GPU memory. Similarly, the higher stages of ZeRO are meant for models that are too large for the lower stages. In summary, ZeRO memory savings come at the cost of extra communication time and the (configurable) memory overhead of communication buffers. The answers to your specific questions are:
- All ZeRO stages have comparable memory usage because Bert-Large (~340M params) is smaller than 1B, the communication buffers are GBs by default, and the data parallelism degree (4) is quite small. Bert-Large is not a model that needs ZeRO.
- ZeRO-2 backward is slower because gradient partitioning occurs during the backward pass, and that requires all-reduce communication.
Please see #467 for a discussion on tuning ZeRO memory consumption.
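As a rough illustration of why the stage-to-stage differences are small for Bert-Large at this scale (a back-of-the-envelope sketch using the ZeRO paper's 2/2/12 bytes-per-parameter accounting for fp16 params, fp16 gradients, and fp32 Adam states; it ignores activations and DeepSpeed's actual allocation behavior):

```python
params = 340e6   # Bert-Large, approximately
dp = 4           # data-parallel degree in the experiment above
GB = 1024 ** 3

fp16_params  = 2 * params
fp16_grads   = 2 * params
optim_states = 12 * params   # fp32 master params + momentum + variance

zero1 = (fp16_params + fp16_grads + optim_states / dp) / GB
zero2 = (fp16_params + (fp16_grads + optim_states) / dp) / GB
zero3 = (fp16_params + fp16_grads + optim_states) / dp / GB
buffer = 5e8 * 2 / GB        # one default-sized (5e8-element) fp16 communication buffer

print(f"ZeRO-1 ~{zero1:.2f} GB, ZeRO-2 ~{zero2:.2f} GB, ZeRO-3 ~{zero3:.2f} GB per GPU")
print(f"single comm buffer ~{buffer:.2f} GB")
# Each stage only saves a few hundred MB over the previous one at this scale,
# which is on the same order as a single communication buffer.
```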
@tjruwase Thanks for your answer! Now I understand question 1, but for question 2, to my knowledge ZeRO-3 also needs gradient partitioning and all-reduce during the backward pass. In addition, ZeRO-3 needs a parameter all-gather (or maybe broadcast) during the backward pass. So it is still confusing to me that ZeRO-3 is faster than ZeRO-2.
@tjruwase Hi, I'm using the vit_gigantic_patch14_224 model with around 1.8B parameters but still observe that ZeRO-3 doesn't outperform ZeRO-1 and ZeRO-2 in memory reduction. I'm wondering if this is related to some default settings used by DeepSpeed, because when using fairscale's ZeRO implementation, the memory reduction matches the calculations in the original paper.
@dancingpipi, I am not familiar with this model. If you want to analyze this together, can you please share the following
@tjruwase Not sure if you are actually referring to me. The vit_gigantic_patch14_224 is just a scaled ViT model for image classification with a standard transformer architecture. Here are the observed results:
Setting | Memory (MB) |
---|---|
Baseline | 60963 |
FP16 | 42251 |
ZeRO-1 + FP16 | 20201 |
ZeRO-2 + FP16 | 21152 |
ZeRO-3 + FP16 | 20208 |
I'm using the Adam optimizer with a total training batch size of 128 on 8 GPUs (16/GPU). With FP16 training, the gradients in fp16 format are supposed to occupy around 3.3GB (1.8 * 2 * 10^9 / 1024^3) of GPU memory. So ZeRO-2 is supposed to give roughly a 2.9GB memory reduction compared with ZeRO-1. However, it's observed that memory increased instead. Also, we cannot observe a memory reduction with ZeRO-3.
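(The arithmetic above, written out; the 1.8B parameter count and 8-way sharding are taken from the comment, and the savings are relative to ZeRO-1, where fp16 gradients and parameters are fully replicated.)

```python
params, dp = 1.8e9, 8
GB = 1024 ** 3

fp16_grads  = 2 * params / GB   # ~3.35 GB of fp16 gradients per GPU
fp16_params = 2 * params / GB   # same size for fp16 parameters

zero2_saving = fp16_grads * (dp - 1) / dp                   # gradients sharded 8 ways -> ~2.9 GB expected
zero3_saving = (fp16_grads + fp16_params) * (dp - 1) / dp   # params sharded too -> ~5.8-5.9 GB expected

print(f"fp16 gradients: {fp16_grads:.2f} GB")
print(f"expected ZeRO-2 saving vs ZeRO-1: {zero2_saving:.2f} GB")
print(f"expected ZeRO-3 saving vs ZeRO-1: {zero3_saving:.2f} GB")
```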
@syorami, apologies for my typo, I was referring to you :). Thanks for sharing these results.
@syorami, I suspect that the memory overhead of the intermediate buffers used for gradient and parameter partitioning is exceeding the savings. Specifically, compared to ZeRO stage 1:
- ZeRO-2 should reduce fp16 gradient memory per rank to ~0.4GB (3.3/8), which is ~2.9GB of savings
- ZeRO-3 should reduce fp16 param + gradient memory per rank to ~0.8GB (2*3.3/8), which is ~5.8GB of savings
The relevant intermediate buffers are configured by the knobs shown here.
Specifically, the primary knobs for ZeRO-2 are `reduce_bucket_size` and `allgather_bucket_size`, and for ZeRO-3 they are (additionally) `stage3_max_live_parameters` and `stage3_max_reuse_distance`. Please read the above document for more details.
In terms of next steps, could you please share the current values of these configuration knobs and then try reducing them by an order of magnitude. Please share any performance impact you notice as you modify these values. Thanks!
@tjruwase Thanks for the information. I will give those knobs a try. I'm using the suggested configs from the docs. Here are the configs I'm using:
ZeRO-1:
```json
{
  "zero_optimization": {
    "stage": 1,
    "reduce_bucket_size": 5e8
  }
}
```
ZeRO-2:
```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "allgather_bucket_size": 5e8
  }
}
```
ZeRO-3 (offload disabled to better compare ZeRO's performance):
"zero_optimization": {
"stage": 3,
"contiguous_gradients": true,
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_prefetch_bucket_size": 1e7,
"stage3_param_persistence_threshold": 1e5,
"reduce_bucket_size": 1e7,
"sub_group_size": 1e9
},
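(If it helps with the experiment: a sketch of how a reduced-bucket variant of the ZeRO-2 config above might look as a Python dict passed to DeepSpeed. The 5e7 values are just the "one order of magnitude smaller" suggestion, not tuned numbers; the placeholder model and the `config=` keyword assume a reasonably recent DeepSpeed version.)

```python
import torch
import deepspeed

# Placeholder model; substitute the actual ViT model here.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_micro_batch_size_per_gpu": 16,   # 128 global batch / 8 GPUs, as above
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "reduce_bucket_size": 5e7,        # was 5e8
        "allgather_bucket_size": 5e7,     # was 5e8
    },
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```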
@syorami, thanks for sharing. I think those config values explain your observations.
You might find the following analysis useful: https://github.com/microsoft/DeepSpeed/issues/467
I look forward to seeing your observations from playing with those config knobs. Thanks!
@syorami, below is a really nice memory requirements estimator that might be useful:
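(The estimator itself is not reproduced here; as a stand-in, the short function below applies the per-rank model-state formulas from the ZeRO paper, using 2, 2, and 12 bytes per parameter for fp16 params, fp16 grads, and fp32 Adam states. It ignores activations and communication buffers, so it is only a lower bound and not the tool referenced above.)

```python
def zero_model_state_mem_gb(num_params, dp_degree, stage):
    """Per-rank model-state memory under ZeRO, following the paper's
    2P/2P/12P byte accounting. Activations and communication buffers
    are not included, so this is only a lower bound."""
    GB = 1024 ** 3
    p, g, o = 2 * num_params, 2 * num_params, 12 * num_params
    if stage == 1:          # shard optimizer states
        total = p + g + o / dp_degree
    elif stage == 2:        # shard optimizer states + gradients
        total = p + (g + o) / dp_degree
    elif stage == 3:        # shard everything
        total = (p + g + o) / dp_degree
    else:                   # stage 0: plain data parallelism
        total = p + g + o
    return total / GB

for stage in range(4):
    print(f"stage {stage}: {zero_model_state_mem_gb(1.8e9, 8, stage):.2f} GB/rank")
```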
@tjruwase hi, here are my test results of ZeRO-2:
Index | contiguous_gradients | overlap_comm | reduce_scatter | bucket_size | Memory (MB) |
---|---|---|---|---|---|
0 | True | True | True | 5e8 | 21152 |
1 | True | True | False | 5e8 | 21152 |
2 | True | False | True | 5e8 | 20198 |
3 | True | True | True | 1e8 | 19625 |
4 | True | True | True | 1e7 | 19284 |
5 | False | True | True | 5e8 | 19247 |
6 | False | False | True | 1e7 | 19247 |
Index 0 is my previous ZeRO-2 baseline, and I compare several knobs in experiments 1-5. It seems that some arguments, like `contiguous_gradients` and `bucket_size`, do affect the memory savings. However, combining them (exp 6) does not lead to further savings. Maybe that is related to certain implementation details.
After playing with the knobs, the memory usage is much closer to the theoretical calculations.
@syorami, thanks for sharing your findings. It is good that the knobs helped a bit, but I agree that we should see ~2.9GB savings with ZeRO-2 compared to ZeRO-1. Do you want to investigate further? I am happy to suggest more profiling and investigative directions. Thanks!
@tjruwase Sure! Also, I would like to share my results with fairscale's ZeRO implementation. It uses the same model and optimizer, although other settings are a little different.
Setting | Memory (GB) |
---|---|
AMP | 41.63 |
ZeRO-1 + AMP | 29.32 |
ZeRO-2 + AMP | 16.97 |
For ZeRO-1, the optimizer states are sharded, and the theoretical reduction is around 1.8 * 4 * 2 / 8 * 7 = 12.6GB. The actual reduction is 12.31GB.
For ZeRO-2, the gradients are sharded as well, and the theoretical reduction is 1.8 * 4 / 8 * 7 = 6.3GB. When replacing torch's native DDP with fairscale's ShardedDDP, the buckets are no longer used (link), and that reduction equals the 7.2GB of model states. The total theoretical reduction would then be 6.3 + 7.2 = 13.5GB, and the actual reduction of 12.35GB still matches the calculations.
I think the features offered by DeepSpeed are coupled with each other, so we cannot observe the theoretical reductions.
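(The arithmetic in the last two comments, written out; the decimal-GB convention follows the comments, so GiB figures would be about 7% smaller.)

```python
params, dp = 1.8e9, 8
GB = 1e9   # decimal GB, matching the numbers quoted above

# ZeRO-1: fp32 Adam momentum + variance (4 bytes each) sharded across 8 ranks
zero1_reduction = params * 4 * 2 * (dp - 1) / dp / GB   # = 12.6 GB (observed: 12.31 GB)

# ZeRO-2: fp32 gradients sharded as well, plus DDP buckets dropped by ShardedDDP
grad_sharding   = params * 4 * (dp - 1) / dp / GB       # = 6.3 GB
dropped_buckets = params * 4 / GB                       # = 7.2 GB
zero2_reduction = grad_sharding + dropped_buckets       # = 13.5 GB (observed: 12.35 GB)

print(zero1_reduction, zero2_reduction)
```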
Hello, when testing the llama 13B model with 8 A100 GPUs, I also find that ZeRO-3 is a bit faster than ZeRO-2, even with the same batch size. Any update on this issue? I think we should use profiler tools (such as NVIDIA Nsight Systems) for a deeper analysis. From a theoretical analysis, I suppose ZeRO-3 has additional communication compared to ZeRO-2; however, that additional communication may be hidden by the prefetch optimization. Even so, it still can't explain why ZeRO-3 is faster.
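(A rough communication-volume comparison, following the analysis in the ZeRO paper rather than a profile of the actual implementation. Per step and per rank, ZeRO-1/2 move about 2x the parameter count in traffic, while ZeRO-3 moves about 3x because parameters are re-gathered in both forward and backward; how much of that extra volume is hidden by prefetch/overlap is exactly what an Nsight Systems trace would show.)

```python
# Per-step communication volume from the ZeRO paper's analysis
# (in parameter-count units; ignores latency, topology, and compute overlap).
psi = 13e9   # llama-13B parameter count, approximately

zero12_volume = 2 * psi   # gradient reduce-scatter + parameter all-gather (same as plain DP's all-reduce)
zero3_volume  = 3 * psi   # param all-gather (fwd) + param all-gather (bwd) + gradient reduce-scatter

print(f"ZeRO-1/2: {zero12_volume:.2e} elements, ZeRO-3: {zero3_volume:.2e} elements "
      f"({zero3_volume / zero12_volume:.1f}x)")
```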
Following the bing_bert tutorial, my deepspeed_config is:
The CUDA memory usage for stage 1 is 8900MB per GPU. The CUDA memory usage for stage 2 is 9600MB per GPU.
And ZeRO-2 is much slower than ZeRO-1 in training speed.
Any help will be appreciated~