Baibaifan opened this issue 2 years ago
@Baibaifan, thanks for your question. For smaller models like 345m, zero-3 will be slower than lower stages of zero because of the overheads of parameter partitioning. Please see our page on autotuning for 345m gpt2 here. Is that helpful?
Thanks for your advice, but I just want to test the speed of DeepSpeed ZeRO-3 on the 345M GPT model. Please give me a specific configuration.
I know the differences between ZeRO-3 and ZeRO-2. If I just want to choose the best way to train a model, how should I use the ZeRO-3 configs?
Got it. Unfortunately, we don't have much insight to share because we have not investigated this scenario. However, you can try the following:
- Increase the micro-batch size as much as possible to exploit the memory savings of ZeRO-3.
- Tune the partitioning parameters to reduce partitioning overhead: `stage3_max_live_parameters`, `stage3_max_reuse_distance`, `stage3_prefetch_bucket_size`, and `stage3_param_persistence_threshold`.
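To make the suggestions above concrete, here is a minimal sketch of a DeepSpeed config with those knobs filled in. The numeric values are assumptions for illustration, not tuned recommendations; check the DeepSpeed configuration docs for the actual defaults and tune for your hardware.

```python
import json

# Illustrative ZeRO-3 config sketch. All values below are starting-point
# assumptions, not benchmarked settings for 345M GPT.
ds_config = {
    # Raise the micro-batch size as far as GPU memory allows.
    "train_micro_batch_size_per_gpu": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        # Partitioning knobs from the reply above; larger values trade
        # memory for less partitioning/communication overhead.
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_prefetch_bucket_size": 5e8,
        "stage3_param_persistence_threshold": 1e5,
    },
}

# Write the config out so it can be passed to the training script.
print(json.dumps(ds_config, indent=2))
```

The usual workflow is to dump this dict to a `ds_config.json` file and pass its path to the training script via `--deepspeed_config`.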
Thanks a lot. It would also be helpful to publish a performance baseline for the 345M GPT model.
**Is your feature request related to a problem? Please describe.**
If I want to experiment with ZeRO-3 to train the 345M GPT model, how should I set the relevant ZeRO-3 configuration? At present I use the default configuration and find that the training speed is not very fast.
model link: https://github.com/microsoft/DeepSpeedExamples/tree/174ae3bc8dbb688cfaccb4afa15d6e2cdbe19ce5/Megatron-LM-v1.1.5-ZeRO3
What does that mean?
What is `stage3_max_live_parameters`? Does it appear in the API documentation?
**Describe the solution you'd like**
I hope a classic configuration can be provided, such as for the 345M GPT model on a single 8-GPU node, which would make it easy for users to experience the effect of DeepSpeed.
**Describe alternatives you've considered**
Along with the released configuration information, you could publish a classic configuration with the best performance together with your measured results, which users could then use as a training reference.
**Additional context**
I hope performance data for the classic 345M GPT model on a single 8-GPU node under ZeRO-3 can be provided as a reference, for example results obtained by tuning from the default configuration of the 345M GPT model.