microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[REQUEST] Some comments on using deepspeed to know about the distributed effect #1782

Open Baibaifan opened 2 years ago

Baibaifan commented 2 years ago

Is your feature request related to a problem? Please describe.

If I want to experiment with ZeRO-3 to train the 345M GPT model, how should I set the relevant ZeRO-3 configuration? At present I am using the default configuration and find that the training speed is not very fast.

model link: https://github.com/microsoft/DeepSpeedExamples/tree/174ae3bc8dbb688cfaccb4afa15d6e2cdbe19ce5/Megatron-LM-v1.1.5-ZeRO3

#ZeRO Configs
stage=3
reduce_scatter=true
contigious_gradients=true
rbs=50000000
agbs=5000000000

What do these settings mean?

{
  "train_batch_size": 64,
  "gradient_accumulation_steps": 1,
  "steps_per_print": 1,
  "zero_optimization": {
    "stage": 3,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e7,
    "stage3_param_persitence_threshold": 1e5,
    "reduce_bucket_size": 1e7,
    "contiguous_gradients": true
  },
  "gradient_clipping": 1.0,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "wall_clock_breakdown": true,
  "zero_allow_untested_optimizer": false
}

What is stage3_max_live_parameters? Is it described in the API documentation?
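
For context, a JSON config like the one above is what gets handed to DeepSpeed when the engine is created. The following is only a minimal sketch, assuming the JSON above is saved as ds_config.json and using a toy model and plain Adam as stand-ins for the real Megatron setup:

import deepspeed
import torch

# Toy stand-in for the 345M GPT model; any torch.nn.Module is wired up the same way.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)

# deepspeed.initialize accepts either a path to the JSON config or an
# equivalent Python dict via the config argument.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",
)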

Describe the solution you'd like

I hope you can provide a classic reference configuration, such as for the 345M GPT model on a single machine with 8 GPUs, so that users can easily experience the effect of DeepSpeed's algorithms.

Describe alternatives you've considered

Along with the released configurations, you could provide a classic configuration with the best performance, together with the measured results, which could then serve as a training reference.

Additional context

I hope you can publish performance data for the classic 345M GPT model on a single machine with 8 GPUs under ZeRO-3 as a reference. For example, numbers obtained with the default configuration of the 345M GPT model would already be a useful baseline.

tjruwase commented 2 years ago

@Baibaifan, thanks for your question. For smaller models like 345m, zero-3 will be slower than lower stages of zero because of the overheads of parameter partitioning. Please see our page on autotuning for 345m gpt2 here. Is that helpful?
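
For illustration only, a ZeRO-2 configuration of the kind referred to as a "lower stage" might look like the sketch below; the bucket sizes are placeholders, not tuned values, and the dict would be passed via the config argument of deepspeed.initialize as in the sketch earlier in the thread:

# A rough ZeRO-2 counterpart to the ZeRO-3 JSON posted above.
# Bucket sizes are illustrative, not recommendations.
ds_config_zero2 = {
    "train_batch_size": 64,
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 2,                   # partition optimizer states and gradients only
        "overlap_comm": True,         # overlap gradient reduction with the backward pass
        "reduce_scatter": True,
        "reduce_bucket_size": 5e8,
        "allgather_bucket_size": 5e8,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
}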

Baibaifan commented 2 years ago

Thanks for your advice, but I just want to test the speed of DeepSpeed ZeRO-3 on the 345M GPT model. Could you please give me a specific configuration?

Baibaifan commented 2 years ago

I know the differences between ZeRO-3 and ZeRO-2. If I just want to choose the best way to train a model with ZeRO-3, how should I set its configuration?

tjruwase commented 2 years ago

Got it. Unfortunately, we don't have much insight to share because we have not investigated this scenario. However, you can try the following:

  1. Increase the micro-batch size as much as possible to exploit the memory savings of ZeRO-3.
  2. Tune the partitioning parameters to reduce partitioning overhead: stage3_max_live_parameters, stage3_max_reuse_distance, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold. A rough config sketch follows below.
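
As a sketch of what points 1 and 2 could look like for a 345M model, with illustrative values only (this is not a measured or official recommendation):

# Sketch of the knobs from the list above, as a dict for deepspeed.initialize.
# For a model this small, trading some of ZeRO-3's memory savings back for speed
# usually means keeping more parameters resident and prefetching larger buckets.
ds_config_zero3 = {
    "train_micro_batch_size_per_gpu": 16,  # point 1: raise until memory becomes the limit
    "gradient_accumulation_steps": 1,
    "zero_optimization": {
        "stage": 3,
        # 1e9 already exceeds 345M parameters, so effectively the whole model can stay live.
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        # Larger prefetch buckets and a higher persistence threshold mean fewer
        # small all-gathers, at the cost of some extra memory.
        "stage3_prefetch_bucket_size": 5e7,
        "stage3_param_persistence_threshold": 1e6,
        "reduce_bucket_size": 5e7,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": True},
    "gradient_clipping": 1.0,
}
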
Baibaifan commented 2 years ago

Thanks a lot. It would also be great to have a performance baseline for the 345M GPT model.