microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
35.49k stars 4.12k forks

[BUG] Pipeline engine's training gets stuck when zero=1 #5792

Open janelu9 opened 3 months ago

janelu9 commented 3 months ago

pp_size = 8. Stage 0 contains a vision encoder of 45 layers; stages 1~7 contain 56 decoder layers. ZeRO 0 works well, but ZeRO 1 and bf16/fp16 failed. Much more GPU memory would be saved if ZeRO 1 ran well.
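For reference, the two setups being compared could be written roughly as the config dicts below. This is only a sketch of the relevant settings, not the exact config from my run: the helper name `make_ds_config` and the batch-size values are placeholders I made up for illustration, and only the `zero_optimization` stage and the `bf16`/`fp16` switch reflect what is described above.

```python
import json


def make_ds_config(zero_stage: int, precision: str = "bf16") -> dict:
    """Build a DeepSpeed-style config dict (hypothetical helper for illustration)."""
    if precision not in ("bf16", "fp16"):
        raise ValueError("precision must be 'bf16' or 'fp16'")
    return {
        # Placeholder batch settings; the real values are not given in this report.
        "train_micro_batch_size_per_gpu": 1,
        "gradient_accumulation_steps": 8,
        # The only setting that differs between the working and the stuck run.
        "zero_optimization": {"stage": zero_stage},
        # Mixed precision is enabled in both cases.
        precision: {"enabled": True},
    }


zero0_config = make_ds_config(0)  # reported to work
zero1_config = make_ds_config(1)  # reported to get stuck

print(json.dumps(zero1_config, indent=2))
```

A dict like this would be passed as the `config` argument of `deepspeed.initialize` together with the pipeline model; the pipeline layout itself (pp_size = 8, encoder in stage 0) is defined in the model code and is not shown here.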

janelu9 commented 3 months ago

The encoder may not be used sometimes, because images do not always exist in the questions.

janelu9 commented 3 months ago

Are there any operations that are not allowed between stages when ZeRO is 1?