microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0
34.93k stars 4.06k forks source link

Deepspeed stage 3 hanging after 1st validation sample #5635

Open AceMcAwesome77 opened 3 months ago

AceMcAwesome77 commented 3 months ago

I am training a pytorch-lightning encoder-decoder model on 2 GPUs. The model trains fine (with roughly the 2x acceleration expected) when using deepspeed stage 2. When using deepspeed stage 3, the model passes through the full training step of epoch 1 fine, but when it gets to the validation step, it passes through just 1 sample before permanently hanging with no error message. Logging appears to show that it is running 2 validation samples at once (even though val batch size = 1), and the hanging starts right after one of the validation samples has finished and produced a result while the other one is still going. Is this a common problem when using deepspeed stage 3? If so, what is the fix? The model has custom code that would be too big to paste here, but I can try to paste relevant sections of it if there is a particular spot that would be useful to see. Thanks!