I am training a PyTorch Lightning encoder-decoder model on 2 GPUs. The model trains fine (with roughly the 2x speedup expected) under DeepSpeed stage 2. Under DeepSpeed stage 3, it completes the entire training phase of epoch 1, but during validation it processes just one sample before hanging permanently with no error message. Logging suggests that two validation samples are being run at once (even though the validation batch size is 1), and the hang begins right after one of those samples has finished and produced a result while the other is still running. Is this a known problem with DeepSpeed stage 3? If so, what is the fix? The model contains custom code that is too large to paste here, but I can share relevant sections if there is a particular spot that would be useful to see. Thanks!
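In case it helps, here is a minimal sketch of how the two runs are launched. The Trainer arguments are representative placeholders, not my exact config; switching the ZeRO stage between 2 and 3 is the only difference between the working run and the hanging one.

```python
# Sketch of the launch configuration (placeholders, not my real model/data).
# "deepspeed_stage_2" / "deepspeed_stage_3" are the Lightning strategy
# registry names that select the DeepSpeed ZeRO stage.

def deepspeed_strategy(stage: int) -> str:
    """Build the Lightning strategy string for a given ZeRO stage."""
    return f"deepspeed_stage_{stage}"

# trainer = pl.Trainer(accelerator="gpu", devices=2,
#                      strategy=deepspeed_strategy(2))  # trains + validates fine
# trainer = pl.Trainer(accelerator="gpu", devices=2,
#                      strategy=deepspeed_strategy(3))  # hangs in validation
```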