Describe the bug
Asynchronous communication in ZeRO stage 2 misbehaves when overlap_comm is enabled: training is not reproducible across runs.
To Reproduce
Steps to reproduce the behavior:
Use DeepSpeed ZeRO-2 with Hugging Face to train the bloomz-7b1-mt model. With overlap_comm = true and all sources of randomness controlled, the loss is still different on every run.
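For anyone trying to reproduce this, "controlling the randomness" could look roughly like the sketch below. It is not the exact code from my training script; `seed_everything` is a hypothetical helper name, and it assumes the usual Python, NumPy, and PyTorch RNGs are the only sources of randomness in the run.

```python
import os
import random


def seed_everything(seed: int = 42) -> None:
    """Seed every RNG the training process is assumed to use (hypothetical helper)."""
    # Only affects hash randomization in child processes started after this point.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)

    try:
        import numpy as np  # optional: seeded only if NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass

    try:
        import torch  # optional: seeded only if PyTorch is installed
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
        # Ask cuDNN for deterministic kernels and disable autotuning.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass


if __name__ == "__main__":
    seed_everything(1234)
    first = random.random()
    seed_everything(1234)
    second = random.random()
    print(first == second)
```

Calling this on every rank before the model and data loaders are built, then running the same job twice, is one way to check whether the loss curves match step for step.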
model: bloomz-7b1-mt
zero2-config:
System info (please complete the following information):
Launcher context: torchrun
Docker context: nvcr.io/nvidia/pytorch:22.07-py3