Closed: yuanze-lin closed this issue 2 years ago
Hi @yzleroy, this does not seem like a bug from our code but instead from pytorch distributed. Could you try to resume the training and see if this is a persistent issue?
I have tried training the models from scratch multiple times, and this error always appears, so I don't know how to solve it.
Can you try single-GPU training to see what happens?
Hi @yzleroy, could you write a bit about how you solved the issue? This is helpful for future readers. Thanks
I found it was caused by the hardware devices :)
I tried training the models, but after about 1 epoch the error appears. I am using 8 A5000 GPUs for training.