jayleicn / singularity

[ACL 2023] Official PyTorch code for Singularity model in "Revealing Single Frame Bias for Video-and-Language Learning"

https://arxiv.org/abs/2206.03428

MIT License

Training Error #7

yuanze-lin closed this issue 2 years ago

yuanze-lin commented 2 years ago

I have tried to train the models; however, after about 1 epoch, an error appears. I am using 8 A5000 GPUs for training.

jayleicn commented 2 years ago

Hi @yzleroy, this does not seem like a bug in our code but rather one from PyTorch distributed. Could you try resuming the training to see whether the issue persists?

yuanze-lin commented 2 years ago

I have tried training the models from scratch multiple times, and this error always appears, so I don't know how to solve it.

jayleicn commented 2 years ago

Can you try single-GPU training to see what happens?
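One way to try this is to expose a single device to CUDA through the `CUDA_VISIBLE_DEVICES` environment variable and launch the run once per GPU, which can also help show whether one particular device is at fault. The following is only a minimal sketch: `single_gpu_env` and `run_on_each_gpu` are hypothetical helper names, and `command` stands in for whatever training command you normally use (it is not taken from this repository).

```python
import os
import subprocess


def single_gpu_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes only one GPU to CUDA."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_VISIBLE_DEVICES restricts which devices the CUDA runtime can see.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_on_each_gpu(command, num_gpus):
    """Run `command` once per GPU and collect the exit code for each device.

    `command` is a placeholder for your own training command (an assumption,
    not something defined by this repository).
    """
    exit_codes = {}
    for gpu_index in range(num_gpus):
        result = subprocess.run(command, env=single_gpu_env(gpu_index))
        exit_codes[gpu_index] = result.returncode
    return exit_codes
```

If the run fails only for certain GPU indices, that points toward a device-level problem rather than the training code.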

jayleicn commented 2 years ago

Hi @yzleroy, could you write a bit about how you solved the issue? It would be helpful for future readers. Thanks.

yuanze-lin commented 2 years ago

I found that it was caused by the hardware devices :)