tsujuifu / pytorch_violet

A PyTorch implementation of VIOLET

Error step_pretrain on Rank #5

Closed lileiooo closed 2 years ago

lileiooo commented 2 years ago
Hello, I ran the pre-training in an environment with 4 GPUs. The message "Error step_pretrain on Rank" is displayed for ranks 1, 3, 2, and 0, and the pre-training does not succeed.

tsujuifu commented 2 years ago

The message is from here, which is a try-except block. You can remove that block and see exactly where the error occurs.

I have this try-except because distributed training with torch.distributed may sometimes get stuck on our system. If you get this error every time/iteration, something is probably not running correctly.
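
As an alternative to deleting the block, the except branch could print the full traceback so the real error becomes visible. The following is only a minimal sketch and not the repository's actual code: `run_step`, the `debug` flag, and the failing stand-in `step_pretrain` are hypothetical names used for illustration, and only the message text is taken from the report above.

```python
import traceback


def step_pretrain(batch):
    # Hypothetical stand-in for the real training step; it fails on purpose
    # so that the except branch below is exercised.
    raise RuntimeError(f"bad batch: {batch}")


def run_step(batch, rank, debug=False):
    """Run one step and return (ok, details).

    With debug=False the exception is hidden behind a one-line message,
    which is the behaviour described in this thread. With debug=True the
    full traceback is captured and printed as well.
    """
    try:
        step_pretrain(batch)
        return True, ""
    except Exception:
        # Assumption: the original block prints a message similar to this one.
        print(f"Error step_pretrain on Rank {rank}")
        details = traceback.format_exc() if debug else ""
        if debug:
            print(details)
        return False, details


ok, details = run_step({"id": 0}, rank=1, debug=True)
```

With `debug=True`, the traceback names the exception type and the line that raised it, which is the information you would otherwise get by removing the try-except entirely.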