lileiooo closed this 2 years ago
The message is from here, which is a try-except block. You can remove that block to see exactly where the error is raised.
I added this try-except because distributed training with torch.distributed sometimes gets stuck on our system. If you get this error on every iteration, something is probably not running correctly.
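
If you would rather not delete the block, a rough sketch of an alternative is to print the full traceback or re-raise while debugging. This is not the code from the repository: `run_step`, `failing_step`, and the `debug` flag are hypothetical names used only for illustration.

```python
import traceback


def run_step(step_fn, debug=False):
    """Run one training step (hypothetical wrapper, not the repository's code)."""
    try:
        return step_fn()
    except Exception:
        if debug:
            # Re-raise so the full traceback points at the failing line.
            raise
        # Print the traceback instead of only a generic message, then skip the step.
        print(traceback.format_exc())
        return None


def failing_step():
    # Stand-in for a real training step that hits an error.
    raise RuntimeError("simulated failure inside a training step")


# The error is caught, the traceback is printed, and the step is skipped.
result = run_step(failing_step)
```

With `debug=True` the exception propagates, which has the same effect as removing the try-except while you track down the failure.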