microsoft / Oscar

Oscar and VinVL
MIT License

fix the deadlock issue when using distributed training in vqa finetune #196

Closed Light-V closed 2 years ago

Light-V commented 2 years ago

The main process (local_rank 0) waits forever because the other processes never call torch.distributed.barrier().
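To illustrate why this hangs, here is a minimal sketch that uses only the Python standard library. It is not the Oscar code: `threading.Barrier` stands in for `torch.distributed.barrier()`, threads stand in for ranks, and `WORLD_SIZE`, `run`, and the timeout are hypothetical names chosen for the demo. A barrier is a collective, so it only releases once every participant has reached it. If only rank 0 calls it, rank 0 never gets through.

```python
import threading

WORLD_SIZE = 2  # hypothetical number of ranks, simulated here with threads


def run(ranks_calling_barrier, timeout=0.5):
    """Return the set of simulated ranks that got past the barrier.

    threading.Barrier is only a stand-in for torch.distributed.barrier();
    the timeout replaces "waits forever" so the demo terminates.
    """
    barrier = threading.Barrier(WORLD_SIZE)
    passed = set()
    lock = threading.Lock()

    def worker(rank):
        if rank not in ranks_calling_barrier:
            return  # this rank skips the barrier entirely
        try:
            barrier.wait(timeout=timeout)
        except threading.BrokenBarrierError:
            return  # timed out: in real distributed training this is the hang
        with lock:
            passed.add(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return passed


# Only rank 0 reaches the barrier: nobody gets through.
mismatched = run({0})

# Every rank reaches the barrier: all of them get through.
matched = run({0, 1})
```

In the real training script the equivalent fix is to make sure that any `torch.distributed.barrier()` call reached by rank 0 is also reached by every other rank on the same code path.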