mingkai-zheng / WCL

Weakly Supervised Contrastive Learning

can't run torch.distributed.scatter with nccl backend #6

Closed pablotalavante closed 1 year ago

pablotalavante commented 2 years ago

Hello, I tried to run your code on a single GPU, making some modifications but keeping the DistributedDataParallel code. However, I ran into a problem here,

where I have to run torch.distributed.scatter(mask1, mask1_list, 0).

You initialize DDP with backend=nccl, but according to the docs this operation is not supported by that backend.

Do you have any idea how to overcome this? Thank you!
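For context, one generic workaround (not specific to this repository) is to route the unsupported collective through a side process group created with the gloo backend, which does implement scatter on CPU tensors. The sketch below runs as a single process and uses gloo for the default group purely so it is self-contained; in the real script the default group would stay nccl and only the scatter would go through the side group. All variable names are illustrative.

```python
import os
import torch
import torch.distributed as dist

# In the real training script the default group is initialized with the
# nccl backend; scatter is then routed through a side gloo group, which
# does support it (on CPU tensors). For a self-contained, single-process
# sketch we also initialize the default group with gloo here.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

# Side group with an explicit backend, independent of the default one.
side_group = dist.new_group(ranks=[0], backend="gloo")

world_size = dist.get_world_size()
full_mask = torch.ones(4, 4)  # stand-in for the mask computed on rank 0
recv = torch.empty(4 // world_size, 4)
chunks = list(full_mask.chunk(world_size)) if dist.get_rank() == 0 else None
dist.scatter(recv, chunks, src=0, group=side_group)  # CPU scatter via gloo
# recv would then be moved back to the GPU, e.g. recv = recv.cuda()
dist.destroy_process_group()
```

The cost is an extra device-to-host copy for that one tensor each step, which is usually negligible next to the forward/backward pass.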

mingkai-zheng commented 1 year ago

Sorry, I just realized that I forgot to reply to your question. Since I was using our internal deep-learning library to train the model, I did not run into this issue. However, a simple way to resolve it is to calculate the 1-NN adjacency matrix on each GPU separately, so you don't have to scatter it from rank 0.
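The suggested fix could be sketched as follows: assuming every rank already holds the same all-gathered, L2-normalized feature matrix (the contrastive loss typically all-gathers embeddings anyway), each rank can build the identical 1-NN adjacency mask locally and keep only its own rows, with nothing to scatter from rank 0. The function and variable names are hypothetical, and details such as symmetrization and self-edges should follow the repository's actual mask construction.

```python
import torch

def local_knn_mask(all_features, rank, local_bs):
    # all_features: (world_size * local_bs, d), L2-normalized and
    # identical on every rank after an all_gather.
    sim = all_features @ all_features.t()  # cosine similarities
    sim.fill_diagonal_(-2.0)               # exclude self-matches
    nn_idx = sim.argmax(dim=1)             # index of each sample's 1-NN
    n = all_features.size(0)
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[torch.arange(n), nn_idx] = True
    mask = mask | mask.t()                 # symmetrize the 1-NN graph
    mask.fill_diagonal_(True)              # self-edges (assumed convention)
    # Each rank keeps only the rows for its own local samples.
    return mask[rank * local_bs:(rank + 1) * local_bs]
```

Since the input features are identical across ranks, every rank computes the same full mask and slicing by rank reproduces exactly what the scatter from rank 0 would have delivered.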