Closed — pablotalavante closed this issue 1 year ago
Sorry, I just realized that I forgot to reply to your question. Since I was using our internal deep-learning library to train the model, I did not run into this issue. However, a simple way to resolve it is to compute the 1-NN adjacency matrix on each GPU separately, so you don't have to scatter it from rank 0.
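To illustrate the suggestion above, here is a minimal sketch of building a 1-nearest-neighbour adjacency matrix from a feature batch entirely on the local device, so no cross-rank scatter is needed. The function name and the choice of Euclidean distance via `torch.cdist` are my assumptions, not necessarily what the repository's code does:

```python
import torch

def local_1nn_adjacency(features: torch.Tensor) -> torch.Tensor:
    """Build a 1-NN adjacency matrix on the device that holds `features`.

    features: (N, D) tensor, e.g. a batch already on this rank's GPU.
    Returns an (N, N) boolean matrix where entry (i, j) is True iff
    j is the nearest neighbour of i (excluding i itself).
    """
    # Pairwise Euclidean distances between all rows.
    dist = torch.cdist(features, features)
    # Exclude self-matches by pushing the diagonal to infinity.
    dist.fill_diagonal_(float("inf"))
    # Index of each row's nearest neighbour.
    nn_idx = dist.argmin(dim=1)
    adj = torch.zeros_like(dist, dtype=torch.bool)
    adj[torch.arange(features.size(0)), nn_idx] = True
    return adj
```

Because every rank computes this from its own local batch, the result never has to leave the GPU, which sidesteps the `torch.distributed.scatter` call entirely.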
Hello, I tried to run your code on a single GPU, making some modifications while keeping the DataParallel code, but I run into a problem where I have to call
torch.distributed.scatter(mask1, mask1_list, 0)
You initialize DDP with backend=nccl, but according to the docs this operation is not allowed with that backend. Do you have any idea how to overcome this? Thank you!