Closed: getchebarne closed this issue 2 years ago
Update: setting torch.cuda.set_device(rank) before initializing the process group seems to fix this issue. Oddly, this call isn't needed with DistributedDataParallel.
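For reference, the ordering that fixed it looks roughly like this (a minimal sketch; the rendezvous settings are placeholders, not from the original script):

```python
import os

import torch
import torch.distributed as dist


def setup(rank: int, world_size: int) -> None:
    # Placeholder rendezvous settings for a single-node run.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    # Bind this process to its GPU *before* creating the process group,
    # so NCCL doesn't end up mapping every rank onto cuda:0.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
```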
torch.cuda.set_device(rank) is needed before calling DistributedModelParallel. I think the same applies to DistributedDataParallel; please see the documentation here: https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html#distributeddataparallel
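A minimal sketch of that ordering with DistributedModelParallel (the wrapper function here is illustrative, not from the original report):

```python
import torch
import torch.nn as nn
from torchrec.distributed.model_parallel import DistributedModelParallel


def wrap(module: nn.Module, rank: int) -> DistributedModelParallel:
    # Assumes torch.cuda.set_device(rank) and init_process_group("nccl")
    # have already been called for this process (see the setup sketch above).
    return DistributedModelParallel(
        module=module,
        device=torch.device(f"cuda:{rank}"),
    )
```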
Setting torch.cuda.set_device(rank) before initializing the process group is effective, thanks!
Hello,
I'm trying to train a TorchRec model on a single node with two NVIDIA A100 GPUs.
I installed TorchRec and FBGEMM from source. My TorchRec version:
Below is the script I'm trying to run. I've replaced the model with a very simple EmbeddingBagCollection (EBC) to rule out issues with my model's architecture.
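Something along these lines (simplified; the table names and sizes are placeholders, not the original script):

```python
import torch
from torchrec.modules.embedding_configs import EmbeddingBagConfig
from torchrec.modules.embedding_modules import EmbeddingBagCollection

# Two tiny tables, just enough to exercise sharding across the two GPUs.
ebc = EmbeddingBagCollection(
    tables=[
        EmbeddingBagConfig(
            name="table_0",
            embedding_dim=16,
            num_embeddings=1_000,
            feature_names=["feature_0"],
        ),
        EmbeddingBagConfig(
            name="table_1",
            embedding_dim=16,
            num_embeddings=1_000,
            feature_names=["feature_1"],
        ),
    ],
    device=torch.device("meta"),
)
```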
When I try to run this code, I get the following error. I set export NCCL_DEBUG=INFO to capture NCCL's logs, and reading them I noticed these two lines:
Could this be the cause of the issue? If so, how do I solve it? I ran another regular PyTorch script with DistributedDataParallel and had no NCCL issues; that script ran fine.
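That DDP script followed the usual pattern, roughly (illustrative, not the actual script):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def run(rank: int, world_size: int) -> None:
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    # This worked even without torch.cuda.set_device(rank), since
    # device_ids pins the replica to the right GPU.
    model = DDP(nn.Linear(8, 8).to(rank), device_ids=[rank])
```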