Open ljz756245026 opened 3 years ago
Have you solved this issue? I ran into the same problem, but could not find any resources or solutions.
No. Bitfusion does not support DDP because some NCCL versions are not supported by Bitfusion, and we cannot change the NCCL version.
Recently, I got a VM with 2 A100 GPUs. I want to use this VM to run data-parallel training with PyTorch. However, I have run into several problems with the environment. The same setup works on my lab's server without Bitfusion. I would like to know whether Bitfusion does not support torch.nn.DataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) or NCCL (https://developer.nvidia.com/nccl). I am looking forward to your reply.
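Since the thread comes down to which NCCL version the PyTorch build uses, a small diagnostic script may help when comparing the working lab server with the Bitfusion VM. The following is only a sketch: `collect_env_report` is a hypothetical helper name, and it reports what the local PyTorch install sees. It does not check Bitfusion compatibility itself.

```python
def collect_env_report():
    """Gather basic PyTorch / CUDA / NCCL facts for the local environment."""
    report = {"torch_installed": False}
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing further to inspect.
        return report

    report["torch_installed"] = True
    report["torch_version"] = torch.__version__
    report["cuda_available"] = torch.cuda.is_available()
    report["gpu_count"] = torch.cuda.device_count()

    import torch.distributed as dist

    # is_nccl_available() only exists when the distributed package is built in.
    report["distributed_available"] = dist.is_available()
    report["nccl_backend_available"] = bool(
        dist.is_available() and dist.is_nccl_available()
    )

    try:
        import torch.cuda.nccl

        # NCCL version bundled with this PyTorch build (format varies by release).
        report["nccl_version"] = torch.cuda.nccl.version()
    except Exception:
        # CPU-only builds or builds without NCCL end up here.
        report["nccl_version"] = None

    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Running it on both machines and comparing the `nccl_version` lines shows whether the two environments actually differ in the NCCL version that PyTorch was built with.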