vmware-archive / bitfusion-with-kubernetes-integration

Bitfusion with Kubernetes Integration Support

Can we use Bitfusion to run Distributed Data Parallel PyTorch code? #43

Open ljz756245026 opened 3 years ago

ljz756245026 commented 3 years ago

I recently got a VM with two A100 GPUs, and I want to use it to run data-parallel training with PyTorch. However, I have run into several problems with the environment. The same code runs successfully on my lab's server without Bitfusion. Does Bitfusion not support torch.nn.parallel.DistributedDataParallel (https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) or NCCL (https://developer.nvidia.com/nccl)?

I am looking forward to your reply.

YanJenHuang commented 1 year ago

Have you solved this issue? I ran into the same problems but could not find any resources or solutions.

ljz756245026 commented 1 year ago

No! Bitfusion does not support DDP because some NCCL versions are not supported by Bitfusion, and we cannot change the NCCL version.
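
For anyone hitting the same problem, it may help to first check which NCCL version the local PyTorch build bundles, since that is the version the point above refers to. The following is a minimal sketch, not a confirmed fix. The helper names `pick_backend` and `report` are hypothetical. The fallback to the Gloo backend is an assumption for illustration only: this thread does not confirm that Gloo works under Bitfusion.

```python
import importlib.util


def pick_backend(cuda_available: bool, nccl_available: bool) -> str:
    """Hypothetical helper: choose 'nccl' only if CUDA and NCCL are both usable."""
    return "nccl" if cuda_available and nccl_available else "gloo"


def report() -> dict:
    """Collect NCCL-related facts about the local PyTorch build."""
    # Return early if PyTorch is not installed in this environment.
    if importlib.util.find_spec("torch") is None:
        return {"torch": None}

    import torch
    import torch.distributed as dist

    cuda_ok = torch.cuda.is_available()
    # is_nccl_available() is only queried when the distributed package exists.
    nccl_ok = dist.is_available() and dist.is_nccl_available()

    info = {
        "torch": torch.__version__,
        "cuda_available": cuda_ok,
        "nccl_available": nccl_ok,
        # Assumption: fall back to Gloo when NCCL cannot be used.
        "backend": pick_backend(cuda_ok, nccl_ok),
    }
    if cuda_ok and nccl_ok:
        from torch.cuda import nccl

        # NCCL version bundled with this PyTorch build.
        info["nccl_version"] = nccl.version()
    return info


if __name__ == "__main__":
    print(report())
```

Running the script once on a machine where DDP works and once inside the Bitfusion environment should show whether the two setups differ in NCCL version or availability. The `backend` value can then be passed to `torch.distributed.init_process_group(backend=...)` if you want to experiment with Gloo.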