kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Right way to use pytorch-operator for multi-node multi-gpu setup #219

Open lainisourgod opened 5 years ago

lainisourgod commented 5 years ago

Hi! Suppose in my cluster I have 2 nodes with 2 GPUs each. Which is the better practice for using all 4 GPUs:

  1. To spawn 4 pods with 1 GPU each, or
  2. To spawn 2 pods with 2 GPUs each?

I've seen similar issues: #128 and #30, but they do not give clear instructions on which variant is best.

Also, as mentioned in those issues, using multiple GPUs per pod requires code changes compared to the 1-GPU case, so it would be nice to have instructions on how to adapt training code to this situation.
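
For concreteness, here is a rough sketch of what the training code for option 1 (one GPU per pod) might look like, assuming the operator injects MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK into each replica as it does for the mnist example; the linear model and random batches are just placeholders:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # With exactly one GPU per pod, the pod's RANK is the process rank and the
    # only visible device is cuda:0 (MASTER_ADDR/MASTER_PORT/WORLD_SIZE/RANK
    # are assumed to be present in the environment).
    dist.init_process_group(backend="nccl", init_method="env://")
    device = torch.device("cuda", 0)

    model = DDP(nn.Linear(10, 1).to(device), device_ids=[0])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for _ in range(100):
        x = torch.randn(32, 10, device=device)   # placeholder batch
        loss = model(x).pow(2).mean()             # placeholder loss
        optimizer.zero_grad()
        loss.backward()                           # gradients are all-reduced across pods
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```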

Thanks for your work!

mengdong commented 5 years ago

I observe the same issue. Ideally, I want to run the following.

Node 1:

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr=masterAddr --master_port=1234 train.py .........

Node 2:

python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr=masterAddr --master_port=1234 train.py .........
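
For context, a minimal train.py stub compatible with these launch commands might look like the sketch below; torch.distributed.launch passes --local_rank to every process it spawns and sets the rendezvous variables for init_method="env://". The model is a placeholder and the training loop is omitted:

```python
import argparse

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torch.distributed.launch passes --local_rank to every spawned process
    # and sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE for init_method="env://".
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args, _ = parser.parse_known_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    model = DDP(nn.Linear(10, 1).cuda(args.local_rank),
                device_ids=[args.local_rank])
    # ... the usual training loop; gradients are synchronized across all
    # nproc_per_node * nnodes processes ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```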

gaocegege commented 5 years ago

/assign @johnugeorge @gaocegege

johnugeorge commented 5 years ago

@meandmymind Unfortunately, we don't have results for it. It depends on the setup and the workloads that you use. The benchmark results can vary based on that.

lainisourgod commented 5 years ago

@johnugeorge Do you have plans to run these benchmarks?

What I'm really interested in is how to actually test the multi-GPU-per-pod setup.

I tested the mnist-ddp example with the following setup.

I observed that only 1 GPU per node was actually used; the other one showed about 8% GPU utilization with no memory in use.

I suppose this is because DDP is designed so that each GPU should be used by exactly one process. So when 2 GPUs are provided to a pod, only one of them is acquired by the process, and the other is merely reserved by the pod.

In order to use the multi-GPU-per-pod setup, I needed to redefine the local rank for each pod and spawn multiple processes per container (with launch.py), and at that point there is little use for Kubeflow in this setup.
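
Roughly, the kind of change involved looks like the sketch below. It uses torch.multiprocessing.spawn instead of launch.py, and assumes the operator still provides MASTER_ADDR/MASTER_PORT plus a per-pod RANK and WORLD_SIZE; the training loop itself is omitted:

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(local_rank, pod_rank, num_pods, gpus_per_pod):
    # The global rank is recomputed from the pod's rank and the local GPU index.
    global_rank = pod_rank * gpus_per_pod + local_rank
    world_size = num_pods * gpus_per_pod

    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",     # MASTER_ADDR / MASTER_PORT come from the pod env
        rank=global_rank,
        world_size=world_size,
    )
    # ... build the model on cuda:{local_rank}, wrap it in DDP with
    # device_ids=[local_rank], and run the training loop ...
    dist.destroy_process_group()


if __name__ == "__main__":
    gpus_per_pod = torch.cuda.device_count()
    pod_rank = int(os.environ["RANK"])         # rank the operator assigns per pod
    num_pods = int(os.environ["WORLD_SIZE"])   # number of pod replicas
    mp.spawn(worker, args=(pod_rank, num_pods, gpus_per_pod), nprocs=gpus_per_pod)
```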

I guess it should then be stated clearly that the multi-GPU-per-pod setup is not the use case pytorch-operator is designed for, so people won't try it.

But just theoretically, do you think inter-pod communication (the 1 GPU per pod case) costs more than inter-process communication (the many GPUs per pod case)?

gaocegege commented 5 years ago

do you think inter-pod communication (the 1 GPU per pod case) costs more than inter-process communication (the many GPUs per pod case)?

It depends on the network infra of your Kubernetes cluster. In some cases, yes, in other cases, no.

mengdong commented 5 years ago

@gaocegege Could you elaborate on the conditions under which inter-pod communication costs less than inter-process communication? Thanks

lainisourgod commented 5 years ago

@gaocegege I guess that if a pod previously had 2 GPUs (on one node) and is then split into 2 pods with 1 GPU each, both pods will be scheduled on the same node. So it seems to me that the inter-pod communication overhead will be insignificant.

mengdong commented 5 years ago

@gaocegege I guess that if a pod previously had 2 GPUs (on one node) and is then split into 2 pods with 1 GPU each, both pods will be scheduled on the same node. So it seems to me that the inter-pod communication overhead will be insignificant.

Knowing how much the overhead is would help too.

gaocegege commented 5 years ago

Knowing how much the overhead is would help too.

It depends on the network configuration of your cluster (NIC, network plugin, and so on). I think it is better to run some evaluations to get real numbers for your cluster.
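
One simple way to run such an evaluation is to time a large all_reduce under each job layout (1 GPU per pod vs. several GPUs per pod). A rough sketch, with an arbitrary payload size and iteration count; LOCAL_RANK defaults to 0 when there is one GPU per pod:

```python
import os
import time

import torch
import torch.distributed as dist


def main():
    dist.init_process_group(backend="nccl", init_method="env://")
    local_rank = int(os.environ.get("LOCAL_RANK", 0))   # 0 for one GPU per pod
    torch.cuda.set_device(local_rank)

    payload = torch.randn(16 * 1024 * 1024, device="cuda")   # ~64 MB of fp32

    for _ in range(5):                # warm-up
        dist.all_reduce(payload)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(50):
        dist.all_reduce(payload)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    if dist.get_rank() == 0:
        print(f"50 all_reduce calls on a 64 MB tensor took {elapsed:.3f}s")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```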

BTW, gang scheduling is also supported for TFJob and PyTorchJob. It helps avoid inter-node communication, but it does not help avoid inter-pod communication.

mengdong commented 4 years ago

@gaocegege In my experiment, on a single 8×V100 node, 1 pod with 8 V100s performs much better than 2 pods with 4 V100s each for some network-bound training tasks. I haven't looked into why this happens, but it is worth investigating.

gaocegege commented 4 years ago

Yeah, I think so. I'd appreciate it if you could investigate it. From my perspective, I think it is caused by the neural network itself, the framework implementation, and the network. From the pytorch-operator side, there is little we can do to improve it.

jtfogarty commented 4 years ago

/kind question
/area engprod
/priority p2

wallarug commented 2 years ago

Hey @mengdong, how did you achieve the multi-GPU-in-a-single-pod test? I am struggling; it doesn't want to work. It would be great to have a template YAML and PyTorch Python file. I am trying with 4 GPUs in a single pod, and it is not working.