kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

What is the standardized distributed training method for pytorch-operator? #1823

Closed · ThomaswellY closed this 1 year ago

ThomaswellY commented 1 year ago

Hi, I have deployed pytorch-operator for distributed training on a k8s cluster and have been struggling with this issue for a while. Here is my yaml (my k8s cluster can only schedule two nodes, named gpu-233 and gpu-44, and all commands in my case are executed on the node gpu-233; I did the same thing with mpi-operator successfully). (screenshots of the yaml attached)

In my case, I have two nodes named gpu-233 and gpu-44 with 3 NVIDIA GPUs and 1 NVIDIA GPU respectively, and I want to start two distributed training processes: one should be executed on the node "gpu-233" with one GPU in the pod "mmpose-master-0", and the other should be executed on the node "gpu-44" with one GPU in the pod "mmpose-worker-0".

After "kubectl apply -f my_yaml" for a while, when submitting "kubectl logs mmpose-worker-0", i can see no log info from the pod of worker, while the pod of master prints correct logs by the same command. At the same time, i found both the nodes of "gpu-233" and "gpu-44" has occupied one gpu respectly, which can be confirmed by "kubectl-view-allocations -r gpu".

Thus, I infer that the worker pod was scheduled on node "gpu-44" and occupied one GPU successfully, but did not execute torch.distributed.launch successfully.

The info about the worker pod is shown below, from "kubectl describe pod mmpose-worker-0": (screenshots attached)

After a period of time, the worker pod reported an error; the error logs are shown in the attached screenshot.

I know there have been abundant topics about distributed training with torch.distributed.launch in pytorch-operator, and the project members have left a lot of patient responses on them. I have to say thank you for such great work!

But please forgive my carelessness: I still could not find an explicit, standardized method to launch distributed training with specific nodes and GPUs.

What do you think is going wrong in my case? @tenzen-y @johnugeorge

Thank you in advance! Any hints would be helpful to me~

kuizhiqing commented 1 year ago

You may want to understand how the MASTER works: you have 2 pods with different IPs, and for them to talk to each other you need the master service.

In your case, you set your MASTER to 127.0.0.1, which sets up the master service locally; then both pods wait for another peer to join, since NNODES=2 means each master waits for 2 nodes before starting.
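To make that concrete, here is a minimal sketch of the rendezvous, assuming the standard PyTorch env:// variables (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) that the controller normally injects into each replica. `init_process_group` blocks until WORLD_SIZE processes reach the same master endpoint, so if MASTER_ADDR is 127.0.0.1 on both pods, each pod only ever rendezvouses with itself and waits forever:

```python
import os
import torch.distributed as dist

# MASTER_ADDR must resolve to the master pod (e.g. the mmpose-master-0 service),
# not 127.0.0.1 -- with 127.0.0.1 each pod talks only to itself and hangs here.
dist.init_process_group(
    backend="nccl",        # "gloo" also works for a quick connectivity check
    init_method="env://",  # reads MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} joined "
      f"{os.environ['MASTER_ADDR']}:{os.environ['MASTER_PORT']}")
dist.destroy_process_group()
```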

ThomaswellY commented 1 year ago

@kuizhiqing Thanks for your reply, my friend~ The available docs about configuring pytorch-operator are insufficient for me, and my background knowledge on deploying it comes from mpi-operator, where the "launcher" declares the command that launches training and the "worker" provides the resources.

But it seems different in pytorch-operator.

I don't know if "MASTER_ADDR" "MASTER_PORT"and "node_rank" should be manually setted like what in my yaml ?

I did another test and set NNODES to 1 for both master and worker; however, only the master finished its training successfully. I also set cleanPodPolicy to None to avoid the worker being cleaned up, and ran "kubectl get events --sort-by='{.metadata.creationTimestamp}' | grep mmpose-worker-0" to see the worker's events; here is the result: (screenshot attached). The pod status is shown below: (screenshot attached). The master finished and released its GPU, while the worker is still running with no logs printed out and its GPU still occupied.

I think there must be something missing in my yaml, so the worker has to wait for something before it can start, even though its status shows the worker pod is already Running.

Supposing I'm going to deploy distributed training with 2 nodes (gpu-233 and gpu-44) and one GPU per node, how should I modify the above yaml?

I have been rushing to deploy distributed training on k8s and really need a solution. Thank you in advance, my friend!

deepanker13 commented 1 year ago

Hi @ThomaswellY, I think you need to make multiple changes, i.e. to your yaml plus your code. Explicitly specifying torch.distributed.launch in the command section of your yaml defeats the purpose of the PyTorch operator. In your code, read the PyTorch environment variables following this post https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide, and accordingly change the entrypoint and cmd in the Dockerfile to: ENTRYPOINT ["python3", relative path to your script according to the work dir] CMD []

After the above changes you can use the yaml below as a reference; it will definitely work:

apiVersion: "kubeflow.org/v1" kind: "PyTorchJob" metadata: name: "pytorch-dist" namespace: training spec: pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: metadata: labels: kind: "deepanker2" annotations: test_vesrion: "1" sidecar.istio.io/inject: "false" spec: affinity: nodeAffinity: requiredDuringSchedulingIgnoredDuringExecution: nodeSelectorTerms:

ThomaswellY commented 1 year ago

@deepanker13 Hi, thanks for your kind help~ Could you please give me another link? The one you posted, https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide, seems to be out of date. I'll give you a quick reply after referring to your post.

tenzen-y commented 1 year ago

You can probably access the link via https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide.

deepanker13 commented 1 year ago

@ThomaswellY You can access the link via https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide.

ThomaswellY commented 1 year ago

@tenzen-y Thanks for your help! I have read https://lambdalabs.com/blog/multi-node-pytorch-distributed-training-guide carefully. I think it's a good reference for running distributed training on clusters of inter-connected bare-metal servers, but it lacks any explanation of how to run distributed training via the training-operator, more specifically how to write the yaml (that's the whole point of this repo, right?). PS: On a k8s cluster the nodes do not need password-less access to each other, so compared with submitting commands in a terminal on bare-metal servers, the training-operator is more convenient and works out of the box, right?

There are several related discussions in the repo, like https://github.com/kubeflow/training-operator/issues/1713 and https://github.com/kubeflow/training-operator/issues/1532; both of them launch distributed training the same way I did in my yaml, i.e. with something like [ "/bin/bash", "-c", "python3 -m torch.distributed.launch" ……]. The structure of these yamls has not been rejected or criticized by other community members so far, so does that mean this way of launching distributed training is supported and advocated?

TBH, my friend @deepanker13, the yaml you posted above does not contain any explicit torch.distributed.launch command or args like "node_rank", "nproc_per_node" and so on, so I have to kindly doubt whether it will work in my case.

You mentioned that "specifying torch.distributed.launch defeats the purpose of PyTorch operator……change the entrypoint and cmd to: ENTRYPOINT ["python3", relative path to your script according to work dir] CMD []".

So, did you mean that "torch.distributed.launch" and other args like "node_rank" etc. should not be written here? I haven't seen this kind of standardized command for launching distributed training in training-operator before; would you mind giving me a more concrete example yaml?

I guess I'm not alone in this confusion, and solving this issue should benefit the whole community. If I made any mistakes, please let me know. Looking forward to your reply, my friends~

deepanker13 commented 1 year ago

@ThomaswellY I have followed the official documentation https://www.kubeflow.org/docs/components/training/pytorch/ and the example given in its "Creating a PyTorch training job" section, where we don't need to mention torch.distributed.launch. This is the sample yaml it uses: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

This is the Dockerfile and code: https://github.com/kubeflow/training-operator/tree/master/examples/pytorch/mnist

Answering your questions:

  1. "node_rank" and "nproc_per_node" should be environment variables in your code that are set by the PyTorch operator; see the example code I shared above and the diagnostic sketch at the end of this comment.
  2. I think mentioning torch.distributed.launch is a hacky and wrong approach; otherwise, I would wonder what the purpose of the PyTorch job operator is.

So please try the official documentation and it will definitely work.
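A quick way to check which of these variables the operator actually injected into a pod is to print them at the top of the training script (a small diagnostic sketch; anything the operator does not set will simply show "<not set>"):

```python
import os

# Environment variables the PyTorchJob controller is expected to inject per pod.
for key in ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK"):
    print(f"{key} = {os.environ.get(key, '<not set>')}")
```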

ThomaswellY commented 1 year ago

@deepanker13 Sorry my friend, the link you pointed to is out of date. Do you mean this sample yaml: https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml? However, there is no Dockerfile……

deepanker13 commented 1 year ago

@ThomaswellY please try these links

https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/simple.yaml

https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/Dockerfile

ThomaswellY commented 1 year ago

@deepanker13 Thank you for your kind help, my friend~ The yaml you provided did work, but only for one-node-one-gpu training. The central issue still remains to be solved: how to modify that yaml to start distributed training? Let me paste the main part of the yaml below: (screenshot attached). As you can see, it asks for two pods (master and worker); the master occupies two GPUs and the worker occupies one GPU.

In my opinion, if this yaml really deployed distributed training successfully, it should involve three processes, two on the master pod and one on the worker pod, and the logs printed by the two pods should be visibly different. But here are the logs printed by the two pods: (screenshots attached). In fact they are exactly the output of single-gpu training on my bare-metal server. It seems the processes on these two pods are exactly the same! That is not distributed training at all.

Also, the mnist.py script is too simple; it seems not to support distributed training, because environment args like "master_port", "master_addr", "world_size" and "rank" are not given in the yaml. It still needs a lot of modification, and how to achieve that is what I have been looking for in this repo.

@johnugeorge @kuizhiqing I'm sorry to bother you, but I really want to know: has training-operator ever provided an example yaml of distributed training with pytorch-operator? If not, what's the advantage of pytorch-operator, since I can simply run "bash dist_train.sh" on a bare-metal server, which is easy to use with torch.distributed.launch because setting the rank of each GPU and node is not difficult, right?

johnugeorge commented 1 year ago

@ThomaswellY How did you conclude that there is no distributed training?

DistributedDataParallel won't start unless all participating nodes have joined https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/mnist/mnist.py#L136
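One way to confirm that both pods actually joined the same process group is a tiny all-reduce right after initialization (a sketch, not part of mnist.py): the printed sum should equal the world size on every pod, and the call will hang if a peer never joins.

```python
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo", init_method="env://")  # blocks until every rank joins
t = torch.ones(1)
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # each rank contributes 1
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce sum = {t.item()}")
dist.destroy_process_group()
```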

You can check examples in https://github.com/kubeflow/training-operator/tree/master/examples/pytorch

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.