kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Questions about applying for nodes and GPUs #558

Open. ThomaswellY opened this issue 1 year ago

ThomaswellY commented 1 year ago

Hi, I have been using mpi-operator for distributed training recently. The command I use most is `kubectl apply -f <yaml>`. Let me take this MPIJob yaml as an example:

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          nodeName:
          containers:
```

tenzen-y commented 1 year ago

@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

alculquicondor commented 1 year ago

or you can consider upgrading to the v2beta API :)

To answer some of your questions: Ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

> If I have node-A with 1 GPU and node-B with 3 GPUs, and I want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.
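
For illustration, a minimal v2beta1 MPIJob sketch of that layout (4 workers, 1 GPU and 1 slot each); the job name, image, and training command below are placeholders, not taken from this thread:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: train-4gpu                  # placeholder job name
spec:
  slotsPerWorker: 1                 # one MPI slot per worker, matching 1 GPU per worker
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: launcher
            image: my-training-image:latest            # placeholder image
            command: ["mpirun", "python", "train.py"]  # placeholder command
    Worker:
      replicas: 4                   # 4 worker pods x 1 GPU each = 4 GPUs in total
      template:
        spec:
          containers:
          - name: worker
            image: my-training-image:latest            # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1   # request a single GPU per worker pod
```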

ThomaswellY commented 1 year ago

> @ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator since the mpi-operator doesn't support v1 API?

Thanks for your reply~ The api-resources of my k8s cluster are shown below:

```console
(base) [root@gpu-233 operator]# kubectl api-resources | grep jobs
cronjobs      cj   batch/v1          true   CronJob
jobs               batch/v1          true   Job
mpijobs            kubeflow.org/v1   true   MPIJob
mxjobs             kubeflow.org/v1   true   MXJob
pytorchjobs        kubeflow.org/v1   true   PyTorchJob
tfjobs             kubeflow.org/v1   true   TFJob
xgboostjobs        kubeflow.org/v1   true   XGBoostJob
```

Doesn't that indicate that, in my k8s cluster, mpijobs are supported by the kubeflow.org/v1 API?
I have applied the configs of the example yaml with the kubeflow.org/v1 API successfully, and have seen no significant errors in the pod logs. @tenzen-y

ThomaswellY commented 1 year ago

Thanks for your reply~ I am a little confused about which API version supports my resource (MPIJob in my case). The command `kubectl api-resources` shows that mpijobs in my k8s cluster are served by kubeflow.org/v1. If that is not the right way to check, what is the proper way to confirm which API supports my MPIJob resources? Any official docs would be helpful~

> or you can consider upgrading to the v2beta API :)

> To answer some of your questions: Ideally, the number of workers should match the number of nodes you want to run on. The slotsPerWorker field denotes how many tasks will run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set OMP_NUM_THREADS, since that's actually what slotsPerWorker sets.

> If I have node-A with 1 GPU and node-B with 3 GPUs, and I want to request 4 GPUs, how should I modify the "Worker" part?

> In that case, you might want to set the number of GPUs per worker to 1 (along with slotsPerWorker to 1) and have replicas=4. Not ideal, but it should work.

I have applied the example yaml in this way successfully, but it seems that the 4 GPUs are used separately by 4 pods, and each worker executed a single-GPU training. So it is not distributed training (by which I mean multi-node training with a single GPU per node), and the whole process takes more time than single-GPU training in one pod with "replicas=1". What confuses me is that the value of "replicas" seems to only serve as a multiplier for "nvidia.com/gpu". In general, there are some things I want to confirm:

  1. How can I confirm which API supports the mpi-operator? If "kubectl api-resources" is not the right way, which command should I run instead?
  2. When the resource limit sets the GPU number to 1 (because each node of the k8s cluster has only one GPU available in this case), distributed training cannot be launched; even though multiple pods can each run single-GPU training when replicas>1, that is in fact just repeating single-GPU training.
  3. If I have node-1 with 2 GPUs and node-2 with 4 GPUs, the most effective distributed training that mpi-operator can launch is 2 nodes with 2 GPUs per node, and the ideal config is "slotsPerWorker: 2", "replicas: 2", and "nvidia.com/gpu: 2".

The questions are a little too many; I am sorry if that troubles you. Thank you in advance~ @alculquicondor

alculquicondor commented 1 year ago

> Doesn't that indicate that, in my k8s cluster, mpijobs are supported by the kubeflow.org/v1 API?

That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo. If you wish to use the newer v2beta1 version, you have to disable training-operator and install the operator from this repo: https://github.com/kubeflow/mpi-operator#installation

The rest of the questions:

  1. The command did work; you are running v1.
  2. It sounds like a problem in your application, not mpi-operator. Did you miss any parameters in your command? I'm not familiar with deepspeed.
  3. Yes, see the sketch below.
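
For reference, a sketch of how the config from question 3 could look; only the fields that differ from the earlier sketch are shown, and this is an illustration rather than a complete manifest (the container name is a placeholder):

```yaml
# Fragment of an MPIJob spec for 2 worker nodes with 2 GPUs each
spec:
  slotsPerWorker: 2          # 2 MPI slots per worker, matching 2 GPUs per worker
  mpiReplicaSpecs:
    Worker:
      replicas: 2            # one worker pod per node
      template:
        spec:
          containers:
          - name: worker
            resources:
              limits:
                nvidia.com/gpu: 2   # 2 GPUs per worker pod
```
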
tenzen-y commented 1 year ago

@ThomaswellY Thanks @alculquicondor. Yes, I meant that this repo doesn't support kubeflow.org/v1; this repo supports only kubeflow.org/v2beta1. Currently, kubeflow.org/v1 is supported in https://github.com/kubeflow/training-operator.
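
In other words, the apiVersion field of the manifest is what decides which operator serves the job; a minimal illustration of the two variants:

```yaml
# Served by kubeflow/training-operator (legacy v1 MPIJob)
apiVersion: kubeflow.org/v1
kind: MPIJob
---
# Served by kubeflow/mpi-operator (this repo)
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
```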

Also, I would suggest the v2beta1 MPIJob for deepspeed, given https://github.com/kubeflow/training-operator/issues/1792#issuecomment-1519576554.

alculquicondor commented 1 year ago

Also, it seems that https://github.com/kubeflow/mpi-operator/pull/549 shows that v2beta1 can run deepspeed.

ThomaswellY commented 1 year ago

@alculquicondor @tenzen-y Thanks for your kind help! Maybe I should use v2beta1 for deepspeed. Anyway, I have run #549 successfully even on v1; however, it seems only cifar10_deepspeed.py needs no modifications, while gan_deepspeed_train.py needs an extra modification (like `args.local_rank = int(os.environ['LOCAL_RANK'])`). So #549 is only an example of applying mpi-operator with deepspeed; maybe we can do more so that other scripts work with deepspeed without changes.

tenzen-y commented 1 year ago

@ThomaswellY Thank you for the report!

> So https://github.com/kubeflow/mpi-operator/pull/549 is only an example of applying mpi-operator with deepspeed; maybe we can do more so that other scripts work with deepspeed without changes.

Feel free to open PRs. I'm happy to review them :)