ThomaswellY opened this issue 1 year ago
@ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator, since the mpi-operator doesn't support the v1 API? Or you can consider upgrading to the v2beta1 API :)

To answer some of your questions:

Ideally, the number of workers should match the number of nodes you want to run on. The `slotsPerWorker` field denotes how many tasks will be run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set `OMP_NUM_THREADS`, since that's actually what `slotsPerWorker` sets.

> If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?

In that case, you might want to set the number of GPUs per worker to 1 (along with `slotsPerWorker: 1`) and have `replicas: 4`. Not ideal, but it should work.
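As a concrete sketch of that suggestion (the container name and image below are placeholders, not taken from this thread), the relevant parts of the MPIJob spec would look roughly like:

```yaml
# Hypothetical Worker section: 4 replicas with 1 GPU each,
# matching slotsPerWorker: 1 at the top of the spec.
spec:
  slotsPerWorker: 1
  mpiReplicaSpecs:
    Worker:
      replicas: 4
      template:
        spec:
          containers:
          - name: worker                      # placeholder name
            image: my-training-image:latest   # placeholder image
            resources:
              limits:
                nvidia.com/gpu: 1
```

Each of the 4 worker pods then requests a single GPU, so the scheduler can place one pod on node-A and three on node-B.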
> @ThomaswellY Can you create an issue on https://github.com/kubeflow/training-operator, since the mpi-operator doesn't support the v1 API?

Thanks for your reply~
The api-resources of my k8s cluster are shown below:
```
(base) [root@gpu-233 operator]# kubectl api-resources | grep jobs
cronjobs      cj    batch/v1          true    CronJob
jobs                batch/v1          true    Job
mpijobs             kubeflow.org/v1   true    MPIJob
mxjobs              kubeflow.org/v1   true    MXJob
pytorchjobs         kubeflow.org/v1   true    PyTorchJob
tfjobs              kubeflow.org/v1   true    TFJob
xgboostjobs         kubeflow.org/v1   true    XGBoostJob
```
Doesn't that indicate that, in my k8s cluster env, mpijobs are supported by the kubeflow.org/v1 API? I have applied the example yaml with the kubeflow.org/v1 API successfully, and have seen no significant errors in the pod logs.
@tenzen-y Thanks for your reply~ I am a little confused about which API version can serve my resource (mpijob in my case). The command `kubectl api-resources` shows that mpijobs in my k8s cluster are served by kubeflow.org/v1. If that is not the case, what is the right way to confirm which API serves my mpijobs resource? Any official docs would be helpful~
> Or you can consider upgrading to the v2beta1 API :)
>
> To answer some of your questions: Ideally, the number of workers should match the number of nodes you want to run on. The `slotsPerWorker` field denotes how many tasks will be run in each worker. Generally, this should match the number of GPUs you have per worker. You don't need to set `OMP_NUM_THREADS`, since that's actually what `slotsPerWorker` sets.
>
> > If I have node-A with 1 GPU and node-B with 3 GPUs, and want to request 4 GPUs, how should I modify the "Worker" part?
>
> In that case, you might want to set the number of GPUs per worker to 1 (along with `slotsPerWorker: 1`) and have `replicas: 4`. Not ideal, but it should work.
I have applied the example yaml this way successfully, but it seems the 4 GPUs are used separately by 4 pods, and what each worker executed was single-GPU training. So it's not distributed training (in this case, I mean multi-node, single-GPU-per-node training), and the whole process takes more time than single-GPU training in one pod with `replicas: 1`. What confuses me is that the value of `replicas` seems to serve only as a multiplier for `nvidia.com/gpu`. In general, there are some things I want to confirm:
> Doesn't that indicate that, in my k8s cluster env, mpijobs are supported by the kubeflow.org/v1 API?

That is correct. @tenzen-y's point is that the v1 implementation is no longer hosted in this repo. If you wish to use the newer v2beta1 version, you have to disable the training-operator and install the operator in this repo: https://github.com/kubeflow/mpi-operator#installation
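For reference, installing the standalone operator is a single apply of its released manifest; the exact manifest URL below follows the pattern in the mpi-operator README and should be checked against the linked installation section for the current release:

```shell
# Sketch (assumed manifest path): install the standalone mpi-operator,
# which serves the kubeflow.org/v2beta1 MPIJob API.
kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml

# Then confirm which API group/version actually serves MPIJob in the cluster:
kubectl api-resources | grep mpijobs
```

If both operators are installed, the `kubectl api-resources` output is the quickest way to see which group/version the cluster is serving for `mpijobs`.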
The rest of the questions:
@ThomaswellY
Thanks @alculquicondor.
Yes, I meant this repo doesn't support `kubeflow.org/v1`; this repo supports only `kubeflow.org/v2beta1`. Currently, `kubeflow.org/v1` is supported in https://github.com/kubeflow/training-operator.
Also, I would suggest the `v2beta1` MPIJob for deepspeed, given https://github.com/kubeflow/training-operator/issues/1792#issuecomment-1519576554.
Also, it seems that https://github.com/kubeflow/mpi-operator/pull/549 demonstrates that v2beta1 can run deepspeed.
@alculquicondor @tenzen-y Thanks for your kind help! Maybe I should use v2beta1 for deepspeed. Anyway, I have run #549 successfully even on v1; however, it seems that only cifar10_deepspeed.py needs no modifications. As for gan_deepspeed_train.py, an extra modification is necessary (like `args.local_rank = int(os.environ['LOCAL_RANK'])`). So #549 is only an example of applying mpi-operator with deepspeed; maybe we can do more to support other scripts with deepspeed out of the box.
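The workaround mentioned above can be wrapped in a small helper. This is only a sketch, and `resolve_local_rank` is a hypothetical name, not something from #549:

```python
import os

def resolve_local_rank(cli_value: int = -1) -> int:
    """Return the launcher-provided --local_rank if it was set (!= -1);
    otherwise fall back to the LOCAL_RANK environment variable exported
    for each process, defaulting to 0 for single-process runs."""
    if cli_value != -1:
        return cli_value
    return int(os.environ.get("LOCAL_RANK", "0"))
```

A script like gan_deepspeed_train.py could then call this right after argument parsing, e.g. `args.local_rank = resolve_local_rank(args.local_rank)`, so it works both when the launcher passes `--local_rank` and when only the environment variable is set.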
@ThomaswellY Thank you for the report!
> So #549 is only an example of applying mpi-operator with deepspeed; maybe we can do more to support other scripts with deepspeed.
Feel free to open PRs. I'm happy to review them :)
Hi, I have been using mpi-operator to achieve distributed training recently. The command I use most is `kubectl apply -f <yaml-file>`. Let me take this mpi-operator yaml for example:

```yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: cifar
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      restartPolicy: Never
      template:
        spec:
          nodeName:
          containers:
```