kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Add tolerations only to specific worker pods #539

Open anxietymonger opened 1 year ago

anxietymonger commented 1 year ago

I am currently facing a scenario where I need to schedule the worker0 pod onto specific nodes. Is there any way to configure custom tolerations or labels only for specific worker pods (like worker0)? Thank you in advance for any information.

tenzen-y commented 1 year ago

No, the mpi-operator doesn't support configuring parameters for a specific worker.
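
To illustrate why, here is a minimal sketch of an MPIJob manifest built as a plain Python dict. It assumes the `kubeflow.org/v2beta1` API, where `mpiReplicaSpecs` holds one `Launcher` and one `Worker` entry, each with a single pod template. The job name, image, and toleration values are hypothetical placeholders. Because every worker replica is stamped from the same template, a toleration placed there applies to all workers, not only worker0.

```python
import copy


def build_mpijob(name, num_workers, worker_tolerations):
    """Build an MPIJob manifest as a plain dict (illustrative values only)."""
    # Hypothetical container; replace with a real image in practice.
    container = {"name": "main", "image": "example.com/mpi-app:latest"}
    return {
        "apiVersion": "kubeflow.org/v2beta1",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "mpiReplicaSpecs": {
                "Launcher": {
                    "replicas": 1,
                    "template": {"spec": {"containers": [copy.deepcopy(container)]}},
                },
                # One template shared by all worker replicas.
                "Worker": {
                    "replicas": num_workers,
                    "template": {
                        "spec": {
                            "containers": [copy.deepcopy(container)],
                            "tolerations": worker_tolerations,
                        }
                    },
                },
            }
        },
    }


# Hypothetical toleration for nodes tainted with dedicated=special:NoSchedule.
job = build_mpijob(
    "demo",
    4,
    [{"key": "dedicated", "operator": "Equal", "value": "special", "effect": "NoSchedule"}],
)
worker = job["spec"]["mpiReplicaSpecs"]["Worker"]
launcher = job["spec"]["mpiReplicaSpecs"]["Launcher"]
```

The launcher has its own template, so it can be scheduled differently from the workers, but there is no per-index field for individual workers.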

tenzen-y commented 1 year ago

/kind question

anxietymonger commented 1 year ago

Thank you very much for your information. Closing the issue.

alculquicondor commented 1 year ago

Why is worker 0 special, in addition to the launcher already being "special"?

anxietymonger commented 1 year ago

In my case, worker0 is special because rank0 needs to access some resources that only exist on specific nodes. It is true that the launcher is already special to some extent, but in my understanding, the launcher won't do any computation, right? BTW, is it possible to make rank0 run on the launcher?

alculquicondor commented 1 year ago

That is correct, the launcher just coordinates.

Having the workers in their own pods has the advantage that the resources can be exclusive to the worker computations, as opposed to being shared with launcher tasks. Nothing prohibits the launcher pod from being on the same node as other workers, but you have the isolation of the pod namespaces to have better control.

alculquicondor commented 1 year ago

I wonder how common the specialization you mention is.

Would we need to add support for an arbitrary number of pod templates?
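
As a rough sketch of what that could mean, the snippet below resolves a per-worker pod template from a shared base plus optional per-index overrides. This is purely hypothetical: `resolve_worker_template`, the `overrides` mapping, and the image name are made up for illustration and are not part of the mpi-operator API.

```python
import copy


def resolve_worker_template(base_template, overrides, index):
    """Return the pod template for worker `index`.

    `overrides` maps a worker index to pod-spec fields that replace the
    base values. Hypothetical design, not an mpi-operator feature.
    """
    template = copy.deepcopy(base_template)
    template["spec"].update(copy.deepcopy(overrides.get(index, {})))
    return template


# Shared base template (hypothetical image).
base = {"spec": {"containers": [{"name": "main", "image": "example.com/mpi-app:latest"}]}}

# Only worker 0 receives an extra toleration.
overrides = {
    0: {"tolerations": [{"key": "dedicated", "operator": "Exists", "effect": "NoSchedule"}]}
}

templates = [resolve_worker_template(base, overrides, i) for i in range(3)]
```

Here only `templates[0]` carries the toleration, while the other workers and the base template stay unchanged.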

cc @ahg-g @danielvegamyhre

alculquicondor commented 1 year ago

/reopen

google-oss-prow[bot] commented 1 year ago

@alculquicondor: Reopened this issue.

alculquicondor commented 1 year ago

This was also discussed in #384