kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
419 stars, 210 forks

Add the ability to disable worker/launcher pod name suffix #591

Closed by AymenFJA 10 months ago

AymenFJA commented 10 months ago

MPI Operator currently appends a random 5-character suffix to the launcher and worker pod names.

For example:

NAME                               READY   STATUS    RESTARTS      AGE
myjob-mpi-ctask-4-launcher-qp6xv   0/1     Running   5 (23s ago)   3m38s
myjob-mpi-ctask-4-worker-0         1/1     Running   0             3m38s

This suffix causes the pod names to be unique for each job run. However, in some cases static pod names may be preferred, provided the names are guaranteed to be generated in a unique way.

If this feature already exists, please ignore my proposal and direct me to how I can disable this behavior. If not, then:

Proposal:

Add an optional field or setting to allow disabling the random pod name suffix, for example:


apiVersion: kubeflow.org/v1
kind: MPIJob
spec:
  disablePodNameSuffix: true

This would allow users to have deterministic pod names across multiple job runs, like:

ctask-0-mpi-launcher
ctask-0-mpi-worker

Potential downsides might be naming conflicts if pods are not cleaned up properly. Logs may also be harder to disambiguate.

But having the option to turn off the random suffix would provide more flexibility for users who want static pod names.

Let me know if this is something that could be considered for a future release!

alculquicondor commented 10 months ago

This is not possible. The launcher Pod is created through the Job API, which would use random name generation.
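
To illustrate why the launcher name ends in something like `qp6xv`, here is a minimal Python sketch of how name generation of this kind behaves: a fixed prefix plus a short random suffix. The alphabet and the helper name `generate_name` are assumptions made for illustration; this is not the Kubernetes or MPI Operator source.

```python
import random
from typing import Optional

# Assumption: the suffix alphabet omits vowels and look-alike characters,
# which is consistent with the "qp6xv" suffix shown in the issue.
ALPHABET = "bcdfghjklmnpqrstvwxz2456789"
SUFFIX_LEN = 5  # the issue reports a 5-character suffix


def generate_name(prefix: str, rng: Optional[random.Random] = None) -> str:
    """Return prefix plus a random suffix (hypothetical helper)."""
    rng = rng or random.Random()
    suffix = "".join(rng.choice(ALPHABET) for _ in range(SUFFIX_LEN))
    return prefix + suffix


# The Job is named after the launcher; its Pod gets "<job-name>-<random>".
launcher_job = "myjob-mpi-ctask-4-launcher"
pod_name = generate_name(launcher_job + "-")
print(pod_name)
```

Because the suffix is drawn when the Pod is created, the MPIJob spec has no direct control over it, which matches the explanation above.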

I would suggest using labels to identify the pods. Your logging solution might have some tooling that lets you see all the logs for a given label.
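
As a rough sketch of the label-based approach, the Python snippet below builds a label selector and filters pod records by label instead of by name. The label keys `training.kubeflow.org/job-name` and `training.kubeflow.org/job-role` are assumptions about what the operator sets, so check the labels on your own pods (for example with `kubectl get pods --show-labels`). The pod data is made up for illustration.

```python
from typing import Dict, List

# Assumed label keys; verify against the labels on your own pods.
JOB_NAME_LABEL = "training.kubeflow.org/job-name"
JOB_ROLE_LABEL = "training.kubeflow.org/job-role"


def label_selector(labels: Dict[str, str]) -> str:
    """Build an equality-based selector string such as 'a=b,c=d'."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


def select_pods(pods: List[dict], wanted: Dict[str, str]) -> List[str]:
    """Return names of pods whose labels include every wanted key/value."""
    return [
        pod["name"]
        for pod in pods
        if all(pod["labels"].get(k) == v for k, v in wanted.items())
    ]


# Illustrative pod records, not real cluster output.
pods = [
    {"name": "myjob-mpi-ctask-4-launcher-qp6xv",
     "labels": {JOB_NAME_LABEL: "myjob-mpi-ctask-4", JOB_ROLE_LABEL: "launcher"}},
    {"name": "myjob-mpi-ctask-4-worker-0",
     "labels": {JOB_NAME_LABEL: "myjob-mpi-ctask-4", JOB_ROLE_LABEL: "worker"}},
]

wanted = {JOB_NAME_LABEL: "myjob-mpi-ctask-4", JOB_ROLE_LABEL: "launcher"}
selector = label_selector(wanted)
launchers = select_pods(pods, wanted)
print(selector)
print(launchers)
```

The same selector string can be passed to `kubectl get pods -l <selector>` or `kubectl logs -l <selector>`, so the launcher can be found without knowing its random suffix.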

AymenFJA commented 10 months ago

Thank you, @alculquicondor, that makes sense.