kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set #2277

Open alenawang opened 1 month ago

alenawang commented 1 month ago

What happened?

When elasticPolicy is set on the manifest but minReplicas and maxReplicas are not passed in explicitly, the PET_NNODES env var is set to x:x, where x is the number of Worker replicas only; the Master replica does not seem to be included in this count. When elasticPolicy is not set, PET_NNODES is set to a single number equal to the Master plus the number of Worker replicas, which seems correct. A minimal manifest that reproduces the elastic case is sketched below.
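
To make the repro concrete, here is a minimal sketch of such a manifest (the job name and image are placeholders, not taken from our setup). With 1 Master and 2 Workers and minReplicas/maxReplicas omitted, the observed value is PET_NNODES=2:2, whereas 3:3 would account for all replicas:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pet-nnodes-repro              # placeholder name
spec:
  elasticPolicy:
    rdzvBackend: c10d                 # minReplicas/maxReplicas intentionally omitted
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>   # placeholder image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>   # placeholder image
# Observed on the pods: PET_NNODES=2:2 (Worker replicas only)
# Expected (if the Master is counted): PET_NNODES=3:3
```

The injected value can be checked on a running pod with something like `kubectl exec <pod-name> -- printenv PET_NNODES`.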

What did you expect to happen?

We expected PET_NNODES to be set to x:x, where x is the total number of replicas (Master + Workers). Does this make sense? If so, we would be interested in contributing a fix.

Environment

Kubernetes version: v1.29.8

Training Operator version: v1-855e096, also tested a local build using the latest on master

Training Operator Python SDK version: N/A

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

kuizhiqing commented 1 month ago

Thanks for this feedback.

Actually, by design there is no need to set a Master at all; see https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml.

This design makes sense, since in the elastic scenario all nodes are treated equally.
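
For reference, a Worker-only elastic spec in the spirit of those examples might look roughly like this (name, image, and replica counts are illustrative, not copied from the linked files):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-worker-only          # placeholder name
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 3
  pytorchReplicaSpecs:
    # No Master replica: every pod is a Worker, and all nodes are treated equally
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-training-image>   # placeholder image
```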