Open alenawang opened 1 month ago
Thanks for this feedback.
Actually, by design there is no need to set a Master at all; see https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml.
This design makes sense because, in the elastic scenario, all nodes are treated equally.
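For reference, the linked examples define only a Worker replica spec. A minimal sketch modeled on the echo example (the name, image, and numeric values here are illustrative, not copied from the repo):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-echo-sketch        # illustrative name
spec:
  elasticPolicy:
    rdzvBackend: c10d              # elastic rendezvous backend
    minReplicas: 1
    maxReplicas: 3
    maxRestarts: 100
  pytorchReplicaSpecs:
    Worker:                        # no Master spec: all nodes are treated equally
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/elastic-echo:latest   # placeholder image
```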
What happened?
When elasticPolicy is set in the manifest but the user does not explicitly pass minReplicas or maxReplicas, the PET_NNODES env var is set to `x:x`, where `x` is the number of Worker replicas only; it does not appear to include the Master replica in this count. When elasticPolicy is not set, PET_NNODES is set to a single number equal to the Master plus the number of Worker replicas, which seems correct.
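A manifest along these lines should reproduce the behavior described above (a hedged sketch; the job name, image, and replica counts are assumed for illustration):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pet-nnodes-repro           # illustrative name
spec:
  elasticPolicy:
    rdzvBackend: c10d              # minReplicas/maxReplicas intentionally omitted
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/train:latest   # placeholder image
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: my-registry/train:latest   # placeholder image
```

The `x:x` form suggests that, when `minReplicas`/`maxReplicas` are omitted, the operator defaults both to the Worker replica count; that is our reading of the observed behavior, not something confirmed from the code.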
What did you expect to happen?
We expected PET_NNODES to be set to `x:x`, where `x` is the total number of replicas (Master + Workers). Does this make sense? If so, we would be interested in contributing this fix.
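For concreteness, with the repro sketch above (1 Master + 3 Workers; counts assumed), the contrast would be:

```yaml
# Observed (Workers only):      PET_NNODES=3:3
# Expected (Master + Workers):  PET_NNODES=4:4
```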
Environment

Kubernetes version: v1.29.8
Training Operator version: v1-855e096, also tested with a local build using the latest on master
Training Operator Python SDK version: N/A
Impacted by this bug?
Give it a 👍 We prioritize the issues with the most 👍