kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set #2277

Open alenawang opened 2 weeks ago

alenawang commented 2 weeks ago

What happened?

When elasticPolicy is set on the manifest but the user does not pass minReplicas or maxReplicas explicitly, the PET_NNODES env var is set to x:x, where x is the number of Worker replicas only; it does not include the Master replica in this count. When elasticPolicy is not set, PET_NNODES is set to a single number equal to the Master plus the Worker replicas, which seems correct.
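
For concreteness, a minimal sketch of the kind of manifest that hits this case (the name and image are placeholders added for illustration, not from the original report):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-elastic-repro        # hypothetical name for illustration
spec:
  elasticPolicy:
    rdzvBackend: c10d
    # minReplicas and maxReplicas intentionally omitted
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/example/train:latest   # placeholder image
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/example/train:latest   # placeholder image
```

With a spec like this, the pods end up with PET_NNODES=2:2 (Workers only) rather than 3:3 (Master + Workers).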

What did you expect to happen?

We expected PET_NNODES to be set to x:x where x is the total number of replicas (Master + Workers). Does this make sense? If so, we would be interested in contributing this fix.

Environment

Kubernetes version: v1.29.8

Training Operator version: v1-855e096, also tested a local build using the latest on master

Training Operator Python SDK version: N/A

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

kuizhiqing commented 2 weeks ago

Thanks for this feedback.

Actually, by design there is no need to set a Master at all; see https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml.

This design makes sense, since in the elastic scenario all nodes are treated equally.
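
For reference, the linked examples declare only Workers, roughly along these lines (a condensed sketch; the name and image are illustrative, not copied verbatim from the example files):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-example              # illustrative name
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 3
    maxRestarts: 100
  pytorchReplicaSpecs:
    Worker:                          # no Master replica spec at all
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: docker.io/example/echo:latest   # placeholder; see the linked example
```

In this layout every rank is a Worker, so counting only Worker replicas for PET_NNODES is consistent with the design.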