Open srinandan opened 1 year ago
The problem is in the PyTorch launcher. It appears the training operator does not like a master spec of `{}`; my PyTorchJob did not have a master spec at all.
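For illustration, here is a minimal sketch of a worker-only PyTorchJob for collective training, i.e. a spec with no Master replica at all, as opposed to the empty `Master: {}` entry the launcher emits (the job name, namespace, and image below are placeholders):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-example      # placeholder name
  namespace: kubeflow             # placeholder namespace
spec:
  pytorchReplicaSpecs:
    # No Master entry here; the launcher instead generates `Master: {}`,
    # which the operator appears to choke on.
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch       # the PyTorchJob controller expects a container named "pytorch"
              image: example.com/my-training-image:latest   # placeholder image
```

Presumably either the launcher should omit the Master key entirely when no master spec is supplied, or the operator should tolerate an empty replica spec.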
Yes, there is some confusion here about whether the master spec needs to be set in collective training mode; hopefully we will solve it soon.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/good-first-issue
@johnugeorge: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
I am interested in working on this good first issue. Can you explain the recommended solution? /assign
WHAT DID YOU DO:
Deployed Kubeflow 1.7.0 to a 1.25.8-gke.1000 GKE cluster. The training-operator image installed is:
kubeflow/training-operator:v1-5a5f92d
EXPECTED:
I started a run for a pipeline (kfp version 1.8) and expected the training job to start.
ACTUAL:
The training-operator pod crashes with CrashLoopBackOff.
Logs from the container: