Open SecretSun opened 1 month ago
Thank you for creating this @SecretSun! Are you looking to improve the Failed message in TFJob events: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L501-L503 ?
/area monitoring /remove-label lifecycle/needs-triage
content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-uni-item-eval-2024-09-09-043604\', UID:\'f31ff823-5904-41cb-b490-e368426e8061\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'906983038\', FieldPath:\'\'}): type: 'Normal' reason: 'TFJobFailed' TFJob android-uni-item-eval-2024-09-09-043604 has failed because 1 Worker replica(s) failed.','time':'2024-09-09T20:52:45Z'}"
This message does not make clear which replica failed. Could the event identify the specific worker?
What you would like to be added?
The tf-job-operator v1.0 metrics should expose which specific pods failed.
The logging details are as follows:
content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-consume-v2-update-2024-08-20-053447\', UID:\'803a4aca-561b-4609-9ee3-8953f075b66c\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'860397344\', FieldPath:\'\'}): type: 'Normal' reason: 'ExitedWithCode' Pod: iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 exited with code 1','time':'2024-08-20T22:35:42Z'}"
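Until the failed pod is exposed directly, one possible workaround is to pull it out of the `ExitedWithCode` event text. The following is only a minimal sketch: it assumes the message keeps the exact wording shown in the log line above, and the helper name `find_failed_pod` and the regular expression are hypothetical, not part of the training operator.

```python
import re

# Assumption: the event message keeps the wording seen in the log above:
#   reason: 'ExitedWithCode' Pod: <namespace>.<pod-name> exited with code <N>
EXITED_RE = re.compile(
    r"reason: 'ExitedWithCode' "
    r"Pod: (?P<namespace>[^.\s]+)\.(?P<pod>\S+) "
    r"exited with code (?P<code>-?\d+)"
)


def find_failed_pod(log_line):
    """Return namespace, pod name and exit code from one operator log line.

    Returns None when the line is not an ExitedWithCode event.
    """
    match = EXITED_RE.search(log_line)
    if match is None:
        return None
    return {
        "namespace": match.group("namespace"),
        "pod": match.group("pod"),
        "exit_code": int(match.group("code")),
    }


# Tail of the log line quoted above, used as sample input.
sample = (
    "type: 'Normal' reason: 'ExitedWithCode' Pod: "
    "iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 "
    "exited with code 1','time':'2024-08-20T22:35:42Z'}"
)
failed = find_failed_pod(sample)
```

This only works when the `ExitedWithCode` event is still present in the logs, which is why having the pod name in the `TFJobFailed` message or in a metric would be more reliable.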
Why is this needed?
This is needed to quickly locate the specific training node that has a problem.
Love this feature?
Give it a 👍 We prioritize the features with most 👍