kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.57k stars 686 forks

Regarding whether the tf-job-operator v1.0 metrics can expose specific failed pods #2220

Open SecretSun opened 1 month ago

SecretSun commented 1 month ago

What you would like to be added?

The tf-job-operator v1.0 metrics should expose which specific pods failed.

The relevant log entry is as follows:

content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-consume-v2-update-2024-08-20-053447\', UID:\'803a4aca-561b-4609-9ee3-8953f075b66c\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'860397344\', FieldPath:\'\'}): type: 'Normal' reason: 'ExitedWithCode' Pod: iem-trs-training.android-consume-v2-update-2024-08-20-053447-worker-0 exited with code 1','time':'2024-08-20T22:35:42Z'}"

Why is this needed?

We need to quickly locate the specific training worker (pod) that has a problem.
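Until such a metric exists, the failed pod can be recovered from the `ExitedWithCode` event text shown in the log entry above. The following is a minimal log-side sketch, not a training-operator feature; the function name `parse_exited_event` is hypothetical, and the pattern assumes the message always has the form `Pod: <namespace>.<pod-name> exited with code <N>`, as in that one sample.

```python
import re

# Assumption: the event message always looks like
#   "Pod: <namespace>.<pod-name> exited with code <N>"
# as in the single sample log line from this issue.
EXITED_RE = re.compile(
    r"Pod: (?P<namespace>[^.\s]+)\.(?P<pod>\S+) exited with code (?P<code>-?\d+)"
)


def parse_exited_event(message):
    """Return (namespace, pod, exit_code), or None if the message does not match."""
    match = EXITED_RE.search(message)
    if match is None:
        return None
    return match.group("namespace"), match.group("pod"), int(match.group("code"))


# Message text taken from the log entry in this issue.
sample = (
    "reason: 'ExitedWithCode' Pod: iem-trs-training."
    "android-consume-v2-update-2024-08-20-053447-worker-0 exited with code 1"
)
print(parse_exited_event(sample))
```

A log pipeline could turn the returned tuple into labels on its own alert or counter, which would narrow a failure down to one worker without changes to the operator.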

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.

andreyvelich commented 3 weeks ago

Thank you for creating this @SecretSun! Are you looking to improve the Failed message in TFJob events: https://github.com/kubeflow/training-operator/blob/master/pkg/controller.v1/tensorflow/tfjob_controller.go#L501-L503 ?

/area monitoring
/remove-label lifecycle/needs-triage

SecretSun commented 1 week ago

content="{'filename':'record/event.go:221','level':'info','msg':'Event(v1.ObjectReference{Kind:\'TFJob\', Namespace:\'iem-trs-training\', Name:\'android-uni-item-eval-2024-09-09-043604\', UID:\'f31ff823-5904-41cb-b490-e368426e8061\', APIVersion:\'kubeflow.org/v1\', ResourceVersion:\'906983038\', FieldPath:\'\'}): type: 'Normal' reason: 'TFJobFailed' TFJob android-uni-item-eval-2024-09-09-043604 has failed because 1 Worker replica(s) failed.','time':'2024-09-09T20:52:45Z'}"

Perhaps my meaning was not clear. The TFJobFailed message above only says that 1 Worker replica failed; can it identify the specific worker that failed?
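To illustrate the gap: the `TFJobFailed` message names only the job, while the `ExitedWithCode` message names the pod. A rough sketch of joining the two on the log side is shown below. It is not part of the operator; `failed_pods_by_job` is a hypothetical helper, the sample messages use a made-up job `demo-job`, and the code assumes pod names follow `<job-name>-<replica-type>-<index>`.

```python
import re
from collections import defaultdict

EXITED_RE = re.compile(r"Pod: [^.\s]+\.(?P<pod>\S+) exited with code (?P<code>-?\d+)")
FAILED_RE = re.compile(r"TFJob (?P<job>\S+) has failed because")
# Assumption: pod names follow "<job-name>-<replica-type>-<index>".
POD_NAME_RE = re.compile(r"^(?P<job>.+)-(?P<rtype>[a-z]+)-(?P<index>\d+)$")


def failed_pods_by_job(messages):
    """Map each failed TFJob name to the (pod, exit_code) pairs seen for it."""
    exited = defaultdict(list)
    failed_jobs = []
    for msg in messages:
        hit = EXITED_RE.search(msg)
        if hit:
            pod = POD_NAME_RE.match(hit.group("pod"))
            if pod:
                exited[pod.group("job")].append(
                    (hit.group("pod"), int(hit.group("code")))
                )
            continue
        hit = FAILED_RE.search(msg)
        if hit:
            failed_jobs.append(hit.group("job"))
    # Jobs with no matching pod event map to an empty list.
    return {job: exited.get(job, []) for job in failed_jobs}


# Hypothetical messages in the same shape as the two log entries in this issue.
messages = [
    "reason: 'ExitedWithCode' Pod: demo-ns.demo-job-worker-1 exited with code 137",
    "reason: 'TFJobFailed' TFJob demo-job has failed because 1 Worker replica(s) failed.",
]
print(failed_pods_by_job(messages))
```

This only works when both events reach the same log stream, which is why having the failed pod in the operator's own message or metrics would be more reliable.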