kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 #1693

Closed · NettrixTobin closed this issue 1 year ago

NettrixTobin commented 1 year ago
```
root@master:~# kubectl logs -f training-operator-5cc8cdfdd6-xz5qq -n kubeflow
I1122 01:52:15.291326       1 request.go:601] Waited for 1.011932954s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/serving.kserve.io/v1alpha1?timeout=32s
1.6690819362796633e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6690819362820742e+09  INFO    setup   starting manager
1.6690819363789077e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6690819363790262e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.6690819363792255e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6690819363792121e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.669081936379304e+09   INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.669081936379264e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6690819363793213e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.669081936379366e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6690819363793771e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6690819363793895e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6690819363794024e+09  INFO    Starting Controller     {"controller": "pytorchjob-controller"}
1.6690819363793309e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.669081936379433e+09   INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6690819363794458e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.6690819363794565e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6690819363793914e+09  INFO    Starting Controller     {"controller": "mxjob-controller"}
1.6690819363794715e+09  INFO    Starting Controller     {"controller": "mpijob-controller"}
1.6690819363794422e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.6690819363794968e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.669081936379517e+09   INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.6690819363795297e+09  INFO    Starting Controller     {"controller": "tfjob-controller"}
1.66908193637956e+09    INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6690819363796897e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6690819363797452e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6690819363797586e+09  INFO    Starting Controller     {"controller": "xgboostjob-controller"}
I1122 01:52:17.578664       1 trace.go:205] Trace[1095108423]: "DeltaFIFO Pop Process" ID:gpu-operator-resources/default,Depth:59,Reason:slow event handlers blocking the queue (22-Nov-2022 01:52:17.182) (total time: 295ms):
Trace[1095108423]: [295.716397ms] [295.716397ms] END
I1122 01:52:17.679340       1 trace.go:205] Trace[935737529]: "DeltaFIFO Pop Process" ID:kube-system/token-cleaner,Depth:58,Reason:slow event handlers blocking the queue (22-Nov-2022 01:52:17.578) (total time: 100ms):
```
kuizhiqing commented 1 year ago

Is there any crash log? Maybe try `kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow`.
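For reference, the restart reason and recent events can also be inspected directly; a minimal sketch, reusing the pod name from above:

```sh
# Events plus the container's last termination state (reason, exit code, signal)
kubectl -n kubeflow describe pod training-operator-5cc8cdfdd6-xz5qq

# Only the events related to this pod
kubectl -n kubeflow get events \
  --field-selector involvedObject.name=training-operator-5cc8cdfdd6-xz5qq
```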

NettrixTobin commented 1 year ago

@kuizhiqing, I gave it a try and the output is as follows:

```
root@master:~# kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow
I1122 03:15:14.029800       1 request.go:601] Waited for 1.049849815s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/messaging.knative.dev/v1?timeout=32s
1.6690869151348069e+09  INFO    controller-runtime.metrics      Metrics server is starting to listen    {"addr": ":8080"}
1.6690869152808545e+09  INFO    setup   starting manager
1.6690869152815266e+09  INFO    Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6690869152815475e+09  INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.669086915281704e+09   INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.PyTorchJob"}
1.6690869152818277e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Pod"}
1.6690869152818406e+09  INFO    Starting EventSource    {"controller": "pytorchjob-controller", "source": "kind source: *v1.Service"}
1.6690869152818534e+09  INFO    Starting Controller     {"controller": "pytorchjob-controller"}
1.6690869152818773e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.XGBoostJob"}
1.6690869152819426e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.TFJob"}
1.6690869152819917e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Pod"}
1.6690869152820137e+09  INFO    Starting EventSource    {"controller": "xgboostjob-controller", "source": "kind source: *v1.Service"}
1.6690869152820244e+09  INFO    Starting Controller     {"controller": "xgboostjob-controller"}
1.6690869152819967e+09  INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Pod"}
1.669086915282056e+09   INFO    Starting EventSource    {"controller": "tfjob-controller", "source": "kind source: *v1.Service"}
1.6690869152820652e+09  INFO    Starting Controller     {"controller": "tfjob-controller"}
1.6690869152821472e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.MPIJob"}
1.6690869152822428e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Pod"}
1.6690869152822747e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ConfigMap"}
1.6690869152822936e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.Role"}
1.6690869152823093e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.RoleBinding"}
1.66908691528229e+09    INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.MXJob"}
1.6690869152823222e+09  INFO    Starting EventSource    {"controller": "mpijob-controller", "source": "kind source: *v1.ServiceAccount"}
1.6690869152823365e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Pod"}
1.669086915282339e+09   INFO    Starting Controller     {"controller": "mpijob-controller"}
1.6690869152823498e+09  INFO    Starting EventSource    {"controller": "mxjob-controller", "source": "kind source: *v1.Service"}
1.6690869152823596e+09  INFO    Starting Controller     {"controller": "mxjob-controller"}
I1122 03:15:16.378733       1 trace.go:205] Trace[535296393]: "DeltaFIFO Pop Process" ID:kubeflow/argo-role,Depth:15,Reason:slow event handlers blocking the queue (22-Nov-2022 03:15:16.081) (total time: 297ms):
Trace[535296393]: [297.332355ms] [297.332355ms] END
```

The other pods are running fine:

```
root@master:~# kubectl get po -A |grep kubeflow
kubeflow-user-example-com   ml-pipeline-ui-artifact-76474bc75f-w9qcx                 2/2   Running            4 (12h ago)       17h
kubeflow-user-example-com   ml-pipeline-visualizationserver-85f989dbfc-sbmq6         2/2   Running            4 (12h ago)       17h
kubeflow                    admission-webhook-deployment-bb7c6b4d6-hkrj7             1/1   Running            1 (12h ago)       19h
kubeflow                    cache-server-59bf8ff85d-wphp6                            2/2   Running            7 (12h ago)       19h
kubeflow                    centraldashboard-8dc67db66-c79wv                         2/2   Running            8 (12h ago)       19h
kubeflow                    jupyter-web-app-deployment-59c6bc85cc-nwk9r              1/1   Running            2 (12h ago)       19h
kubeflow                    katib-controller-6478fbd64c-hjqhh                        1/1   Running            3 (12h ago)       19h
kubeflow                    katib-db-manager-78fc8b7895-hhpdf                        1/1   Running            22 (12h ago)      19h
kubeflow                    katib-mysql-6975d6c6c4-m5rdq                             1/1   Running            2 (12h ago)       19h
kubeflow                    katib-ui-5cb6cc4d97-82tvk                                1/1   Running            5 (12h ago)       19h
kubeflow                    kserve-controller-manager-0                              2/2   Running            7 (12h ago)       19h
kubeflow                    kserve-models-web-app-5454bfdb86-h92kp                   2/2   Running            7 (12h ago)       19h
kubeflow                    kubeflow-pipelines-profile-controller-5b8474b7bc-msfl7   1/1   Running            2 (12h ago)       19h
kubeflow                    metacontroller-0                                         1/1   Running            1 (12h ago)       19h
kubeflow                    metadata-envoy-deployment-6c6f8c6c59-r7sz5               1/1   Running            4 (12h ago)       19h
kubeflow                    metadata-grpc-deployment-679b49cc95-hhcjg                2/2   Running            22 (12h ago)      19h
kubeflow                    metadata-writer-d6567ddf6-8zkq4                          2/2   Running            15 (12h ago)      19h
kubeflow                    minio-7955cfc9fc-v2vn4                                   2/2   Running            2 (12h ago)       19h
kubeflow                    ml-pipeline-5d6f7c985c-pczs7                             2/2   Running            22 (12h ago)      19h
kubeflow                    ml-pipeline-persistenceagent-5544dd8bf4-8x5tx            2/2   Running            8 (12h ago)       19h
kubeflow                    ml-pipeline-scheduledworkflow-7d464d85bf-4cn9q           2/2   Running            8 (12h ago)       19h
kubeflow                    ml-pipeline-ui-6576d6ddcb-xvng5                          2/2   Running            8 (12h ago)       19h
kubeflow                    ml-pipeline-viewer-crd-59b9f99f9b-25f9c                  2/2   Running            9 (12h ago)       19h
kubeflow                    ml-pipeline-visualizationserver-7c7464896f-hn7zn         2/2   Running            8 (12h ago)       19h
kubeflow                    mysql-75f4964b48-x557v                                   2/2   Running            2 (12h ago)       19h
kubeflow                    notebook-controller-deployment-68f88d5479-xhmd6          2/2   Running            4 (12h ago)       19h
kubeflow                    profiles-deployment-6d754c7bc7-fcbjq                     3/3   Running            5 (12h ago)       19h
kubeflow                    tensorboard-controller-deployment-6d67f8bfff-xhn5g       3/3   Running            7 (12h ago)       19h
kubeflow                    tensorboards-web-app-deployment-8446c8f5b5-4zc4h         1/1   Running            1 (12h ago)       19h
kubeflow                    training-operator-5cc8cdfdd6-xz5qq                       0/1   CrashLoopBackOff   177 (4m28s ago)   14h
kubeflow                    volumes-web-app-deployment-b579747b4-8mqv2               1/1   Running            1 (12h ago)       19h
kubeflow                    workflow-controller-555f64865-66tsm                      2/2   Running            14 (12h ago)      19h
```

johnugeorge commented 1 year ago

@NettrixTobin Any other interesting logs? Is it a resource limits issue?

NettrixTobin commented 1 year ago

> @NettrixTobin Any other interesting logs? Is it a resource limits issue?

@johnugeorge Could you please give me some ideas or instructions? I did not find errors in any other pod.

johnugeorge commented 1 year ago

Is it an out-of-memory issue? I am not seeing any issues in the logs.
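A quick way to check for an OOM kill; a sketch that assumes the deployment is named `training-operator`, as in the default Kubeflow manifests:

```sh
# If the container was OOM-killed, the reason is recorded in lastState
kubectl -n kubeflow get pod training-operator-5cc8cdfdd6-xz5qq \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'

# Compare against the requests/limits configured on the deployment
kubectl -n kubeflow get deploy training-operator \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'
```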

kuizhiqing commented 1 year ago

@NettrixTobin Maybe you can run the controller locally, i.e. compile and run cmd/training-operator.v1/main.go against your local kubeconfig, or just run make run. That rules out RBAC or config/version mismatch issues.
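A rough sketch of that; the release tag below is an assumption, adjust it to whatever shipped with your Kubeflow install:

```sh
# Run the controller out-of-cluster against the current kubeconfig context
git clone https://github.com/kubeflow/training-operator.git
cd training-operator
git checkout v1.5.0   # assumption: tag matching Kubeflow 1.6.x; pick your release
make run              # or: go run cmd/training-operator.v1/main.go
```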

caffeinism commented 1 year ago

I increased the resource limits and it worked.
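For anyone hitting the same thing, one way to bump the requests/limits on the deployment; the values below are illustrative, not the exact ones used here:

```sh
kubectl -n kubeflow set resources deployment training-operator \
  --requests=cpu=100m,memory=512Mi \
  --limits=cpu=500m,memory=1Gi
```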

johnugeorge commented 1 year ago

Closing it as resolved.