Closed. NettrixTobin closed this issue 1 year ago.
Is there any crash log? Maybe try kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow.
@kuizhiqing I gave it a try and the output is as follows:
root@master:~# kubectl logs -p training-operator-5cc8cdfdd6-xz5qq -n kubeflow
I1122 03:15:14.029800 1 request.go:601] Waited for 1.049849815s due to client-side throttling, not priority and fairness, request: GET:https://10.96.0.1:443/apis/messaging.knative.dev/v1?timeout=32s
1.6690869151348069e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
1.6690869152808545e+09 INFO setup starting manager
1.6690869152815266e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
1.6690869152815475e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
1.669086915281704e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: v1.PyTorchJob"}
1.6690869152818277e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: v1.Pod"}
1.6690869152818406e+09 INFO Starting EventSource {"controller": "pytorchjob-controller", "source": "kind source: v1.Service"}
1.6690869152818534e+09 INFO Starting Controller {"controller": "pytorchjob-controller"}
1.6690869152818773e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: v1.XGBoostJob"}
1.6690869152819426e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: v1.TFJob"}
1.6690869152819917e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: v1.Pod"}
1.6690869152820137e+09 INFO Starting EventSource {"controller": "xgboostjob-controller", "source": "kind source: v1.Service"}
1.6690869152820244e+09 INFO Starting Controller {"controller": "xgboostjob-controller"}
1.6690869152819967e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: v1.Pod"}
1.669086915282056e+09 INFO Starting EventSource {"controller": "tfjob-controller", "source": "kind source: v1.Service"}
1.6690869152820652e+09 INFO Starting Controller {"controller": "tfjob-controller"}
1.6690869152821472e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.MPIJob"}
1.6690869152822428e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.Pod"}
1.6690869152822747e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.ConfigMap"}
1.6690869152822936e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.Role"}
1.6690869152823093e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.RoleBinding"}
1.66908691528229e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: v1.MXJob"}
1.6690869152823222e+09 INFO Starting EventSource {"controller": "mpijob-controller", "source": "kind source: v1.ServiceAccount"}
1.6690869152823365e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: v1.Pod"}
1.669086915282339e+09 INFO Starting Controller {"controller": "mpijob-controller"}
1.6690869152823498e+09 INFO Starting EventSource {"controller": "mxjob-controller", "source": "kind source: v1.Service"}
1.6690869152823596e+09 INFO Starting Controller {"controller": "mxjob-controller"}
I1122 03:15:16.378733 1 trace.go:205] Trace[535296393]: "DeltaFIFO Pop Process" ID:kubeflow/argo-role,Depth:15,Reason:slow event handlers blocking the queue (22-Nov-2022 03:15:16.081) (total time: 297ms):
Trace[535296393]: [297.332355ms] [297.332355ms] END
The other pods are working fine:
root@master:~# kubectl get po -A |grep kubeflow
kubeflow-user-example-com ml-pipeline-ui-artifact-76474bc75f-w9qcx 2/2 Running 4 (12h ago) 17h
kubeflow-user-example-com ml-pipeline-visualizationserver-85f989dbfc-sbmq6 2/2 Running 4 (12h ago) 17h
kubeflow admission-webhook-deployment-bb7c6b4d6-hkrj7 1/1 Running 1 (12h ago) 19h
kubeflow cache-server-59bf8ff85d-wphp6 2/2 Running 7 (12h ago) 19h
kubeflow centraldashboard-8dc67db66-c79wv 2/2 Running 8 (12h ago) 19h
kubeflow jupyter-web-app-deployment-59c6bc85cc-nwk9r 1/1 Running 2 (12h ago) 19h
kubeflow katib-controller-6478fbd64c-hjqhh 1/1 Running 3 (12h ago) 19h
kubeflow katib-db-manager-78fc8b7895-hhpdf 1/1 Running 22 (12h ago) 19h
kubeflow katib-mysql-6975d6c6c4-m5rdq 1/1 Running 2 (12h ago) 19h
kubeflow katib-ui-5cb6cc4d97-82tvk 1/1 Running 5 (12h ago) 19h
kubeflow kserve-controller-manager-0 2/2 Running 7 (12h ago) 19h
kubeflow kserve-models-web-app-5454bfdb86-h92kp 2/2 Running 7 (12h ago) 19h
kubeflow kubeflow-pipelines-profile-controller-5b8474b7bc-msfl7 1/1 Running 2 (12h ago) 19h
kubeflow metacontroller-0 1/1 Running 1 (12h ago) 19h
kubeflow metadata-envoy-deployment-6c6f8c6c59-r7sz5 1/1 Running 4 (12h ago) 19h
kubeflow metadata-grpc-deployment-679b49cc95-hhcjg 2/2 Running 22 (12h ago) 19h
kubeflow metadata-writer-d6567ddf6-8zkq4 2/2 Running 15 (12h ago) 19h
kubeflow minio-7955cfc9fc-v2vn4 2/2 Running 2 (12h ago) 19h
kubeflow ml-pipeline-5d6f7c985c-pczs7 2/2 Running 22 (12h ago) 19h
kubeflow ml-pipeline-persistenceagent-5544dd8bf4-8x5tx 2/2 Running 8 (12h ago) 19h
kubeflow ml-pipeline-scheduledworkflow-7d464d85bf-4cn9q 2/2 Running 8 (12h ago) 19h
kubeflow ml-pipeline-ui-6576d6ddcb-xvng5 2/2 Running 8 (12h ago) 19h
kubeflow ml-pipeline-viewer-crd-59b9f99f9b-25f9c 2/2 Running 9 (12h ago) 19h
kubeflow ml-pipeline-visualizationserver-7c7464896f-hn7zn 2/2 Running 8 (12h ago) 19h
kubeflow mysql-75f4964b48-x557v 2/2 Running 2 (12h ago) 19h
kubeflow notebook-controller-deployment-68f88d5479-xhmd6 2/2 Running 4 (12h ago) 19h
kubeflow profiles-deployment-6d754c7bc7-fcbjq 3/3 Running 5 (12h ago) 19h
kubeflow tensorboard-controller-deployment-6d67f8bfff-xhn5g 3/3 Running 7 (12h ago) 19h
kubeflow tensorboards-web-app-deployment-8446c8f5b5-4zc4h 1/1 Running 1 (12h ago) 19h
kubeflow training-operator-5cc8cdfdd6-xz5qq 0/1 CrashLoopBackOff 177 (4m28s ago) 14h
kubeflow volumes-web-app-deployment-b579747b4-8mqv2 1/1 Running 1 (12h ago) 19h
kubeflow workflow-controller-555f64865-66tsm 2/2 Running 14 (12h ago) 19h
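One hedged way to dig into why only this pod keeps restarting is to describe it and read the last terminated state and the Events section; the pod name below is the one from the listing above and will differ on other clusters:

# Show the container's "Last State" (reason, exit code) and recent events
kubectl describe pod training-operator-5cc8cdfdd6-xz5qq -n kubeflow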
@NettrixTobin Any other interesting logs? Could it be a resource limits issue?
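For reference, a quick way to see which requests/limits are currently set on the operator (assuming the deployment is named training-operator, as the pod name suggests) might be:

# Print the resources block of the first container in the deployment
kubectl get deployment training-operator -n kubeflow \
  -o jsonpath='{.spec.template.spec.containers[0].resources}'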
@johnugeorge Could you please give me some ideas or instructions? I did not find errors in any other pod.
Could it be an out-of-memory issue? I am not seeing any issues in the logs.
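A hedged way to confirm or rule out an OOM kill is to read the previous container's termination reason straight from the pod status; if it prints OOMKilled, the container hit its memory limit:

# Prints e.g. "OOMKilled" or "Error" for the previous container instance
kubectl get pod training-operator-5cc8cdfdd6-xz5qq -n kubeflow \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'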
@NettrixTobin Maybe you can run the controller locally, i.e. compile and run cmd/training-operator.v1/main.go against your local kubeconfig, or just run it with make run. This helps debug RBAC or config version mismatch issues.
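A minimal sketch of that suggestion, assuming a local clone of the kubeflow/training-operator repository and a kubeconfig that points at the affected cluster (the kubeconfig path below is only an example):

# From the root of a local clone of kubeflow/training-operator
export KUBECONFIG=$HOME/.kube/config
go run ./cmd/training-operator.v1/main.go
# or use the Makefile target mentioned above
make run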
I increased the resource limits and it worked.
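For anyone hitting the same thing, one way to raise the operator's requests/limits is kubectl set resources; the values below are illustrative, not the ones actually used here:

# Bump requests/limits on the training-operator deployment (example values)
kubectl -n kubeflow set resources deployment training-operator \
  --requests=cpu=100m,memory=256Mi \
  --limits=cpu=500m,memory=512Mi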
Closing it as resolved.