kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Back-off restarting failed container katib-controller #2440

Open qazserfv123 opened 1 month ago

qazserfv123 commented 1 month ago

What happened?

After installing the latest changes of the Katib control plane, the controller and DB manager pods crash-loop and the MySQL pod stays Pending.

Running kubectl get pod -n kubeflow gives:

root@k8master:~# kubectl get pod -n kubeflow
NAME                                READY   STATUS             RESTARTS         AGE
katib-controller-86fbb67df-5mgpx    0/1     CrashLoopBackOff   52 (4m39s ago)   5h49m
katib-db-manager-7c8745f44b-4tzm5   0/1     CrashLoopBackOff   56 (54s ago)     5h49m
katib-mysql-77b9495867-fqb5l        0/1     Pending            0                5h49m
katib-ui-5d9c77cfc4-4bfzl           1/1     Running            0                5h49m

Running kubectl describe pod katib-controller-86fbb67df-5mgpx -n kubeflow gives:

Name:             katib-controller-86fbb67df-5mgpx
Namespace:        kubeflow
Priority:         0
Service Account:  katib-controller
Node:             k8node02/192.168.100.12
Start Time:       Thu, 10 Oct 2024 02:20:03 +0000
Labels:           katib.kubeflow.org/component=controller
                  katib.kubeflow.org/metrics-collector-injection=disabled
                  pod-template-hash=86fbb67df
Annotations:      prometheus.io/port: 8080
                  prometheus.io/scrape: true
                  sidecar.istio.io/inject: false
Status:           Running
IP:               10.244.0.3
IPs:
  IP:           10.244.0.3
Controlled By:  ReplicaSet/katib-controller-86fbb67df
Containers:
  katib-controller:
    Container ID:  docker://ec8cfc87a2c33a75ae61fd2d7ac906ccf52800fb49159e6e6253f129c0fd86bf
    Image:         docker.io/kubeflowkatib/katib-controller:latest
    Image ID:      docker-pullable://kubeflowkatib/katib-controller@sha256:103962f0810467fc5f6edcb46b8343387a289dd113dce38933ab15d3b0713261
    Ports:         8443/TCP, 8080/TCP, 18080/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Command:
      ./katib-controller
    Args:
      --katib-config=/katib-config.yaml
    State:          Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 10 Oct 2024 08:10:54 +0000
      Finished:     Thu, 10 Oct 2024 08:11:24 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Thu, 10 Oct 2024 08:04:52 +0000
      Finished:     Thu, 10 Oct 2024 08:05:22 +0000
    Ready:          False
    Restart Count:  53
    Liveness:       http-get http://:healthz/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:      http-get http://:healthz/readyz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:
      KATIB_CORE_NAMESPACE:  kubeflow (v1:metadata.namespace)
    Mounts:
      /katib-config.yaml from katib-config (ro,path="katib-config.yaml")
      /tmp/cert from cert (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-s4x2k (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  katib-webhook-cert
    Optional:    false
  katib-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      katib-config
    Optional:  false
  kube-api-access-s4x2k:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                      From     Message
  ----     ------     ----                     ----     -------
  Normal   Pulled     36m (x39 over 4h20m)     kubelet  (combined from similar events): Successfully pulled image "docker.io/kubeflowkatib/katib-controller:latest" in 20.234160626s (20.234172377s including waiting)
  Warning  Unhealthy  6m18s (x261 over 4h49m)  kubelet  Readiness probe failed: HTTP probe failed with statuscode: 500
  Warning  BackOff    85s (x1164 over 4h48m)   kubelet  Back-off restarting failed container katib-controller in pod katib-controller-86fbb67df-5mgpx_kubeflow(c1cd3096-6bcc-4db2-969b-8f0ac265ae05)
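
The describe output shows only that the container exits with code 1 about 30 seconds after each start; it does not show why. Below is a minimal sketch of follow-up checks, assuming a working kubectl context. The helper name `not_ready_pods` is hypothetical (it is not part of kubectl); it filters the pod table for pods whose READY count is incomplete. The commented commands fetch the logs of the previously crashed container and the scheduling events of the Pending katib-mysql pod. That the Pending state is related to an unbound PersistentVolumeClaim is an assumption to verify, not something the output above confirms.

```shell
#!/bin/sh
# Hypothetical helper: read `kubectl get pod` output on stdin and print
# NAME and STATUS for every pod whose READY column (e.g. "0/1") shows
# fewer ready containers than the pod declares.
not_ready_pods() {
  awk 'NR > 1 { split($2, r, "/"); if (r[1] != r[2]) print $1, $3 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod -n kubeflow | not_ready_pods
#
# Follow-up checks for the pods it reports:
#   # Logs of the last crashed controller container
#   kubectl logs katib-controller-86fbb67df-5mgpx -n kubeflow --previous
#   # Scheduling events for the Pending MySQL pod
#   kubectl describe pod katib-mysql-77b9495867-fqb5l -n kubeflow
#   # Assumption: Pending may stem from an unbound PVC
#   kubectl get pvc -n kubeflow
```

The filter skips the header row and compares the two halves of the READY column, so it reports Pending pods as well as crash-looping ones.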

Thanks!

What did you expect to happen?

Running kubectl get pod -n kubeflow should show every pod as Running and ready, for example:

root@k8master:~# kubectl get pod -n kubeflow
NAME                                READY   STATUS             RESTARTS         AGE
katib-controller-86fbb67df-5mgpx    1/1     Running            52 (4m39s ago)   5h49m
katib-db-manager-7c8745f44b-4tzm5   1/1     Running            56 (54s ago)     5h49m
katib-mysql-77b9495867-fqb5l        1/1     Running            0                5h49m
katib-ui-5d9c77cfc4-4bfzl           1/1     Running            0                5h49m

Environment

Kubernetes version:

WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.0", GitCommit:"1b4df30b3cdfeaba6024e81e559a6cd09a089d65", GitTreeState:"clean", BuildDate:"2023-04-11T17:10:18Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.16", GitCommit:"cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a", GitTreeState:"clean", BuildDate:"2024-07-17T01:44:26Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}

Katib controller version: docker.io/kubeflowkatib/katib-controller:latest


Katib Python SDK version:

Name: kubeflow-katib
Version: 0.17.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /root/miniconda3/lib/python3.10/site-packages
Requires: certifi, grpcio, kubernetes, protobuf, setuptools, six, urllib3
Required-by:



Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.