shikanon / kubeflow-manifests

One-click Kubeflow installation files for users in mainland China
GNU General Public License v3.0

Not all Pods reach the Running state #20

Closed HarborZeng closed 3 years ago

HarborZeng commented 3 years ago

I created a Kubernetes cluster, version 1.17, with Rancher.

I then set up this cluster's ~/.kube/config on the control plane node so that kubectl works there, and ran python3 install.py. After waiting half an hour, I found that 7 pods were reporting errors.

They are:

authservice

admission-webhook-deployment

cache-deployer-deployment

Log: ERROR: After approving csr cache-server.kubeflow, the signed certificate did not appear on the resource. Giving up after 10 attempts.

cache-server

katib-controller

katib-db-manager

Log:

Ping to Katib db failed: dial tcp 10.43.240.163:3306: connect: no route to host

Failed to open db connection: DB open failed: Timeout waiting for DB conn successfully opened.

katib-mysql

kfserving-controller-manager

I don't know what is causing this and would appreciate your help.
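
The katib-db-manager log above reports "no route to host" for 10.43.240.163:3306. One way to check that independently is a plain TCP connection attempt from a node or pod that can reach the cluster's Service network. The following is a minimal sketch and not part of this repository; `can_connect` is a hypothetical helper name, and the address used in the comment is simply the one from the log.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the TCP handshake and raises OSError on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "no route to host", connection refused, and timeouts.
        return False


# Example call (run from inside the cluster network):
#   can_connect("10.43.240.163", 3306)
```

A False result only says that the TCP handshake failed (no route, refused, or timed out). It does not say why, so it narrows the problem to networking or to the database pod not listening yet.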

shikanon commented 3 years ago

You can run kubectl describe to see why mysql did not come up.
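
To decide which pods to pass to `kubectl describe`, it can help to filter the `kubectl get pods -A` listing down to the rows that are not healthy. The sketch below is one illustrative way to do that in Python. `unhealthy_pods` is a hypothetical helper, and it assumes the default column layout (NAMESPACE, NAME, READY, STATUS, ...).

```python
from typing import List, Tuple


def unhealthy_pods(get_pods_output: str) -> List[Tuple[str, str, str]]:
    """Return (namespace, name, status) for pods that are not fully ready.

    Expects the text printed by `kubectl get pods -A`.
    """
    results = []
    for line in get_pods_output.strip().splitlines():
        fields = line.split()
        # Skip the header row and anything too short to be a pod row.
        if len(fields) < 4 or fields[0] == "NAMESPACE":
            continue
        namespace, name, ready, status = fields[:4]
        # Pods from finished Jobs are expected to be not ready.
        if status == "Completed":
            continue
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            results.append((namespace, name, status))
    return results
```

The input can be captured with `subprocess.run(["kubectl", "get", "pods", "-A"], capture_output=True, text=True).stdout`. Each returned pod is then a candidate for `kubectl describe pod <name> -n <namespace>`.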

HarborZeng commented 3 years ago

After a few more days of investigation, only two pods still have problems: cache-server and cache-deployer-deployment.

$ kubectl get pods -A
NAMESPACE                   NAME                                                        READY   STATUS             RESTARTS   AGE
auth                        dex-6686f66f9b-m9vds                                        1/1     Running            0          142m
cattle-system               cattle-cluster-agent-69d44b7858-5ttvb                       0/1     CrashLoopBackOff   36         160m
cattle-system               cattle-cluster-agent-854998cfb7-gzmjn                       0/1     ImagePullBackOff   0          160m
cattle-system               cattle-node-agent-6rsqw                                     1/1     Running            0          157m
cattle-system               cattle-node-agent-q2g67                                     1/1     Running            0          170m
cattle-system               kube-api-auth-xww7h                                         1/1     Running            0          157m
cert-manager                cert-manager-9d5774b59-8mjc5                                1/1     Running            0          142m
cert-manager                cert-manager-cainjector-67c8c5c665-dxhdh                    1/1     Running            0          142m
cert-manager                cert-manager-webhook-75dc9757bd-txnm2                       1/1     Running            1          142m
ingress-nginx               nginx-ingress-controller-9jzpx                              1/1     Running            0          170m
ingress-nginx               nginx-ingress-controller-bscxv                              1/1     Running            0          157m
istio-system                authservice-0                                               1/1     Running            0          142m
istio-system                cluster-local-gateway-66bcf8bc5d-2nqsm                      1/1     Running            0          141m
istio-system                istio-ingressgateway-85b49c758f-gmzzs                       1/1     Running            0          141m
istio-system                istiod-5ff6cdbbcd-5x2dn                                     1/1     Running            0          141m
knative-eventing            broker-controller-5c84984b97-gv4bb                          1/1     Running            0          142m
knative-eventing            eventing-controller-54bfbd5446-qnmdk                        1/1     Running            0          142m
knative-eventing            eventing-webhook-58f56d9cf4-2cz5v                           1/1     Running            0          142m
knative-eventing            imc-controller-769896c7db-m2hkp                             1/1     Running            0          142m
knative-eventing            imc-dispatcher-86954fb4cd-bpwsz                             1/1     Running            0          142m
knative-serving             activator-75696c8c9-67sj2                                   1/1     Running            0          142m
knative-serving             autoscaler-6764f9b5c5-wmc98                                 1/1     Running            0          142m
knative-serving             controller-598fd8bfd7-c4wrs                                 1/1     Running            0          142m
knative-serving             istio-webhook-785bb58cc6-p7dcv                              1/1     Running            0          142m
knative-serving             networking-istio-77fbcfcf9b-l5gkv                           1/1     Running            0          142m
knative-serving             webhook-865f54cf5f-klcs4                                    1/1     Running            0          142m
kube-system                 coredns-6b84d75d99-2f5p4                                    1/1     Running            0          3h4m
kube-system                 coredns-6b84d75d99-rpvl7                                    1/1     Running            0          157m
kube-system                 coredns-autoscaler-5c4b6999d9-pp9xs                         1/1     Running            0          3h4m
kube-system                 kube-flannel-kjmlb                                          2/2     Running            0          157m
kube-system                 kube-flannel-vnhm7                                          2/2     Running            0          170m
kube-system                 metrics-server-7579449c57-2jqld                             1/1     Running            0          3h4m
kubeflow-user-example-com   ml-pipeline-ui-artifact-6d7ffcc4b6-rcghq                    2/2     Running            0          116m
kubeflow-user-example-com   ml-pipeline-visualizationserver-84d577b989-t49gf            2/2     Running            0          116m
kubeflow                    admission-webhook-deployment-6fb9d65887-vsf8h               1/1     Running            0          138m
kubeflow                    cache-deployer-deployment-7558d65bf4-s7bwk                  1/2     CrashLoopBackOff   19         138m
kubeflow                    cache-server-c64c68ddf-f7c9m                                0/2     Init:0/1           0          138m
kubeflow                    centraldashboard-7b7676d8bd-qt6g5                           1/1     Running            0          138m
kubeflow                    jupyter-web-app-deployment-66f74586d9-kts2c                 1/1     Running            0          98m
kubeflow                    katib-controller-77675c88df-gqp5k                           1/1     Running            0          138m
kubeflow                    katib-db-manager-646695754f-rwnpk                           1/1     Running            3          138m
kubeflow                    katib-mysql-5bb5bd9957-9zh8x                                1/1     Running            0          138m
kubeflow                    katib-ui-55fd4bd6f9-vcn6f                                   1/1     Running            0          138m
kubeflow                    kfserving-controller-manager-0                              2/2     Running            0          139m
kubeflow                    kubeflow-pipelines-profile-controller-5698bf57cf-8cvbw      1/1     Running            0          138m
kubeflow                    kubeflow-pipelines-profile-controller-5698bf57cf-tdjjm      1/1     Running            0          98m
kubeflow                    metacontroller-0                                            1/1     Running            0          139m
kubeflow                    metadata-envoy-deployment-76d65977f7-kcq7g                  1/1     Running            0          138m
kubeflow                    metadata-grpc-deployment-697d9c6c67-9t9zt                   2/2     Running            6          138m
kubeflow                    metadata-writer-58cdd57678-24gqw                            2/2     Running            2          138m
kubeflow                    minio-6d6784db95-lrr67                                      2/2     Running            0          98m
kubeflow                    ml-pipeline-85fc99f899-mwkrk                                2/2     Running            5          138m
kubeflow                    ml-pipeline-persistenceagent-65cb9594c7-hbzcx               2/2     Running            1          138m
kubeflow                    ml-pipeline-scheduledworkflow-7f8d8dfc69-c6lhl              2/2     Running            0          138m
kubeflow                    ml-pipeline-ui-5c765cc7bd-p4lmb                             2/2     Running            0          138m
kubeflow                    ml-pipeline-viewer-crd-5b8df7f458-tq2gp                     2/2     Running            1          138m
kubeflow                    ml-pipeline-visualizationserver-56c5ff68d5-stgm5            2/2     Running            0          138m
kubeflow                    mpi-operator-789f88879-v2bh7                                1/1     Running            0          138m
kubeflow                    mxnet-operator-7fff864957-5kc2w                             1/1     Running            0          138m
kubeflow                    mysql-56b554ff66-wvpg5                                      2/2     Running            0          98m
kubeflow                    notebook-controller-deployment-74d9584477-x2tpk             1/1     Running            0          138m
kubeflow                    profiles-deployment-67b4666796-js8k7                        2/2     Running            0          138m
kubeflow                    pytorch-operator-fd86f7694-8fcbc                            2/2     Running            0          138m
kubeflow                    tensorboard-controller-controller-manager-fd6bcffb4-6clhz   3/3     Running            1          138m
kubeflow                    tensorboards-web-app-deployment-78d7b8b658-chccs            1/1     Running            0          138m
kubeflow                    tf-job-operator-7bc5cf4cc7-txfbc                            1/1     Running            0          138m
kubeflow                    volumes-web-app-deployment-68fcfc9775-d9tdx                 1/1     Running            0          138m
kubeflow                    workflow-controller-5449754fb4-czlsb                        2/2     Running            2          137m
kubeflow                    xgboost-operator-deployment-5c7bfd57cc-2jrd7                2/2     Running            1          138m
local-path-storage          local-path-provisioner-5bd6f65fdf-j575f                     1/1     Running            0          147m
$ kubectl describe pod cache-deployer-deployment -n kubeflow
Name:         cache-deployer-deployment-7558d65bf4-s7bwk
Namespace:    kubeflow
Priority:     0
Node:         node1/10.102.13.9
Start Time:   Tue, 18 May 2021 15:12:33 +0800
Labels:       app=cache-deployer
              app.kubernetes.io/component=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines
              application-crd-id=kubeflow-pipelines
              istio.io/rev=default
              pod-template-hash=7558d65bf4
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kubeflow-pipelines
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: main
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istiod-ca-cert"],"ima...
Status:       Running
IP:           10.42.0.20
IPs:
  IP:           10.42.0.20
Controlled By:  ReplicaSet/cache-deployer-deployment-7558d65bf4
Init Containers:
  istio-init:
    Container ID:  docker://1ce4c15f6318caeb0fb9b258ef8bdc11e712f1d5be366584fc9805c9645a9f15
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 18 May 2021 15:13:40 +0800
      Finished:     Tue, 18 May 2021 15:13:40 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-9cbqj (ro)
Containers:
  main:
    Container ID:   docker://6ce9304d9db0e6e3d2733b45cbe8539a920dad1ecf856d39e10ae87ff90d7b2d
    Image:          registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-cache-deployer:1.5.0-rc.2-deb1e
    Image ID:       docker-pullable://registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-cache-deployer@sha256:a13d49a4bee754f221697957d8491469bf2f958bbaac3d09707f053c8a4adf83
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Tue, 18 May 2021 17:30:07 +0800
      Finished:     Tue, 18 May 2021 17:31:00 +0800
    Ready:          False
    Restart Count:  19
    Environment:
      NAMESPACE_TO_WATCH:  kubeflow (v1:metadata.namespace)
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-9cbqj (ro)
  istio-proxy:
    Container ID:  docker://9053d64fa0acc159584b39a78cf482a564fc467edad0ad893a82b34f147ce346
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      cache-deployer.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Tue, 18 May 2021 15:59:50 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    first-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      cache-deployer-deployment-7558d65bf4-s7bwk (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     main
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      cache-deployer-deployment
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/cache-deployer-deployment
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-deployer-sa-token-9cbqj (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kubeflow-pipelines-cache-deployer-sa-token-9cbqj:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubeflow-pipelines-cache-deployer-sa-token-9cbqj
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason   Age                   From                                Message
  ----     ------   ----                  ----                                -------
  Normal   Pulling  31m (x15 over 138m)   kubelet, node1  Pulling image "registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-cache-deployer:1.5.0-rc.2-deb1e"
  Warning  BackOff  2m3s (x328 over 89m)  kubelet, node1  Back-off restarting failed container
$ kubectl describe pod cache-server -n kubeflow
Name:           cache-server-c64c68ddf-f7c9m
Namespace:      kubeflow
Priority:       0
Node:           node1/10.102.13.9
Start Time:     Tue, 18 May 2021 15:12:34 +0800
Labels:         app=cache-server
                app.kubernetes.io/component=ml-pipeline
                app.kubernetes.io/name=kubeflow-pipelines
                application-crd-id=kubeflow-pipelines
                istio.io/rev=default
                pod-template-hash=c64c68ddf
                security.istio.io/tlsMode=istio
                service.istio.io/canonical-name=kubeflow-pipelines
                service.istio.io/canonical-revision=latest
Annotations:    kubectl.kubernetes.io/default-logs-container: server
                prometheus.io/path: /stats/prometheus
                prometheus.io/port: 15020
                prometheus.io/scrape: true
                sidecar.istio.io/status:
                  {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istiod-ca-cert"],"ima...
Status:         Pending
IP:             
IPs:            <none>
Controlled By:  ReplicaSet/cache-server-c64c68ddf
Init Containers:
  istio-init:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-token-7pwl7 (ro)
Containers:
  server:
    Container ID:  
    Image:         registry.cn-shenzhen.aliyuncs.com/tensorbytes/ml-pipeline-cache-server:1.5.0-rc.2-a44df
    Image ID:      
    Port:          8443/TCP
    Host Port:     0/TCP
    Args:
      --db_driver=$(DBCONFIG_DRIVER)
      --db_host=$(DBCONFIG_HOST_NAME)
      --db_port=$(DBCONFIG_PORT)
      --db_name=$(DBCONFIG_DB_NAME)
      --db_user=$(DBCONFIG_USER)
      --db_password=$(DBCONFIG_PASSWORD)
      --namespace_to_watch=$(NAMESPACE_TO_WATCH)
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      NAMESPACE_TO_WATCH:  
      CACHE_IMAGE:         <set to the key 'cacheImage' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_DRIVER:     mysql
      DBCONFIG_DB_NAME:    <set to the key 'cacheDb' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_HOST_NAME:  <set to the key 'dbHost' of config map 'pipeline-install-config'>   Optional: false
      DBCONFIG_PORT:       <set to the key 'dbPort' of config map 'pipeline-install-config'>   Optional: false
      DBCONFIG_USER:       <set to the key 'username' in secret 'mysql-secret'>                Optional: false
      DBCONFIG_PASSWORD:   <set to the key 'password' in secret 'mysql-secret'>                Optional: false
    Mounts:
      /etc/webhook/certs from webhook-tls-certs (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-token-7pwl7 (ro)
  istio-proxy:
    Container ID:  
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      cache-server.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    first-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      cache-server-c64c68ddf-f7c9m (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"webhook-api","containerPort":8443,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     server
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      cache-server
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/cache-server
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kubeflow-pipelines-cache-token-7pwl7 (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  webhook-tls-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webhook-server-tls
    Optional:    false
  kubeflow-pipelines-cache-token-7pwl7:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  kubeflow-pipelines-cache-token-7pwl7
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason       Age                    From                                Message
  ----     ------       ----                   ----                                -------
  Warning  FailedMount  49m (x5 over 124m)     kubelet, node1  Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-data istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert]: timed out waiting for the condition
  Warning  FailedMount  24m (x17 over 133m)    kubelet, node1  Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-podinfo]: timed out waiting for the condition
  Warning  FailedMount  19m (x8 over 121m)     kubelet, node1  Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7 webhook-tls-certs istiod-ca-cert istio-data]: timed out waiting for the condition
  Warning  FailedMount  9m38s (x72 over 139m)  kubelet, node1  MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
  Warning  FailedMount  3m58s (x10 over 130m)  kubelet, node1  Unable to attach or mount volumes: unmounted volumes=[webhook-tls-certs], unattached volumes=[webhook-tls-certs istiod-ca-cert istio-data istio-envoy istio-podinfo kubeflow-pipelines-cache-token-7pwl7]: timed out waiting for the condition

kubernetes 1.17.17
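
The cache-server events above show the pod waiting on the volume webhook-tls-certs because the secret webhook-server-tls was not found. This would be consistent with the earlier cache-deployer error about the signed certificate never appearing, although the thread does not confirm that link. As a small illustration, the sketch below pulls such missing secret names out of `kubectl describe` text. `missing_secrets` is a hypothetical helper, and the pattern only matches the kubelet message wording seen in this thread.

```python
import re
from typing import List

# Matches kubelet messages such as: secret "webhook-server-tls" not found
_MISSING_SECRET = re.compile(r'secret "([^"]+)" not found')


def missing_secrets(describe_output: str) -> List[str]:
    """Return secret names reported as not found, in order of first appearance."""
    seen: List[str] = []
    for name in _MISSING_SECRET.findall(describe_output):
        if name not in seen:
            seen.append(name)
    return seen
```

Each name returned can then be checked with `kubectl get secret <name> -n kubeflow` to confirm whether it exists.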

HarborZeng commented 3 years ago
kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM                          STORAGECLASS   REASON   AGE
pvc-4ab26a02-4566-4574-bed6-499f922898a1   20Gi       RWO            Delete           Bound    kubeflow/minio-pvc             local-path              96m
pvc-578440fd-f835-4820-bf8a-bd4f2ed836d3   10Gi       RWO            Delete           Bound    istio-system/authservice-pvc   local-path              152m
pvc-5b3e06ac-8ba9-4af8-afae-2201fbb47b9a   10Gi       RWO            Delete           Bound    kubeflow/katib-mysql           local-path              148m
pvc-816d2df6-835c-4ac7-a8b7-e246d68107dd   20Gi       RWO            Delete           Bound    kubeflow/mysql-pv-claim        local-path              97m
kubectl get certs -A
NAMESPACE   NAME                     READY   SECRET                          AGE
kubeflow    admission-webhook-cert   True    webhook-certs                   110m
kubeflow    katib-webhook-cert       True    katib-webhook-cert              110m
kubeflow    serving-cert             True    kfserving-webhook-server-cert   110m
kubectl get secret -A
NAMESPACE                   NAME                                                 TYPE                                  DATA   AGE
auth                        default-token-dr667                                  kubernetes.io/service-account-token   3      155m
auth                        dex-oidc-client                                      Opaque                                2      155m
auth                        dex-token-xg5nb                                      kubernetes.io/service-account-token   3      155m
cattle-system               cattle-credentials-e9faa6f                           Opaque                                3      5h27m
cattle-system               cattle-token-tgmq6                                   kubernetes.io/service-account-token   3      5h27m
cattle-system               default-token-6hp5w                                  kubernetes.io/service-account-token   3      5h27m
cattle-system               kontainer-engine-token-677kr                         kubernetes.io/service-account-token   3      5h27m
cert-manager                cert-manager-cainjector-token-gbgnh                  kubernetes.io/service-account-token   3      155m
cert-manager                cert-manager-token-mmttc                             kubernetes.io/service-account-token   3      155m
cert-manager                cert-manager-webhook-ca                              kubernetes.io/tls                     3      155m
cert-manager                cert-manager-webhook-tls                             kubernetes.io/tls                     3      155m
cert-manager                cert-manager-webhook-token-cwn4z                     kubernetes.io/service-account-token   3      155m
cert-manager                default-token-sf6gn                                  kubernetes.io/service-account-token   3      155m
default                     default-token-bhp26                                  kubernetes.io/service-account-token   3      5h28m
ingress-nginx               default-token-w62jx                                  kubernetes.io/service-account-token   3      5h27m
ingress-nginx               nginx-ingress-serviceaccount-token-56f7d             kubernetes.io/service-account-token   3      5h27m
istio-system                cluster-local-gateway-service-account-token-vwfcd    kubernetes.io/service-account-token   3      154m
istio-system                default-token-vn9cc                                  kubernetes.io/service-account-token   3      155m
istio-system                istio-ca-secret                                      istio.io/ca-root                      5      151m
istio-system                istio-ingressgateway-service-account-token-txqwn     kubernetes.io/service-account-token   3      155m
istio-system                istio-reader-service-account-token-xnw4g             kubernetes.io/service-account-token   3      155m
istio-system                istiod-service-account-token-b6k42                   kubernetes.io/service-account-token   3      155m
istio-system                oidc-authservice-client                              Opaque                                2      155m
knative-eventing            default-token-v8jpw                                  kubernetes.io/service-account-token   3      155m
knative-eventing            eventing-controller-token-xwklj                      kubernetes.io/service-account-token   3      155m
knative-eventing            eventing-webhook-certs                               Opaque                                3      155m
knative-eventing            eventing-webhook-token-nf9lq                         kubernetes.io/service-account-token   3      155m
knative-eventing            imc-controller-token-lm8cc                           kubernetes.io/service-account-token   3      155m
knative-eventing            imc-dispatcher-token-q2m8l                           kubernetes.io/service-account-token   3      155m
knative-eventing            pingsource-jobrunner-token-v2mz9                     kubernetes.io/service-account-token   3      155m
knative-serving             controller-token-bpfct                               kubernetes.io/service-account-token   3      155m
knative-serving             default-token-r74bf                                  kubernetes.io/service-account-token   3      155m
knative-serving             istio-webhook-certs                                  Opaque                                3      155m
knative-serving             webhook-certs                                        Opaque                                3      155m
kube-node-lease             default-token-xm4q5                                  kubernetes.io/service-account-token   3      5h28m
kube-public                 default-token-ds2jt                                  kubernetes.io/service-account-token   3      5h28m
kube-system                 attachdetach-controller-token-2fgjg                  kubernetes.io/service-account-token   3      5h28m
kube-system                 certificate-controller-token-r2rvx                   kubernetes.io/service-account-token   3      5h28m
kube-system                 clusterrole-aggregation-controller-token-sxrfk       kubernetes.io/service-account-token   3      5h28m
kube-system                 coredns-autoscaler-token-bsbrs                       kubernetes.io/service-account-token   3      5h27m
kube-system                 coredns-token-vxfqb                                  kubernetes.io/service-account-token   3      5h27m
kube-system                 cronjob-controller-token-62lqw                       kubernetes.io/service-account-token   3      5h28m
kube-system                 daemon-set-controller-token-24t97                    kubernetes.io/service-account-token   3      5h28m
kube-system                 default-token-56s4x                                  kubernetes.io/service-account-token   3      5h28m
kube-system                 deployment-controller-token-6bjkh                    kubernetes.io/service-account-token   3      5h28m
kube-system                 disruption-controller-token-klcws                    kubernetes.io/service-account-token   3      5h28m
kube-system                 endpoint-controller-token-vxvjs                      kubernetes.io/service-account-token   3      5h28m
kube-system                 expand-controller-token-29hz5                        kubernetes.io/service-account-token   3      5h28m
kube-system                 flannel-token-np54t                                  kubernetes.io/service-account-token   3      5h27m
kube-system                 generic-garbage-collector-token-gpnr4                kubernetes.io/service-account-token   3      5h28m
kube-system                 horizontal-pod-autoscaler-token-9fpnh                kubernetes.io/service-account-token   3      5h28m
kube-system                 job-controller-token-s96hf                           kubernetes.io/service-account-token   3      5h28m
kube-system                 metrics-server-token-wx5gf                           kubernetes.io/service-account-token   3      5h27m
kube-system                 namespace-controller-token-9gcx8                     kubernetes.io/service-account-token   3      5h28m
kube-system                 node-controller-token-5tzf7                          kubernetes.io/service-account-token   3      5h28m
kube-system                 persistent-volume-binder-token-559jd                 kubernetes.io/service-account-token   3      5h28m
kube-system                 pod-garbage-collector-token-268jt                    kubernetes.io/service-account-token   3      5h28m
kube-system                 pv-protection-controller-token-7zzxj                 kubernetes.io/service-account-token   3      5h28m
kube-system                 pvc-protection-controller-token-58h6j                kubernetes.io/service-account-token   3      5h28m
kube-system                 replicaset-controller-token-8mw42                    kubernetes.io/service-account-token   3      5h28m
kube-system                 replication-controller-token-4zcqw                   kubernetes.io/service-account-token   3      5h28m
kube-system                 resourcequota-controller-token-rrm7j                 kubernetes.io/service-account-token   3      5h28m
kube-system                 rke-job-deployer-token-l7llh                         kubernetes.io/service-account-token   3      5h28m
kube-system                 rke-job-deployer-token-zpcnl                         kubernetes.io/service-account-token   3      169m
kube-system                 service-account-controller-token-lcxkt               kubernetes.io/service-account-token   3      5h28m
kube-system                 service-controller-token-qsdbl                       kubernetes.io/service-account-token   3      5h28m
kube-system                 statefulset-controller-token-8xt4g                   kubernetes.io/service-account-token   3      5h28m
kube-system                 ttl-controller-token-kps8h                           kubernetes.io/service-account-token   3      5h28m
kubeflow-user-example-com   default-editor-token-2jn5x                           kubernetes.io/service-account-token   3      129m
kubeflow-user-example-com   default-token-85ff8                                  kubernetes.io/service-account-token   3      129m
kubeflow-user-example-com   default-viewer-token-pzhq8                           kubernetes.io/service-account-token   3      129m
kubeflow-user-example-com   mlpipeline-minio-artifact                            Opaque                                2      129m
kubeflow                    admission-webhook-service-account-token-vcbnz        kubernetes.io/service-account-token   3      154m
kubeflow                    argo-token-fvvmc                                     kubernetes.io/service-account-token   3      154m
kubeflow                    centraldashboard-token-dvxd9                         kubernetes.io/service-account-token   3      154m
kubeflow                    default-token-fcc5w                                  kubernetes.io/service-account-token   3      154m
kubeflow                    jupyter-web-app-service-account-token-cgkpr          kubernetes.io/service-account-token   3      154m
kubeflow                    katib-controller-token-rn42w                         kubernetes.io/service-account-token   3      154m
kubeflow                    katib-mysql-secrets                                  Opaque                                1      154m
kubeflow                    katib-ui-token-gf6qc                                 kubernetes.io/service-account-token   3      154m
kubeflow                    katib-webhook-cert                                   kubernetes.io/tls                     3      111m
kubeflow                    kfserving-webhook-server-cert                        kubernetes.io/tls                     3      111m
kubeflow                    kfserving-webhook-server-secret                      Opaque                                0      154m
kubeflow                    kubeflow-pipelines-cache-deployer-sa-token-9cbqj     kubernetes.io/service-account-token   3      154m
kubeflow                    kubeflow-pipelines-cache-token-7pwl7                 kubernetes.io/service-account-token   3      154m
kubeflow                    kubeflow-pipelines-container-builder-token-nghbx     kubernetes.io/service-account-token   3      154m
kubeflow                    kubeflow-pipelines-metadata-writer-token-bk84c       kubernetes.io/service-account-token   3      154m
kubeflow                    kubeflow-pipelines-viewer-token-qhmst                kubernetes.io/service-account-token   3      154m
kubeflow                    meta-controller-service-token-465dk                  kubernetes.io/service-account-token   3      154m
kubeflow                    metadata-grpc-server-token-wd2q5                     kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-persistenceagent-token-ppp6v             kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-scheduledworkflow-token-4tjvm            kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-token-xgnqr                              kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-ui-token-4fbg6                           kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-viewer-crd-service-account-token-vq78r   kubernetes.io/service-account-token   3      154m
kubeflow                    ml-pipeline-visualizationserver-token-9p5wh          kubernetes.io/service-account-token   3      154m
kubeflow                    mlpipeline-minio-artifact                            Opaque                                2      154m
kubeflow                    mpi-operator-token-t5pc2                             kubernetes.io/service-account-token   3      154m
kubeflow                    mxnet-operator-token-m8b5k                           kubernetes.io/service-account-token   3      154m
kubeflow                    mysql-secret                                         Opaque                                2      154m
kubeflow                    mysql-token-dlmfc                                    kubernetes.io/service-account-token   3      154m
kubeflow                    notebook-controller-service-account-token-wrs8f      kubernetes.io/service-account-token   3      154m
kubeflow                    pipeline-runner-token-4w59r                          kubernetes.io/service-account-token   3      154m
kubeflow                    profiles-controller-service-account-token-j966m      kubernetes.io/service-account-token   3      154m
kubeflow                    pytorch-operator-token-l2xk4                         kubernetes.io/service-account-token   3      154m
kubeflow                    tensorboard-controller-token-ffwrg                   kubernetes.io/service-account-token   3      154m
kubeflow                    tensorboards-web-app-service-account-token-fmwbr     kubernetes.io/service-account-token   3      154m
kubeflow                    tf-job-operator-token-wbwvx                          kubernetes.io/service-account-token   3      154m
kubeflow                    volumes-web-app-service-account-token-4khfs          kubernetes.io/service-account-token   3      154m
kubeflow                    webhook-certs                                        kubernetes.io/tls                     3      111m
kubeflow                    xgboost-operator-service-account-token-gmj65         kubernetes.io/service-account-token   3      154m
local-path-storage          default-token-8np4s                                  kubernetes.io/service-account-token   3      160m
local-path-storage          local-path-provisioner-service-account-token-5bl2n   kubernetes.io/service-account-token   3      160m
security-scan               default-token-p7mnr                                  kubernetes.io/service-account-token   3      5h27m
shikanon commented 3 years ago

@HarborZeng this looks like a mutatingwebhookconfigurations problem. You can check with this command:

$ kubectl get mutatingwebhookconfigurations -A
NAME                                               WEBHOOKS   AGE
admission-webhook-mutating-webhook-configuration   1          23h
cache-webhook-kubeflow                             1          23h
cert-manager-webhook                               1          23h
inferenceservice.serving.kubeflow.org              3          23h
istio-sidecar-injector                             1          30d
katib.kubeflow.org                                 2          23h
sinkbindings.webhook.sources.knative.dev           1          23h
webhook.eventing.knative.dev                       1          23h
webhook.istio.networking.internal.knative.dev      1          23h
webhook.serving.knative.dev                        1          23h

If cache-webhook-kubeflow is in your list, see this issue: https://github.com/kubeflow/pipelines/issues/3815#issuecomment-643651401

shikanon commented 3 years ago

You can also read this patch: https://github.com/kubeflow/pipelines/pull/3992/commits/2789657496a296c3275f92a8492f50423d7ed13f

If cache-webhook-kubeflow is present in the mutatingwebhookconfigurations but webhook-tls-certs is missing from the secrets, delete cache-webhook-kubeflow and reinstall:

kubectl delete mutatingwebhookconfigurations cache-webhook-kubeflow
kubectl delete -f manifest1.3/
python install.py
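
For anyone who wants to script that check, here is a minimal Python sketch. It is not part of install.py. It takes the text printed by `kubectl get mutatingwebhookconfigurations` and `kubectl get secrets -A` and reports whether the situation described above applies. The names `kubectl_output` and `needs_cache_webhook_cleanup` and the default `kubeflow` namespace are my own assumptions; the webhook and secret names are the ones mentioned above.

```python
import subprocess


def kubectl_output(*args):
    """Run kubectl with the given arguments and return its stdout as text."""
    result = subprocess.run(
        ["kubectl", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def needs_cache_webhook_cleanup(webhooks_text, secrets_text, namespace="kubeflow"):
    """Return True when cache-webhook-kubeflow is registered but the
    webhook-tls-certs secret is absent from `namespace` (assumed: kubeflow)."""
    # First column of `kubectl get mutatingwebhookconfigurations` is the name.
    # The header row ("NAME ...") is harmless because it never matches.
    webhooks = {
        line.split()[0] for line in webhooks_text.splitlines() if line.strip()
    }
    # `kubectl get secrets -A` prints NAMESPACE in column 1 and NAME in column 2.
    secrets = set()
    for line in secrets_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == namespace:
            secrets.add(fields[1])
    return "cache-webhook-kubeflow" in webhooks and "webhook-tls-certs" not in secrets


# Usage against a live cluster (requires kubectl on PATH):
#   stale = needs_cache_webhook_cleanup(
#       kubectl_output("get", "mutatingwebhookconfigurations"),
#       kubectl_output("get", "secrets", "-A"),
#   )
```

If it returns True, the delete-and-reinstall commands above are the ones to run; if it returns False, the cause is probably something else.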
HarborZeng commented 3 years ago

@shikanon My mutatingwebhookconfigurations do not include cache-webhook-kubeflow:

$ kubectl get mutatingwebhookconfigurations -A
NAME                                               CREATED AT
admission-webhook-mutating-webhook-configuration   2021-05-20T03:03:36Z
cert-manager-webhook                               2021-05-20T03:02:30Z
inferenceservice.serving.kubeflow.org              2021-05-20T03:03:29Z
istio-sidecar-injector                             2021-05-20T03:02:36Z
katib.kubeflow.org                                 2021-05-20T03:03:32Z
sinkbindings.webhook.sources.knative.dev           2021-05-20T03:02:57Z
webhook.eventing.knative.dev                       2021-05-20T03:02:57Z
webhook.istio.networking.internal.knative.dev      2021-05-20T03:02:52Z
webhook.serving.knative.dev                        2021-05-20T03:02:52Z

I did some more research and eventually found the cause here. After re-running python install.py, the pods finally started:

kubeflow                    admission-webhook-deployment-54cf94d964-8qsh2               1/1     Running     0          47m
kubeflow                    cache-deployer-deployment-65cd55d4d9-d6dzd                  2/2     Running     11         47m
kubeflow                    cache-server-f85c69486-rgzq6                                2/2     Running     0          47m
kubeflow                    centraldashboard-7b7676d8bd-w5jw6                           1/1     Running     0          50m
WMeng1 commented 3 years ago

May I ask how you got the database-related pods to start? My katib-db-manager and katib-mysql won't come up. The error messages are as follows:

katib-db-manager:

E0827 00:35:04.510080 1 mysql.go:78] Ping to Katib db failed: dial tcp 10.96.12.87:3306: connect: connection refused
E0827 00:35:09.467185 1 mysql.go:78] Ping to Katib db failed: dial tcp 10.96.12.87:3306: connect: connection refused
E0827 00:35:14.467236 1 mysql.go:78] Ping to Katib db failed: dial tcp 10.96.12.87:3306: connect: connection refused
E0827 00:35:19.466911 1 mysql.go:78] Ping to Katib db failed: dial tcp 10.96.12.87:3306: connect: connection refused

katib-mysql:

mysqld: Table 'mysql.plugin' doesn't exist
2021-08-27T00:38:50.580357Z 0 [ERROR] [MY-010735] [Server] Could not open the mysql.plugin table. Please perform the MySQL upgrade procedure.
2021-08-27T00:38:50.581584Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.582534Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.583506Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.584457Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.585447Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.588551Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables
2021-08-27T00:38:50.589593Z 0 [Warning] [MY-010441] [Server] Failed to open optimizer cost constant tables

HarborZeng commented 3 years ago

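@WMeng1 Honestly, it came down to luck. Whenever it failed I deleted everything and started over, and eventually one attempt succeeded. I later found, though, that even when the database pod fails to start, the notebook server still works.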

shikanon commented 3 years ago

@HarborZeng @WMeng1 The database pod is implemented very simply; see this YAML: https://github.com/shikanon/kubeflow-manifests/blob/50ee9f1e0aef5f69620db89c9ae2f81c9b2d96e3/manifest1.3/019-katib-installs-katib-with-kubeflow-cert-manager.yaml#L620 It mounts a data volume and sets a username and password. You can also start your own deployment with the same name to replace it.

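As long as you make sure the PVCs are deleted when you uninstall, there should be no problem. Run

kubectl get pvc -A

to check whether all the related PVCs have been removed.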
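
If you would rather check this from a script before reinstalling, the following is a rough Python sketch under a few assumptions. The helper name `leftover_pvcs` and the default `kubeflow` namespace are hypothetical choices of mine. The function only parses the text that `kubectl get pvc -A` prints, whose first two columns are NAMESPACE and NAME, and the sample row below is made up purely for illustration.

```python
def leftover_pvcs(pvc_text, namespace="kubeflow"):
    """Return the names of PVCs that still exist in `namespace`,
    given the text output of `kubectl get pvc -A`."""
    leftovers = []
    for line in pvc_text.splitlines():
        fields = line.split()
        # Column 1 is NAMESPACE, column 2 is NAME; the header row and
        # rows from other namespaces simply do not match.
        if len(fields) >= 2 and fields[0] == namespace:
            leftovers.append(fields[1])
    return leftovers


# Made-up sample output, for illustration only:
sample = (
    "NAMESPACE   NAME          STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE\n"
    "kubeflow    example-pvc   Bound    pvc-0000   10Gi       RWO            local-path     5m\n"
)
remaining = leftover_pvcs(sample)
if remaining:
    print("PVCs still present:", ", ".join(remaining))
```

An empty result means nothing is left in that namespace, so a fresh python install.py should start from clean volumes.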