kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

ML-Pipelines API Server and Metadata Writer in CrashLoopBackoff #6121

Closed ReggieCarey closed 6 months ago

ReggieCarey commented 3 years ago

What steps did you take:

I deployed Kubeflow 1.3 using the manifests approach, then repaired an issue with dex running on K8s v1.21.

What happened:

The installation succeeded, and all pods started up except two: both the Metadata Writer and ml-pipeline crash constantly and are restarted. ML-Pipeline always reports 1 of 2 containers running. Metadata-writer sometimes appears to be fully running and then fails. No other Kubeflow pods are having problems like this; even the mysql pod seems stable. I can only assume the failure of the metadata writer is due to a continued failure in the ml-pipeline api-server.

The pod keeps getting terminated by something with an exit code of 137. See the last screenshot provided for details on the cycle time.
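
Exit code 137 is 128 + SIGKILL, i.e. the container is being killed externally, typically either by the OOM killer or by the kubelet after repeated liveness-probe failures. One way to tell which, as a sketch using the pod and container names from this report:

kubectl -n kubeflow get pod ml-pipeline-9b68d49cb-x67mp \
  -o jsonpath='{.status.containerStatuses[?(@.name=="ml-pipeline-api-server")].lastState.terminated}'
# "reason: OOMKilled" means memory; "reason: Error" together with failing health
# probes points at the liveness probe killing a server that never finishes starting up.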

What did you expect to happen:

I expected the pipeline tools to install and operate normally. This has been a consistent problem going back to KF 1.1, with no adequate resolution.

Environment:

I use the kubeflow 1.3 manifests deployment approach

This install is via the Kubeflow 1.3 manifests.

NOT APPLICABLE

Anything else you would like to add:

kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server

I0723 15:20:24.692579       8 client_manager.go:154] Initializing client manager
I0723 15:20:24.692646       8 config.go:57] Config DBConfig.ExtraParams not specified, skipping

kubectl describe pod ml-pipeline-9b68d49cb-x67mp

Name:         ml-pipeline-9b68d49cb-x67mp
Namespace:    kubeflow
Priority:     0
Node:         cpu-compute-09/10.164.208.67
Start Time:   Tue, 20 Jul 2021 21:27:57 -0400
Labels:       app=ml-pipeline
              app.kubernetes.io/component=ml-pipeline
              app.kubernetes.io/name=kubeflow-pipelines
              application-crd-id=kubeflow-pipelines
              istio.io/rev=default
              pod-template-hash=9b68d49cb
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=kubeflow-pipelines
              service.istio.io/canonical-revision=latest
Annotations:  cluster-autoscaler.kubernetes.io/safe-to-evict: true
              kubectl.kubernetes.io/default-logs-container: ml-pipeline-api-server
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           10.0.4.21
IPs:
  IP:           10.0.4.21
Controlled By:  ReplicaSet/ml-pipeline-9b68d49cb
Init Containers:
  istio-init:
    Container ID:  docker://db62120288183c6d962e0bfb60db7780fa7bb8c9e231bc9f48976a10c1b29587
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 20 Jul 2021 21:28:19 -0400
      Finished:     Tue, 20 Jul 2021 21:28:19 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
Containers:
  ml-pipeline-api-server:
    Container ID:   docker://0a4b0d31179f67cc38ddb5ebb8eb31b32344c80fe9e4789ef20c073b02c5335b
    Image:          gcr.io/ml-pipeline/api-server:1.5.0
    Image ID:       docker-pullable://gcr.io/ml-pipeline/api-server@sha256:0d90705712e201ca7102336e4bd6ff794e7f76facdac2c6e82134294706d78ca
    Ports:          8888/TCP, 8887/TCP
    Host Ports:     0/TCP, 0/TCP
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    137
      Started:      Fri, 23 Jul 2021 11:13:49 -0400
      Finished:     Fri, 23 Jul 2021 11:14:34 -0400
    Ready:          False
    Restart Count:  1117
    Requests:
      cpu:      250m
      memory:   500Mi
    Liveness:   exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:  exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment Variables from:
      pipeline-api-server-config-dc9hkg52h6  ConfigMap  Optional: false
    Environment:
      KUBEFLOW_USERID_HEADER:                kubeflow-userid
      KUBEFLOW_USERID_PREFIX:                
      AUTO_UPDATE_PIPELINE_DEFAULT_VERSION:  <set to the key 'autoUpdatePipelineDefaultVersion' of config map 'pipeline-install-config'>  Optional: false
      POD_NAMESPACE:                         kubeflow (v1:metadata.namespace)
      OBJECTSTORECONFIG_SECURE:              false
      OBJECTSTORECONFIG_BUCKETNAME:          <set to the key 'bucketName' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_USER:                         <set to the key 'username' in secret 'mysql-secret'>                   Optional: false
      DBCONFIG_PASSWORD:                     <set to the key 'password' in secret 'mysql-secret'>                   Optional: false
      DBCONFIG_DBNAME:                       <set to the key 'pipelineDb' of config map 'pipeline-install-config'>  Optional: false
      DBCONFIG_HOST:                         <set to the key 'dbHost' of config map 'pipeline-install-config'>      Optional: false
      DBCONFIG_PORT:                         <set to the key 'dbPort' of config map 'pipeline-install-config'>      Optional: false
      OBJECTSTORECONFIG_ACCESSKEY:           <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'>     Optional: false
      OBJECTSTORECONFIG_SECRETACCESSKEY:     <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'>     Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
  istio-proxy:
    Container ID:  docker://6cd34842733729c0743c0ce153a6b15614da748e72a2352616cdf6d10eb9a997
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      ml-pipeline.$(POD_NAMESPACE)
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Tue, 20 Jul 2021 21:28:36 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      ml-pipeline-9b68d49cb-x67mp (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"http","containerPort":8888,"protocol":"TCP"}
                                         ,{"name":"grpc","containerPort":8887,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     ml-pipeline-api-server
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_METAJSON_ANNOTATIONS:    {"cluster-autoscaler.kubernetes.io/safe-to-evict":"true"}

      ISTIO_META_WORKLOAD_NAME:      ml-pipeline
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/ml-pipeline
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-8csrh (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-8csrh:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                       From     Message
  ----     ------     ----                      ----     -------
  Warning  BackOff    12m (x13622 over 2d13h)   kubelet  Back-off restarting failed container
  Warning  Unhealthy  8m2s (x9931 over 2d13h)   kubelet  Readiness probe failed:
  Normal   Pulled     2m58s (x1116 over 2d13h)  kubelet  Container image "gcr.io/ml-pipeline/api-server:1.5.0" already present on machine

[screenshot attached]

[screenshot attached]

Metadata-Writer Logs:

Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Failed to access the Metadata store. Exception: "upstream connect error or disconnect/reset before headers. reset reason: connection failure"
Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 63, in <module>
    mlmd_store = connect_to_mlmd()
  File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
    raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.

Labels

/area backend


Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

Bobgy commented 3 years ago

Hi @ReggieCarey! Can you give the output of kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server --previous? That should include logs from the last failed container.

Bobgy commented 3 years ago

For metadata-writer, what is the status of metadata-grpc-server? metadata-writer fails to connect to the store.
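
One way to check, as a sketch using the component label carried by that deployment's pods:

kubectl -n kubeflow get pods -l component=metadata-grpc-server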

ReggieCarey commented 3 years ago

Thanks Yuan Gong,

As per request:

Thu Aug 05 15:43:51 $ kubectl logs ml-pipeline-9b68d49cb-6z2tq ml-pipeline-api-server --previous
I0806 00:55:18.818889       7 client_manager.go:154] Initializing client manager
I0806 00:55:18.818963       7 config.go:57] Config DBConfig.ExtraParams not specified, skipping

Sadly, that's it.

Reggie

ReggieCarey commented 3 years ago

For the metadata-writer issue: the status of metadata-grpc-deployment is stable(ish), but it has restarted 533 times.

Here is the output from "describe"; the logs for "container" are empty.
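
Since the current logs are empty, the logs from the previous (crashed) instance might show why it keeps restarting; a sketch using the pod below:

# the MLMD container in this pod is literally named "container"
kubectl -n kubeflow logs metadata-grpc-deployment-c8f784fdf-hdvgr -c container --previous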

Thu Aug 05 15:42:03 $ kubectl describe pod metadata-grpc-deployment-c8f784fdf-hdvgr 
Name:         metadata-grpc-deployment-c8f784fdf-hdvgr
Namespace:    kubeflow
Priority:     0
Node:         cpu-compute-05/10.164.208.183
Start Time:   Tue, 20 Jul 2021 21:27:57 -0400
Labels:       application-crd-id=kubeflow-pipelines
              component=metadata-grpc-server
              istio.io/rev=default
              pod-template-hash=c8f784fdf
              security.istio.io/tlsMode=istio
              service.istio.io/canonical-name=metadata-grpc-deployment
              service.istio.io/canonical-revision=latest
Annotations:  kubectl.kubernetes.io/default-logs-container: container
              prometheus.io/path: /stats/prometheus
              prometheus.io/port: 15020
              prometheus.io/scrape: true
              sidecar.istio.io/status:
                {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status:       Running
IP:           10.0.0.135
IPs:
  IP:           10.0.0.135
Controlled By:  ReplicaSet/metadata-grpc-deployment-c8f784fdf
Init Containers:
  istio-init:
    Container ID:  docker://b6c7ccd562a209578a835ec5f5a0b3799e150f2c3be4e428be03788591c707e7
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x

      -b
      *
      -d
      15090,15021,15020
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 29 Jul 2021 18:15:54 -0400
      Finished:     Thu, 29 Jul 2021 18:15:54 -0400
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        10m
      memory:     40Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
Containers:
  container:
    Container ID:  docker://b5c4b93b8034fdfed2181814fd9b37ff9fe2b154cf3163a988bc4ab4f29e202c
    Image:         gcr.io/tfx-oss-public/ml_metadata_store_server:0.25.1
    Image ID:      docker-pullable://gcr.io/tfx-oss-public/ml_metadata_store_server@sha256:01691247116fe048e0761ae8033efaad3ddd82438d0198f2235afb37c1757d48
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      /bin/metadata_store_server
    Args:
      --grpc_port=8080
      --mysql_config_database=$(MYSQL_DATABASE)
      --mysql_config_host=$(MYSQL_HOST)
      --mysql_config_port=$(MYSQL_PORT)
      --mysql_config_user=$(DBCONFIG_USER)
      --mysql_config_password=$(DBCONFIG_PASSWORD)
      --enable_database_upgrade=true
    State:          Running
      Started:      Thu, 05 Aug 2021 20:36:26 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    139
      Started:      Thu, 05 Aug 2021 19:36:25 -0400
      Finished:     Thu, 05 Aug 2021 20:36:25 -0400
    Ready:          True
    Restart Count:  532
    Liveness:       tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Readiness:      tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
    Environment:
      DBCONFIG_USER:      <set to the key 'username' in secret 'mysql-secret'>               Optional: false
      DBCONFIG_PASSWORD:  <set to the key 'password' in secret 'mysql-secret'>               Optional: false
      MYSQL_DATABASE:     <set to the key 'mlmdDb' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_HOST:         <set to the key 'dbHost' of config map 'pipeline-install-config'>  Optional: false
      MYSQL_PORT:         <set to the key 'dbPort' of config map 'pipeline-install-config'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
  istio-proxy:
    Container ID:  docker://82b1c17880c0ec145d97cd26ca16526552122e7362e3fe4a0956555ebd63635b
    Image:         docker.io/istio/proxyv2:1.9.0
    Image ID:      docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --serviceCluster
      metadata-grpc-deployment.kubeflow
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
      --concurrency
      2
    State:          Running
      Started:      Thu, 29 Jul 2021 18:16:00 -0400
    Last State:     Terminated
      Reason:       Error
      Exit Code:    255
      Started:      Tue, 20 Jul 2021 21:28:03 -0400
      Finished:     Thu, 29 Jul 2021 17:51:17 -0400
    Ready:          True
    Restart Count:  1
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      10m
      memory:   40Mi
    Readiness:  http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
    Environment:
      JWT_POLICY:                    third-party-jwt
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      metadata-grpc-deployment-c8f784fdf-hdvgr (v1:metadata.name)
      POD_NAMESPACE:                 kubeflow (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      CANONICAL_SERVICE:              (v1:metadata.labels['service.istio.io/canonical-name'])
      CANONICAL_REVISION:             (v1:metadata.labels['service.istio.io/canonical-revision'])
      PROXY_CONFIG:                  {}

      ISTIO_META_POD_PORTS:          [
                                         {"name":"grpc-api","containerPort":8080,"protocol":"TCP"}
                                     ]
      ISTIO_META_APP_CONTAINERS:     container
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      metadata-grpc-deployment
      ISTIO_META_OWNER:              kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/metadata-grpc-deployment
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
      /var/run/secrets/tokens from istio-token (rw)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
      limits.cpu -> cpu-limit
      requests.cpu -> cpu-request
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-4qdbv:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason   Age                   From     Message
  ----    ------   ----                  ----     -------
  Normal  Pulled   31m (x350 over 7d2h)  kubelet  Container image "gcr.io/tfx-oss-public/ml_metadata_store_server:0.25.1" already present on machine
  Normal  Created  31m (x350 over 7d2h)  kubelet  Created container container
  Normal  Started  31m (x350 over 7d2h)  kubelet  Started container container

ReggieCarey commented 3 years ago

Both servers are part of the istio service mesh. In the past, the mysql process was implicated. That process remains up and stable with 0 restarts.

[screenshot attached]

Both pods reach a Running state and then fail, but their failures are not synchronized.

I should also add that cache-server-* has restarted some 132 times in 15 days. Its logs end with the following:

[mysql] 2021/08/05 17:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 18:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 19:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 20:34:16 packets.go:36: unexpected EOF
F0805 20:34:16.890285       1 error.go:325] driver: bad connection
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc00043e300, 0xc000168000, 0x43, 0x96)
    /go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:769 +0xb8
github.com/golang/glog.(*loggingT).output(0x28daaa0, 0xc000000003, 0xc000412000, 0x280d034, 0x8, 0x145, 0x0)
    /go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:720 +0x372
github.com/golang/glog.(*loggingT).printf(0x28daaa0, 0x3, 0x199ee22, 0x2, 0xc0002f38d8, 0x1, 0x1)
    /go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(...)
    /go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:1148
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(...)
    /go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:325
main.initMysql(0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, 0x4, ...)
    /go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:157 +0x4f1
main.initDBClient(0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, 0x4, ...)
    /go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:71 +0x681
main.(*ClientManager).init(0xc0002f3e98, 0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, ...)
    /go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:57 +0x80
main.NewClientManager(...)
    /go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:169
main.main()
    /go/src/github.com/kubeflow/pipelines/backend/src/cache/main.go:77 +0x66e

Maybe this is an issue with istio and service mesh configuration on bare metal.

In past incarnations of this bug, a fix was offered to establish a PeerAuthentication policy; this resource does not exist in my cluster. The old suggestion was to apply:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: "default"
spec:
  mtls:
    mode: STRICT

I do not know if this is still applicable in the Istio 1.9.0 world.
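
For what it's worth, the security.istio.io/v1beta1 PeerAuthentication API is still the current one in Istio 1.9. A sketch of applying it scoped to the kubeflow namespace (whether STRICT is actually the right mode here is exactly the open question):

kubectl apply -f - <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: kubeflow
spec:
  mtls:
    mode: STRICT
EOF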

ReggieCarey commented 3 years ago

Any progress or ideas or things to try? KFP probably represents one of the most beneficial parts of Kubeflow 1.3.0 to us. As of now, all I have is Jupyter Notebooks with a Kubeflow Dashboard wrapper. I can't use KFP at all.

ReggieCarey commented 3 years ago

It's been 17 days and I have not seen any movement on this bug. I really want to get this resolved. I have switched my Istio service mesh to require mTLS (STRICT) everywhere. I have verified that the processes listed here, as well as the mysql process, are all within the service mesh, as evidenced by istio-proxy and istio-init being injected into the three pods.

This really does appear to be a problem with access to the mysql store.

My next experiment will be to stand up an Ubuntu container with a mysql client in the kubeflow namespace. From there I hope to validate connectivity (a sketch of this check follows below).

I can see two outcomes:

1) Connectivity works
2) Connectivity fails

In both cases the next step is still: what do I do given this additional information?
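
A minimal sketch of that connectivity check. Assumptions: the in-cluster MySQL service is named mysql on port 3306 in the kubeflow namespace, and the credentials are the ones in the mysql-secret referenced by the pipeline pods (root is assumed here).

# run a throwaway client pod and try to reach MySQL through the mesh
kubectl -n kubeflow run mysql-client --rm -it --restart=Never --image=mysql:5.7 -- \
  mysql -h mysql.kubeflow.svc.cluster.local -P 3306 -u root -p
# user name is an assumption; use the values from mysql-secret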

As an FYI, the mysql process' istio-proxy shows the following in the logs:

2021-08-26T15:15:46.314059Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T15:44:53.280933Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2021-08-26T15:44:53.515221Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T16:12:54.444630Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2021-08-26T16:12:54.806062Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T16:45:50.518921Z warning envoy config    StreamAggregatedResources gRPC config stream closed: 0, 
2021-08-26T16:45:50.786400Z info    xdsproxy    connected to upstream XDS server: istiod.istio-system.svc:15012

ReggieCarey commented 3 years ago

UPDATE:

Was able to get to partial success:

What I did was edit KubeFlow/v1.3.0/manifests/apps/pipeline/upstream/third-party/mysql/base/mysql-deployment.yaml and add:

spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"

(NOTE: I also changed to use image: mysql:8 - but I don't think this is the issue)

And then I undeployed and redeployed KFP; I know I could have just applied the changes.

$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl delete -f -
$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -

The downside is that metadata-grpc-deployment and metadata-writer now fail.

for metadata-grpc-deployment, the log reads:

2021-08-26 21:43:31.894833: F ml_metadata/metadata_store/metadata_store_server_main.cc:220] Non-OK-status: status status: Internal: mysql_real_connect failed: errno: 2059, error: Plugin caching_sha2_password could not be loaded: lib/mariadb/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directoryMetadataStore cannot be created with the given connection config.

for metadata-writer, the logs read:

Failed to access the Metadata store. Exception: "no healthy upstream"
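
The caching_sha2_password error in the metadata-grpc-deployment log above is MySQL 8's default authentication plugin, which the MariaDB client library inside ml_metadata_store_server cannot load, so it is consistent with the switch to image: mysql:8. A sketch of a workaround, if keeping mysql:8 rather than reverting to the manifests' default image, is to start mysqld with the legacy plugin as the default (the container name below is an assumption about mysql-deployment.yaml):

spec:
  template:
    spec:
      containers:
      - name: mysql   # assumed container name
        args:         # keep any args the manifest already sets
        - --default-authentication-plugin=mysql_native_password

Note this only changes the default for newly created users; accounts already created with caching_sha2_password would still need an ALTER USER.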

Next I tried to use "platform-agnostic-multi-user-legacy"...

$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-legacy | kubectl apply -f -

And all processes are now running, except that this now shows up:

[screenshot attached]

Again: any suggestions and assistance are highly appreciated.

midhun1998 commented 3 years ago

I can confirm that this issue is seen with KF 1.3.1 too. The bug is very annoying, as KFP remains inaccessible.

Bobgy commented 3 years ago

Ohh, sorry this fell through the cracks. Let me take a look tomorrow.

Bobgy commented 3 years ago

/assign @zijianjoy This looks like an Istio compatibility issue.

Bobgy commented 3 years ago

When you disable sidecar injection, also find all the destination rules and delete the destination rule for mysql. Otherwise, all other clients will fail to access MySQL assuming mTLS is turned on.

Edit: this is a workaround by pulling MySQL out of the mesh.
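
A sketch of that cleanup; the DestinationRule name for MySQL varies between manifest versions, so list them first and delete whichever one targets the MySQL service:

kubectl -n kubeflow get destinationrules.networking.istio.io
kubectl -n kubeflow delete destinationrule <the-rule-targeting-mysql>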

Bobgy commented 3 years ago

If you want MySQL in the mesh, you need to check the Istio documentation for troubleshooting instructions. I agree Istio is very hard to troubleshoot; I had the same frustration when configuring this.

shawnzhu commented 2 years ago

I had a similar problem with metadata-writer: after the pod of the metadata-grpc-deployment deployment got its Envoy sidecar, metadata-writer stopped crash-looping.

So please check that every component in the kubeflow namespace has the required Envoy sidecar, especially when you have enforced strict mTLS via PeerAuthentication. A quick check is sketched below.
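
A sketch of that check: list each pod with its containers and flag the ones missing the istio-proxy sidecar.

kubectl -n kubeflow get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
  | grep -v istio-proxy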

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

zengqingfu1442 commented 1 year ago

I also encountered this issue. My env: k8s 1.19.16, kubeflow manifests v1.5.1 tag, kustomize 3.2.0.

install command:

while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done

metadata-grpc-deployment-6f6f7776c5-pq6hq log:

WARNING: Logging before InitGoogleLogging() is written to STDERR
F0321 09:45:37.213575     1 metadata_store_server_main.cc:236] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error:  [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***

mysql-f7b9b7dd4-5m4tb logs:

2023-03-21 09:47:24+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.33-1debian10 started.
2023-03-21 09:47:27+00:00 [ERROR] [Entrypoint]: mysqld failed while attempting to check config
        command was: mysqld --ignore-db-dir=lost+found --datadir /var/lib/mysql --verbose --help --log-bin-index=/tmp/tmp.fbqLrtywDf

metadata-writer-d7ff8d4bc-8thjn logs:

Failed to access the Metadata store. Exception: "no healthy upstream"
Failed to access the Metadata store. Exception: "no healthy upstream"
Failed to access the Metadata store. Exception: "no healthy upstream"
Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 63, in <module>
    mlmd_store = connect_to_mlmd()
  File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
    raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.

I also tried k8s 1.21.2 and still have the same problem.

millerhooks commented 1 year ago

NOTE: Everything I wrote below is totally not useful to this thread about cloud deployments. I just picked this issue out of the five I was looking at. I seriously thought it was contextually appropriate, but it is not. The information below may be useful to you, but it was essentially put here by accident. I also think it's useful info and I don't want to just delete it. I will move what I've said below into its own issue tomorrow.


I have this problem on some systems and believe it is because of the fs.inotify max_user_{instances,watches} settings, which can be fixed with:

sudo sysctl fs.inotify.max_user_instances=1280 
sudo sysctl fs.inotify.max_user_watches=655360

I have found that these should be set before deploying the cluster and Kubeflow. If a cluster is already running you can kill the pods that get hung, but in my experience that has been a waste of time compared to just starting over from a clean slate.
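
Since sysctl values set this way do not survive a reboot, one way to persist them before (re)deploying, as a sketch assuming a systemd-style host such as Ubuntu where /etc/sysctl.d is read at boot:

cat <<'EOF' | sudo tee /etc/sysctl.d/99-kubeflow-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
EOF
sudo sysctl --system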

I also had this issue on a machine running off an SSD booted over USB-C, and raising the max_user_* limits didn't fix it. I still don't really "know" why I couldn't get that system to work, but basically I think this problem amounts to any situation where disk throughput gets maxed out.

Kubeflow is a massive beast that does a LOT of stuff, and on many machines you will hit limits where it can't function; it usually looks like what you've posted here. I know that's not the most technical solution to an open issue, but the problem is kind of squishy and I think it can occur for multiple reasons.

millerhooks commented 1 year ago

I believe this problem is consistently reproduced by following the Charmed Kubeflow documentation on a fresh Ubuntu 22.04 or 20.04 system. I would be surprised if anyone can follow the current installation instructions with any success; the range of systems I'm testing on right now includes some pretty high-end server hardware and a couple of consumer desktop machines.

Not that this likely matters, but I am doing GPU work, so I've enabled GPU support with microk8s. I can easily pass validation and push up a container to run nvidia-smi, so I don't think that is causing the issue. I mention this because one of my systems had an infoROM issue on one GPU for a while, and if I didn't disable it, containers would fail to launch and the failures would sometimes look a lot like this issue (depending on a lot of factors). If anyone has an infoROM issue on a GPU: when you search for that, Nvidia's official folks tell you it's unserviceable, which is untrue. You just have to use nvflash, a Windows tool; for some reason you can't download it directly from Nvidia, and it's only available from some websites that IMO look pretty sketchy. Use at your own risk.

Since I do have a running microk8s 1.7 kubeflow setup, I think I should be able to solve this and commit back but I'm pretty certain the current Charmed Kubeflow setup instructions are very broken.

millerhooks commented 1 year ago

The numbers I used above were not sufficient to solve the "too many open files" problem. I just ripped them from the issue below, and at one point a couple of months ago they did work for me. I would be very interested to know what changed in the packages that made the instances/watches requirements increase, but I'm very much too busy to formally care. Here's the issue I got these numbers from:

https://github.com/kubeflow/manifests/issues/2087

I have no idea how to calculate what to raise the numbers to, or the consequences of raising the limit too high. I just upped the leading digit by one.

sudo sysctl fs.inotify.max_user_instances=2280 
sudo sysctl fs.inotify.max_user_watches=755360

And I totally forgot about this other issue I've had: GVFS backends! There is a client that runs on a default Ubuntu Desktop install for each logged-in user. It monitors the disk, and when you fire up Kubeflow it flips out and takes up 120-200% of the CPU for each user. I do not need this tool for my deployments. I'm unsure whether this is a problem on a bare-bones server install, but this is critical to solving the problem of launching Kubeflow on a freshly installed Ubuntu Desktop 20.04/22.04 system.

sudo apt-get remove gvfs-backends -y

Hooray. I think this solves a reproducibility problem I've had for over a month. I haven't quite figured out if there are licensing issues with distributing this, but I've built an Ubuntu LiveCD with CUDA, Docker, and Kubeflow (microk8s). I'll ping folks on Slack and see if there's any interest in it. I've got a solid pipeline for doing the CI/CD for releases and, man, the little things to get going are really a big barrier.

It is very possible that the max_user_* limits don't need to be raised so high if gvfs-backends is removed from the start, but I will not be trying to figure that out explicitly in the near term.

millerhooks commented 1 year ago

Also tensorboard-controller was failing because the ingress wasn't coming up correctly. That has been covered in a ton of issues.

https://discourse.charmhub.io/t/charmed-kubeflow-upgrade-error/7007/7

juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"

That solves it. I've tested this on multiple machines now, and raising the max_user_* limits, uninstalling gvfs-backends, and fixing the ingress with the above command solves all of the problems consistently. I'm working on a Zero to Kubeflow tutorial, but I'll submit a PR for the Charmed Kubeflow instructions that covers these things if someone can point me at where to submit it.

I am realizing after some review that in this situation, and in the other issues I've read relating to a similar failure, most people are running in the cloud and not on bare metal. I do think the gist of what I've pointed out is still valid, but on this thread what I've posted is just not directly useful. It's squishy. These problems are usually related to disk throughput, but that gets weird to sort out in the cloud. Anyway... I am realizing that all of what I said above has nothing to do with this issue. Sorry for posting all this in the wrong place. I don't know where else to put this info, so I'm going to leave it here for now.

I'll collect it and put it into its own issue tomorrow and remove it from this thread. Sorry if this confused anyone. It's been a long day.

thesuperzapper commented 1 year ago

If you are using an external MySQL database (especially if it's MySQL 8.0), you are likely experiencing this issue around support for caching_sha2_password authentication; see here for more:

FYI, Kubeflow Pipelines itself fixed MySQL 8.0 and caching_sha2_password support in 2.0.0-alpha.7 (ships with Kubeflow 1.7), but there is still an issue with the upstream ml-metadata project:

JuanPabloSebey commented 1 year ago

I had the same problem. For me it was a Cilium networking provider compatibility issue. I had to move to kubenet and it worked.

skvishy27 commented 1 year ago

Can you elaborate on the compatibility issue, please? I'm using Cilium as well.

taenzeyang commented 9 months ago

> I had the same problem. For me it was a cilium networking provider compatibility issue. I had to move to kubenet and it worked.

Maybe you have Cilium's kubeProxyReplacement feature enabled; you can disable this feature or set --config bpf-lb-sock-hostns-only=true. Ref: https://docs.cilium.io/en/latest/network/servicemesh/istio/#cilium-configuration
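
A sketch of applying that setting on an existing cluster, assuming Cilium keeps its configuration in the cilium-config ConfigMap in kube-system (the default layout; Helm installs can set socketLB.hostNamespaceOnly=true instead):

kubectl -n kube-system patch configmap cilium-config --type merge \
  -p '{"data":{"bpf-lb-sock-hostns-only":"true"}}'
# restart the Cilium agents so they pick up the new setting
kubectl -n kube-system rollout restart daemonset/cilium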

rimolive commented 6 months ago

This issue looks like it contains more than one problem and resolution, so it's better to close it; if there are still pending issues, let's open a new one to have a cleaner discussion thread.

/close

google-oss-prow[bot] commented 6 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/6121#issuecomment-1991141422):

> This issue looks like there are more than problems and resolutions so it's better to close it and if there are still pending issues let's open a new one to have a cleaner discussion thread.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.