Closed Sundragon1993 closed 9 months ago
Try increasing the memory limit. In my case I increased it to 6 GB (more is better) and it worked.
For me it was a Cilium networking-provider compatibility issue. I had to move to kubenet and it worked.
> Try increasing the memory limit. In my case I increased it to 6 GB (more is better) and it worked.

How do we increase memory when the deployment does not have resources specified?
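If the Deployment has no `resources` block at all, you can add one yourself. A minimal sketch of the fragment to add under the container spec (the container name and the sizes here are assumptions — check your actual manifests, e.g. with `kubectl -n kubeflow edit deploy ml-pipeline`):

```yaml
# Hypothetical patch fragment: add requests/limits to the API-server container.
spec:
  template:
    spec:
      containers:
        - name: ml-pipeline-api-server   # adjust to the container name in your Deployment
          resources:
            requests:
              memory: "1Gi"
            limits:
              memory: "6Gi"              # the value that reportedly worked above
```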
/close
@juliusvonkohout: Closing this issue.
Please help. My env: k8s 1.24.12, kf 1.7.0, cilium v1.10.0-rc0.
NAMESPACE NAME READY STATUS RESTARTS AGE
auth dex-8579644bbb-hqf2c 1/1 Running 0 17m
cert-manager cert-manager-7475574-tgl9f 1/1 Running 0 19m
cert-manager cert-manager-cainjector-d5dc6cd7f-xcgzs 1/1 Running 0 19m
cert-manager cert-manager-webhook-6868bd8b7-xknbj 1/1 Running 0 19m
gvh453 ml-pipeline-ui-artifact-5b4465bcb7-vphbv 2/2 Running 0 12m
gvh453 ml-pipeline-visualizationserver-5568776585-xhl4g 2/2 Running 0 12m
istio-system authservice-0 1/1 Running 0 17m
istio-system cluster-local-gateway-757849494c-j48d7 1/1 Running 0 17m
istio-system istio-ingressgateway-cf7bd56f-x9bsz 1/1 Running 0 17m
istio-system istiod-586fcd6677-sbcjx 1/1 Running 0 17m
knative-eventing eventing-controller-5b7bfc8895-cjdd6 1/1 Running 0 17m
knative-eventing eventing-webhook-5896d776b-fttzv 1/1 Running 0 17m
knative-serving activator-5bbf976855-r5t9m 2/2 Running 0 16m
knative-serving autoscaler-5cc8b77f4d-746zc 2/2 Running 0 16m
knative-serving controller-657b7bb75c-bnktt 2/2 Running 0 16m
knative-serving domain-mapping-6c4878cc54-sqdzv 2/2 Running 0 16m
knative-serving domainmapping-webhook-f76bcd89f-gd2q8 2/2 Running 0 16m
knative-serving net-istio-controller-6cb499fccb-vpn6p 2/2 Running 0 16m
knative-serving net-istio-webhook-6858cd8998-ffrxc 2/2 Running 0 16m
knative-serving webhook-76f9bc6584-7rpkg 2/2 Running 0 16m
kube-system cilium-operator-7c748dbd5d-5cl6x 1/1 Running 0 20m
kube-system cilium-rl6tq 1/1 Running 0 20m
kube-system coredns-57575c5f89-fqcn8 1/1 Running 0 22m
kube-system coredns-57575c5f89-xndm2 1/1 Running 0 22m
kube-system etcd-e2m122-pc 1/1 Running 0 22m
kube-system kube-apiserver-e2m122-pc 1/1 Running 0 22m
kube-system kube-controller-manager-e2m122-pc 1/1 Running 0 22m
kube-system kube-proxy-85gmm 1/1 Running 0 22m
kube-system kube-scheduler-e2m122-pc 1/1 Running 0 22m
kube-system nvidia-device-plugin-daemonset-fbmrn 1/1 Running 0 22m
kubeflow-user-example-com ml-pipeline-ui-artifact-5b4465bcb7-p8rzk 2/2 Running 0 15m
kubeflow-user-example-com ml-pipeline-visualizationserver-5568776585-4svcc 2/2 Running 0 15m
kubeflow admission-webhook-deployment-cb6db9648-brdcl 1/1 Running 0 16m
kubeflow cache-server-86584db5d8-sbkz6 2/2 Running 0 16m
kubeflow centraldashboard-dd9c778b6-dxw2j 2/2 Running 0 16m
kubeflow jupyter-web-app-deployment-cc9cbc696-4s48h 2/2 Running 0 16m
kubeflow katib-controller-86d4d45478-7tw7l 1/1 Running 0 16m
kubeflow katib-db-manager-689cdf95c6-5xr8n 1/1 Running 0 16m
kubeflow katib-mysql-5bc98798b4-xkhtl 1/1 Running 0 16m
kubeflow katib-ui-b5d5cf978-8l9bk 2/2 Running 1 (16m ago) 16m
kubeflow kserve-controller-manager-7879bf6dd7-9f2qr 2/2 Running 0 16m
kubeflow kserve-models-web-app-f9c576856-kxd2r 2/2 Running 0 16m
kubeflow kubeflow-pipelines-profile-controller-5dd5468d9b-zwcdq 1/1 Running 0 16m
kubeflow metacontroller-0 1/1 Running 0 16m
kubeflow metadata-envoy-deployment-76c587bd47-29hlm 1/1 Running 0 16m
kubeflow metadata-grpc-deployment-5c8599b99c-g6qmq 1/2 CrashLoopBackOff 8 (117s ago) 16m
kubeflow metadata-writer-6c576c94b8-7qtb4 2/2 Running 6 (3m6s ago) 16m
kubeflow minio-6d6d45469f-8gq2h 2/2 Running 0 16m
kubeflow ml-pipeline-77d4d9974b-9dtlb 1/2 CrashLoopBackOff 7 (2m6s ago) 16m
kubeflow ml-pipeline-persistenceagent-75bccd8b64-tjf4n 2/2 Running 0 16m
kubeflow ml-pipeline-scheduledworkflow-6dfcd5dd89-q597m 2/2 Running 0 16m
kubeflow ml-pipeline-ui-5ddb5b76d8-rrxmb 2/2 Running 0 16m
kubeflow ml-pipeline-viewer-crd-86cbc45d9b-dzlf6 2/2 Running 1 (16m ago) 16m
kubeflow ml-pipeline-visualizationserver-5577c64b45-h7864 2/2 Running 0 16m
kubeflow mysql-6878bbff69-xdz42 2/2 Running 0 16m
kubeflow notebook-controller-deployment-699589b4f9-67qk8 2/2 Running 1 (16m ago) 16m
kubeflow profiles-deployment-74f656c59f-ndqvp 3/3 Running 1 (16m ago) 16m
kubeflow tensorboard-controller-deployment-5655cc9dbb-5r22h 3/3 Running 1 (16m ago) 16m
kubeflow tensorboards-web-app-deployment-8474fd9569-5g9v7 2/2 Running 0 16m
kubeflow training-operator-7f768bbbdb-ww55q 1/1 Running 0 16m
kubeflow volumes-web-app-deployment-7b998df674-mwqdz 2/2 Running 0 16m
kubeflow workflow-controller-78c979dc75-l2rht 2/2 Running 1 (16m ago) 16m
local-path-storage local-path-provisioner-8f77648b6-db2r7 1/1 Running 0 22m
================================== metadata-grpc-deployment ==================================
Name: metadata-grpc-deployment-5c8599b99c-g6qmq
Namespace: kubeflow
Priority: 0
Node: e2m122-pc/192.168.0.80
Start Time: Thu, 06 Apr 2023 18:01:26 +0800
Labels:
application-crd-id=kubeflow-pipelines
component=metadata-grpc-server
pod-template-hash=5c8599b99c
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=metadata-grpc-deployment
service.istio.io/canonical-revision=latest
Annotations:
kubectl.kubernetes.io/default-container: container
kubectl.kubernetes.io/default-logs-container: container
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status: {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-env...
Status: Running
IP: 10.0.0.230
IPs:
IP: 10.0.0.230
Controlled By: ReplicaSet/metadata-grpc-deployment-5c8599b99c
Init Containers:
istio-init:
Container ID: containerd://98176fc0f962dadb3e2e4c2ba2042e1871042c762e3043d7f62b18107f794b1f
Image: docker.io/istio/proxyv2:1.16.0
Image ID: docker.io/istio/proxyv2@sha256:f6f97fa4fb77a3cbe1e3eca0fa46bd462ad6b284c129cf57bf91575c4fb50cf9
Port:
Host Port:
Args:
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 06 Apr 2023 18:01:54 +0800
Finished: Thu, 06 Apr 2023 18:01:54 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment:
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5r2rx (ro)
Containers:
container:
Container ID: containerd://7cc284f4f5373b167047d0087ee8c0931c5290efe92b30426b4f82458854064e
Image: gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0
Image ID: gcr.io/tfx-oss-public/ml_metadata_store_server@sha256:db8691752b4cd02658e4bb28b73d34a18ba71f49d6cc124a47c0c5001f8c0f83
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/metadata_store_server
Args:
--grpc_port=8080
--mysql_config_database=$(MYSQL_DATABASE)
--mysql_config_host=$(MYSQL_HOST)
--mysql_config_port=$(MYSQL_PORT)
--mysql_config_user=$(DBCONFIG_USER)
--mysql_config_password=$(DBCONFIG_PASSWORD)
--enable_database_upgrade=true
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 06 Apr 2023 18:09:05 +0800
Finished: Thu, 06 Apr 2023 18:09:50 +0800
Ready: False
Restart Count: 6
Liveness: http-get http://:15020/app-health/container/livez delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: http-get http://:15020/app-health/container/readyz delay=3s timeout=2s period=5s #success=1 #failure=3
Environment:
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-secret'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-secret'> Optional: false
MYSQL_DATABASE: <set to the key 'mlmdDb' of config map 'pipeline-install-config'> Optional: false
MYSQL_HOST: <set to the key 'dbHost' of config map 'pipeline-install-config'> Optional: false
MYSQL_PORT: <set to the key 'dbPort' of config map 'pipeline-install-config'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-5r2rx (ro)
istio-proxy:
Container ID: containerd://0dabec7c855d12867e76dcfad08f863a1e755624430cfc145fcfe272ad4c97ff
Image: docker.io/istio/proxyv2:1.16.0
Image ID: docker.io/istio/proxyv2@sha256:f6f97fa4fb77a3cbe1e3eca0fa46bd462ad6b284c129cf57bf91575c4fb50cf9
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Thu, 06 Apr 2023 18:01:55 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: metadata-grpc-deployment-5c8599b99c-g6qmq (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
PROXY_CONFIG: {}
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
workload-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
credential-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
workload-certs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit:
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-5r2rx:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
Normal Scheduled 10m default-scheduler Successfully assigned kubeflow/metadata-grpc-deployment-5c8599b99c-g6qmq to e2m122-pc
Normal Pulled 10m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Created 10m kubelet Created container istio-init
Normal Started 10m kubelet Started container istio-init
Normal Pulled 10m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Created 10m kubelet Created container istio-proxy
Normal Started 10m kubelet Started container istio-proxy
Normal Started 9m59s (x3 over 10m) kubelet Started container container
Normal Pulled 9m32s (x4 over 10m) kubelet Container image "gcr.io/tfx-oss-public/ml_metadata_store_server:1.5.0" already present on machine
Normal Created 9m32s (x4 over 10m) kubelet Created container container
Warning BackOff 1s (x42 over 10m) kubelet Back-off restarting failed container
================================== ML pipeline ==================================
Name: ml-pipeline-77d4d9974b-9dtlb
Namespace: kubeflow
Priority: 0
Node: e2m122-pc/192.168.0.80
Start Time: Thu, 06 Apr 2023 18:01:23 +0800
Labels:
app=ml-pipeline
app.kubernetes.io/component=ml-pipeline
app.kubernetes.io/name=kubeflow-pipelines
application-crd-id=kubeflow-pipelines
pod-template-hash=77d4d9974b
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=kubeflow-pipelines
service.istio.io/canonical-revision=latest
Annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: true
kubectl.kubernetes.io/default-container: ml-pipeline-api-server
kubectl.kubernetes.io/default-logs-container: ml-pipeline-api-server
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status: {"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["workload-socket","credential-socket","workload-certs","istio-env...
Status: Running
IP: 10.0.0.145
IPs:
IP: 10.0.0.145
Controlled By: ReplicaSet/ml-pipeline-77d4d9974b
Init Containers:
istio-init:
Container ID: containerd://4ef4038065e34337f3a298dd99f5706dc3d542bac13a81a66df32acb62f5e25b
Image: docker.io/istio/proxyv2:1.16.0
Image ID: docker.io/istio/proxyv2@sha256:f6f97fa4fb77a3cbe1e3eca0fa46bd462ad6b284c129cf57bf91575c4fb50cf9
Port:
Host Port:
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
Containers:
ml-pipeline-api-server:
Container ID: containerd://9dd354f638060b788593d8471e045286acee612595deb280a1fca7bdb658f9fe
Image: gcr.io/ml-pipeline/api-server:2.0.0-alpha.7
Image ID: gcr.io/ml-pipeline/api-server@sha256:3b75be9180bad7ac56017a554a4a9402e57b333a48e8bd83c8614f69babee032
Ports: 8888/TCP, 8887/TCP
Host Ports: 0/TCP, 0/TCP
State: Running
Started: Thu, 06 Apr 2023 18:10:26 +0800
Last State: Terminated
Reason: Error
Exit Code: 137
Started: Thu, 06 Apr 2023 18:08:56 +0800
Finished: Thu, 06 Apr 2023 18:10:26 +0800
Ready: False
Restart Count: 6
Requests:
cpu: 250m
memory: 500Mi
Liveness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=3s timeout=2s period=5s #success=1 #failure=3
Startup: exec [wget -q -S -O - http://localhost:8888/apis/v1beta1/healthz] delay=0s timeout=2s period=5s #success=1 #failure=12
Environment Variables from:
pipeline-api-server-config-dc9hkg52h6 ConfigMap Optional: false
Environment:
KUBEFLOW_USERID_HEADER: kubeflow-userid
KUBEFLOW_USERID_PREFIX:
AUTO_UPDATE_PIPELINE_DEFAULT_VERSION: <set to the key 'autoUpdatePipelineDefaultVersion' of config map 'pipeline-install-config'> Optional: false
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
OBJECTSTORECONFIG_SECURE: false
OBJECTSTORECONFIG_BUCKETNAME: <set to the key 'bucketName' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-secret'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-secret'> Optional: false
DBCONFIG_DBNAME: <set to the key 'pipelineDb' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_HOST: <set to the key 'dbHost' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_PORT: <set to the key 'dbPort' of config map 'pipeline-install-config'> Optional: false
DBCONFIG_CONMAXLIFETIME: <set to the key 'ConMaxLifeTime' of config map 'pipeline-install-config'> Optional: false
OBJECTSTORECONFIG_ACCESSKEY: <set to the key 'accesskey' in secret 'mlpipeline-minio-artifact'> Optional: false
OBJECTSTORECONFIG_SECRETACCESSKEY: <set to the key 'secretkey' in secret 'mlpipeline-minio-artifact'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-zjklj (ro)
istio-proxy:
Container ID: containerd://55a8a7ae8c8817404e9b07cf966efa767f76c1f0c67606aca7facdab7af430aa
Image: docker.io/istio/proxyv2:1.16.0
Image ID: docker.io/istio/proxyv2@sha256:f6f97fa4fb77a3cbe1e3eca0fa46bd462ad6b284c129cf57bf91575c4fb50cf9
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Thu, 06 Apr 2023 18:01:27 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: ml-pipeline-77d4d9974b-9dtlb (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
PROXY_CONFIG: {}
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
workload-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
credential-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
workload-certs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit:
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-zjklj:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional:
DownwardAPI: true
QoS Class: Burstable
Node-Selectors:
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
Normal Scheduled 10m default-scheduler Successfully assigned kubeflow/ml-pipeline-77d4d9974b-9dtlb to e2m122-pc
Warning FailedMount 10m kubelet MountVolume.SetUp failed for volume "istiod-ca-cert" : failed to sync configmap cache: timed out waiting for the condition
Normal Pulled 10m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Created 10m kubelet Created container istio-init
Normal Started 10m kubelet Started container istio-init
Normal Created 10m kubelet Created container istio-proxy
Normal Started 10m kubelet Started container ml-pipeline-api-server
Normal Pulled 10m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Started 10m kubelet Started container istio-proxy
Normal Killing 9m6s kubelet Container ml-pipeline-api-server failed startup probe, will be restarted
Normal Created 8m36s (x2 over 10m) kubelet Created container ml-pipeline-api-server
Normal Pulled 8m36s (x2 over 10m) kubelet Container image "gcr.io/ml-pipeline/api-server:2.0.0-alpha.7" already present on machine
Warning Unhealthy 6s (x84 over 10m) kubelet Startup probe failed:
============================= MetaWriter =============================
Events:
Type Reason Age From Message
Normal Scheduled 39m default-scheduler Successfully assigned kubeflow/metadata-writer-6c576c94b8-7qtb4 to e2m122-pc
Warning FailedCreatePodSandBox 38m kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "2f11a598c5b23a60ceebd496250ae8537fe0283678ceeb56b37951bc64afdb4f": plugin type="cilium-cni" name="cilium" failed (add): Unable to create endpoint: response status code does not match any response statuses defined for this endpoint in the swagger spec (status 429): {}
Normal Pulled 38m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Created 38m kubelet Created container istio-init
Normal Started 38m kubelet Started container istio-init
Normal Pulled 38m kubelet Container image "docker.io/istio/proxyv2:1.16.0" already present on machine
Normal Created 38m kubelet Created container istio-proxy
Normal Started 38m kubelet Started container istio-proxy
Normal Pulled 32m (x4 over 38m) kubelet Container image "gcr.io/ml-pipeline/metadata-writer:2.0.0-alpha.7" already present on machine
Normal Created 32m (x4 over 38m) kubelet Created container main
Normal Started 32m (x4 over 38m) kubelet Started container main
Warning BackOff 3m39s (x94 over 35m) kubelet Back-off restarting failed container
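A side note on the `Exit Code: 137` shown for the crashing containers in the describe output above: exit codes above 128 mean the container died from a signal (code − 128), and 137 corresponds to signal 9 (SIGKILL) — typically the kernel OOM-killer or the kubelet killing the container, which is consistent with the suggestion to raise the memory limit. A quick way to decode it:

```shell
# Container exit codes above 128 encode a fatal signal: code = 128 + signal number.
exit_code=137
signal=$((exit_code - 128))
echo "killed by signal $signal ($(kill -l "$signal"))"   # signal 9 is SIGKILL
```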