Closed ReggieCarey closed 6 months ago
Hi @ReggieCarey!
Can you give the output of kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server --previous? That should include the logs from the last failed container.
For metadata-writer, what is the status of metadata-grpc-server? metadata-writer fails to connect to the store.
Thanks Yuan Gong,
As per request:
Thu Aug 05 15:43:51 $ kubectl logs ml-pipeline-9b68d49cb-6z2tq ml-pipeline-api-server --previous
I0806 00:55:18.818889 7 client_manager.go:154] Initializing client manager
I0806 00:55:18.818963 7 config.go:57] Config DBConfig.ExtraParams not specified, skipping
Sadly, that's it.
Reggie
For the metadata-writer issue: the status of metadata-grpc-deployment is stable(ish). It has restarted 533 times.
Here is the output from "describe"; the logs for the "container" container are empty:
metacontroller-0
metadata-envoy-deployment-6756c995c9-fl7gb
metadata-grpc-deployment-c8f784fdf-hdvgr
metadata-writer-6bf5cfd7d8-v7fzb
Thu Aug 05 15:42:03 $ kubectl describe pod metadata-grpc-deployment-c8f784fdf-hdvgr
Name: metadata-grpc-deployment-c8f784fdf-hdvgr
Namespace: kubeflow
Priority: 0
Node: cpu-compute-05/10.164.208.183
Start Time: Tue, 20 Jul 2021 21:27:57 -0400
Labels: application-crd-id=kubeflow-pipelines
component=metadata-grpc-server
istio.io/rev=default
pod-template-hash=c8f784fdf
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=metadata-grpc-deployment
service.istio.io/canonical-revision=latest
Annotations: kubectl.kubernetes.io/default-logs-container: container
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/status:
{"initContainers":["istio-init"],"containers":["istio-proxy"],"volumes":["istio-envoy","istio-data","istio-podinfo","istio-token","istiod-...
Status: Running
IP: 10.0.0.135
IPs:
IP: 10.0.0.135
Controlled By: ReplicaSet/metadata-grpc-deployment-c8f784fdf
Init Containers:
istio-init:
Container ID: docker://b6c7ccd562a209578a835ec5f5a0b3799e150f2c3be4e428be03788591c707e7
Image: docker.io/istio/proxyv2:1.9.0
Image ID: docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
State: Terminated
Reason: Completed
Exit Code: 0
Started: Thu, 29 Jul 2021 18:15:54 -0400
Finished: Thu, 29 Jul 2021 18:15:54 -0400
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
Containers:
container:
Container ID: docker://b5c4b93b8034fdfed2181814fd9b37ff9fe2b154cf3163a988bc4ab4f29e202c
Image: gcr.io/tfx-oss-public/ml_metadata_store_server:0.25.1
Image ID: docker-pullable://gcr.io/tfx-oss-public/ml_metadata_store_server@sha256:01691247116fe048e0761ae8033efaad3ddd82438d0198f2235afb37c1757d48
Port: 8080/TCP
Host Port: 0/TCP
Command:
/bin/metadata_store_server
Args:
--grpc_port=8080
--mysql_config_database=$(MYSQL_DATABASE)
--mysql_config_host=$(MYSQL_HOST)
--mysql_config_port=$(MYSQL_PORT)
--mysql_config_user=$(DBCONFIG_USER)
--mysql_config_password=$(DBCONFIG_PASSWORD)
--enable_database_upgrade=true
State: Running
Started: Thu, 05 Aug 2021 20:36:26 -0400
Last State: Terminated
Reason: Error
Exit Code: 139
Started: Thu, 05 Aug 2021 19:36:25 -0400
Finished: Thu, 05 Aug 2021 20:36:25 -0400
Ready: True
Restart Count: 532
Liveness: tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
Readiness: tcp-socket :grpc-api delay=3s timeout=2s period=5s #success=1 #failure=3
Environment:
DBCONFIG_USER: <set to the key 'username' in secret 'mysql-secret'> Optional: false
DBCONFIG_PASSWORD: <set to the key 'password' in secret 'mysql-secret'> Optional: false
MYSQL_DATABASE: <set to the key 'mlmdDb' of config map 'pipeline-install-config'> Optional: false
MYSQL_HOST: <set to the key 'dbHost' of config map 'pipeline-install-config'> Optional: false
MYSQL_PORT: <set to the key 'dbPort' of config map 'pipeline-install-config'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
istio-proxy:
Container ID: docker://82b1c17880c0ec145d97cd26ca16526552122e7362e3fe4a0956555ebd63635b
Image: docker.io/istio/proxyv2:1.9.0
Image ID: docker-pullable://istio/proxyv2@sha256:286b821197d7a9233d1d889119f090cd9a9394468d3a312f66ea24f6e16b2294
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--serviceCluster
metadata-grpc-deployment.kubeflow
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
--concurrency
2
State: Running
Started: Thu, 29 Jul 2021 18:16:00 -0400
Last State: Terminated
Reason: Error
Exit Code: 255
Started: Tue, 20 Jul 2021 21:28:03 -0400
Finished: Thu, 29 Jul 2021 17:51:17 -0400
Ready: True
Restart Count: 1
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 10m
memory: 40Mi
Readiness: http-get http://:15021/healthz/ready delay=1s timeout=3s period=2s #success=1 #failure=30
Environment:
JWT_POLICY: third-party-jwt
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: metadata-grpc-deployment-c8f784fdf-hdvgr (v1:metadata.name)
POD_NAMESPACE: kubeflow (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
CANONICAL_SERVICE: (v1:metadata.labels['service.istio.io/canonical-name'])
CANONICAL_REVISION: (v1:metadata.labels['service.istio.io/canonical-revision'])
PROXY_CONFIG: {}
ISTIO_META_POD_PORTS: [
{"name":"grpc-api","containerPort":8080,"protocol":"TCP"}
]
ISTIO_META_APP_CONTAINERS: container
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_META_WORKLOAD_NAME: metadata-grpc-deployment
ISTIO_META_OWNER: kubernetes://apis/apps/v1/namespaces/kubeflow/deployments/metadata-grpc-deployment
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-4qdbv (ro)
/var/run/secrets/tokens from istio-token (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
limits.cpu -> cpu-limit
requests.cpu -> cpu-request
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-4qdbv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 31m (x350 over 7d2h) kubelet Container image "gcr.io/tfx-oss-public/ml_metadata_store_server:0.25.1" already present on machine
Normal Created 31m (x350 over 7d2h) kubelet Created container container
Normal Started 31m (x350 over 7d2h) kubelet Started container container
Both servers are part of the Istio service mesh. In the past, the mysql process was implicated; that process remains up and stable with 0 restarts.
Both pods reach a Running state and then fail, but their failures are not synchronized.
I should also add that cache-server-* has restarted some 132 times in 15 days.
[mysql] 2021/08/05 17:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 18:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 19:34:16 packets.go:36: unexpected EOF
[mysql] 2021/08/05 20:34:16 packets.go:36: unexpected EOF
F0805 20:34:16.890285 1 error.go:325] driver: bad connection
goroutine 1 [running]:
github.com/golang/glog.stacks(0xc00043e300, 0xc000168000, 0x43, 0x96)
/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:769 +0xb8
github.com/golang/glog.(*loggingT).output(0x28daaa0, 0xc000000003, 0xc000412000, 0x280d034, 0x8, 0x145, 0x0)
/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:720 +0x372
github.com/golang/glog.(*loggingT).printf(0x28daaa0, 0x3, 0x199ee22, 0x2, 0xc0002f38d8, 0x1, 0x1)
/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:655 +0x14b
github.com/golang/glog.Fatalf(...)
/go/pkg/mod/github.com/golang/glog@v0.0.0-20160126235308-23def4e6c14b/glog.go:1148
github.com/kubeflow/pipelines/backend/src/common/util.TerminateIfError(...)
/go/src/github.com/kubeflow/pipelines/backend/src/common/util/error.go:325
main.initMysql(0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, 0x4, ...)
/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:157 +0x4f1
main.initDBClient(0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, 0x4, ...)
/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:71 +0x681
main.(*ClientManager).init(0xc0002f3e98, 0x7fffe3a0c170, 0x5, 0x7fffe3a0c180, 0x5, 0x7fffe3a0c190, 0x4, 0x7fffe3a0c19f, 0x7, 0x7fffe3a0c1b1, ...)
/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:57 +0x80
main.NewClientManager(...)
/go/src/github.com/kubeflow/pipelines/backend/src/cache/client_manager.go:169
main.main()
/go/src/github.com/kubeflow/pipelines/backend/src/cache/main.go:77 +0x66e
Maybe this is an issue with istio and service mesh configuration on bare metal.
In past incarnations of this bug, there was a repair offered up to establish PeerAuthentication - this resource does not exist in my cluster. The old suggestion was to apply:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: "default"
spec:
  mtls:
    mode: STRICT
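For reference, one way to apply the PeerAuthentication above would be a sketch like the following. Placing it in istio-system (the Istio root namespace) makes it mesh-wide, which is an assumption about the desired scope; apply it to a workload namespace instead to scope it narrower.

```shell
# Apply the PeerAuthentication mesh-wide by creating it in the Istio root
# namespace; change -n to a workload namespace to limit its scope.
kubectl apply -n istio-system -f - <<'EOF'
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
spec:
  mtls:
    mode: STRICT
EOF
```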
I do not know if this is still applicable in the Istio 1.9.0 world.
Any progress or ideas or things to try? KFP probably represents one of the most beneficial parts of Kubeflow 1.3.0 to us. As of now all I have is Jupyter Notebooks with a Kubeflow Dashboard wrapper. I can't use KFP at all.
It's been 17 days and I have not heard any movement on this bug. I really want to get this resolved. I have switched my Istio service mesh to require mTLS (STRICT) everywhere. I have verified that the processes listed here, as well as the mysql process, are all within the service mesh - as evidenced by istio-proxy and istio-init being injected into the three pods.
This really does appear to be a problem with access to the mysql store.
My next experiment will be to stand up an Ubuntu container with a mysql client in the kubeflow namespace. From there I hope to be able to validate connectivity.
I can see two outcomes:
1) Connectivity works
2) Connectivity fails
In both cases the next step is still: what do I do given this additional information?
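A connectivity check along those lines could look like this sketch. The service host mysql.kubeflow.svc.cluster.local, the port, and the root user are assumptions based on the default KFP manifests; adjust them to match your deployment.

```shell
# Run a throwaway MySQL client pod in the kubeflow namespace and attempt an
# interactive login against the pipeline database service. The pod is
# deleted automatically when the session ends (--rm).
kubectl run mysql-client --rm -it --restart=Never -n kubeflow \
  --image=mysql:5.7 -- \
  mysql -h mysql.kubeflow.svc.cluster.local -P 3306 -u root -p
```

If the login prompt appears and authentication succeeds, basic network connectivity and mTLS routing to MySQL are working from inside the namespace.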
As an FYI, the mysql process' istio-proxy shows the following in the logs:
2021-08-26T15:15:46.314059Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T15:44:53.280933Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-08-26T15:44:53.515221Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T16:12:54.444630Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-08-26T16:12:54.806062Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
2021-08-26T16:45:50.518921Z warning envoy config StreamAggregatedResources gRPC config stream closed: 0,
2021-08-26T16:45:50.786400Z info xdsproxy connected to upstream XDS server: istiod.istio-system.svc:15012
UPDATE:
I was able to get to partial success.
What I did was to edit KubeFlow/v1.3.0/manifests/apps/pipeline/upstream/third-party/mysql/base/mysql-deployment.yaml and add:
spec:
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
(NOTE: I also changed to use image: mysql:8 - but I don't think this is the issue.)
And then I undeployed and redeployed KFP - I know I could have just applied the changes.
$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl delete -f -
$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user | kubectl apply -f -
The downside is that metadata-grpc-deployment and metadata-writer now fail.
for metadata-grpc-deployment, the log reads:
2021-08-26 21:43:31.894833: F ml_metadata/metadata_store/metadata_store_server_main.cc:220] Non-OK-status: status status: Internal: mysql_real_connect failed: errno: 2059, error: Plugin caching_sha2_password could not be loaded: lib/mariadb/plugin/caching_sha2_password.so: cannot open shared object file: No such file or directoryMetadataStore cannot be created with the given connection config.
for metadata-writer, the logs read:
Failed to access the Metadata store. Exception: "no healthy upstream"
Next I tried to use "platform-agnostic-multi-user-legacy"...
$ kustomize build apps/pipeline/upstream/env/platform-agnostic-multi-user-legacy | kubectl apply -f -
And all processes are now running - except this now shows up:
Again: Any suggestions and assistance is highly appreciated.
I can confirm that this issue is seen with KF 1.3.1 too. The bug is very annoying as KFP remains inaccessible.
Ohh, sorry this fell through the cracks. Let me take a look tomorrow.
/assign @zijianjoy This looks like an Istio compatibility issue.
When you disable sidecar injection, also find all the destination rules and delete the destination rule for mysql. Otherwise, all other clients will fail to access MySQL assuming mTLS is turned on.
Edit: this is a workaround by pulling MySQL out of the mesh.
If you want MySQL in the mesh, you need to check Istio documentation for troubleshooting instructions. I agree Istio is very hard to troubleshoot, got the same frustration when configuring these up.
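The DestinationRule cleanup described above could be sketched like this. The rule name "mysql" and the kubeflow namespace are assumptions; list the rules first to find the one actually targeting your MySQL service.

```shell
# Find all DestinationRules across namespaces, then delete the one that
# targets the MySQL service so mesh clients stop attempting mTLS to the
# now-unmeshed MySQL pod. The name "mysql" here is a placeholder.
kubectl get destinationrules -A
kubectl delete destinationrule mysql -n kubeflow
```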
I had a similar problem with metadata-writer: once the pod of the metadata-grpc-deployment deployment had an Envoy sidecar, metadata-writer stopped crash-looping.
So please check that every component in the kubeflow namespace has the required Envoy sidecar, especially when you have enforced strict mTLS via PeerAuthentication.
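An unofficial quick check for that: print each pod in the namespace with its container names and filter out the ones that already carry the sidecar.

```shell
# List each pod in the kubeflow namespace with its container names, then
# show only the pods that are missing the istio-proxy sidecar.
kubectl get pods -n kubeflow \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].name}{"\n"}{end}' \
  | grep -v istio-proxy
```

Any pod that shows up in the output is outside the mesh and will be rejected by peers enforcing STRICT mTLS.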
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
I also encountered this issue. My env: k8s 1.19.16, kubeflow manifests v1.5.1 tag, kustomize 3.2.0.
install command:
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
metadata-grpc-deployment-6f6f7776c5-pq6hq log:
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0321 09:45:37.213575 1 metadata_store_server_main.cc:236] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error: [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***
mysql-f7b9b7dd4-5m4tb logs:
2023-03-21 09:47:24+00:00 [Note] [Entrypoint]: Entrypoint script for MySQL Server 5.7.33-1debian10 started.
2023-03-21 09:47:27+00:00 [ERROR] [Entrypoint]: mysqld failed while attempting to check config
command was: mysqld --ignore-db-dir=lost+found --datadir /var/lib/mysql --verbose --help --log-bin-index=/tmp/tmp.fbqLrtywDf
metadata-writer-d7ff8d4bc-8thjn logs:
Failed to access the Metadata store. Exception: "no healthy upstream"
Failed to access the Metadata store. Exception: "no healthy upstream"
Failed to access the Metadata store. Exception: "no healthy upstream"
Traceback (most recent call last):
File "/kfp/metadata_writer/metadata_writer.py", line 63, in <module>
mlmd_store = connect_to_mlmd()
File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.
And I tried k8s 1.21.2, still has the same problem.
NOTE: Everything I wrote below is totally not useful to this thread about cloud deployments. I just picked this issue out of the five I was looking at. I seriously thought it was contextually appropriate, but it is not. The information below may be useful to you, but it was essentially put here by accident. I also think it's useful info and I don't want to just delete it. I will move what I've said below into its own issue tomorrow.
I have this problem on some systems and believe it is because of the fs.inotify.max_user_{instances,watches} settings, which can be fixed like this:
sudo sysctl fs.inotify.max_user_instances=1280
sudo sysctl fs.inotify.max_user_watches=655360
I have found that these should be set before deploying the cluster and Kubeflow. If a cluster is already running you can kill the pods that get hung, but in my experience that has been a waste of time compared to just starting over from a clean slate.
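Note that sysctl settings applied on the command line do not survive a reboot. A sketch for persisting them (assuming a distro that reads /etc/sysctl.d; the filename is arbitrary):

```shell
# Persist the inotify limits across reboots, then reload all sysctl config.
cat <<'EOF' | sudo tee /etc/sysctl.d/99-kubeflow-inotify.conf
fs.inotify.max_user_instances = 1280
fs.inotify.max_user_watches = 655360
EOF
sudo sysctl --system
```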
I also had this issue on a machine that was running off of an SSD booted over USB-C and setting the maxuser* didn't fix it. I still don't really "know" why I couldn't get that system to work but basically I think this problem amounts to any situation where the disk throughput gets maxed out.
Kubeflow is a massive beast that does a LOT of stuff and on many machines you will hit limits where it can't function and it usually looks like what you've posted here. I know that's not the most technical solution to an open issue, but the problem is kind of squishy and I think can occur for multiple reasons.
I believe this problem is consistently produced by following the Charmed Kubeflow documentation on a fresh Ubuntu 22.04 or 20.04 system. I would be surprised if anyone can follow the current installation instructions with any success as the range of systems I'm testing on right now is some pretty high end server stuff and a couple consumer sort of desktop machines.
Not that this likely matters, but I am doing GPU stuff, so I've enabled GPU with microk8s. I can easily pass validation and push up a container to run nvidia-smi, so I don't think that would be causing the issue. I mention this because one of my systems had an infoROM issue on one GPU for a while, and if I didn't disable it, containers would fail to launch and the failures would sometimes look a lot like this issue (depending on a lot of factors). If anyone has an infoROM issue on a GPU: when you search for that, Nvidia's official folks tell you it's unserviceable, which is untrue. You just have to use nvflash, a Windows tool, and you can't download it directly from Nvidia for some reason; it's only available from some websites that IMO look pretty sketchy. Use at your own risk.
Since I do have a running microk8s 1.7 kubeflow setup, I think I should be able to solve this and commit back but I'm pretty certain the current Charmed Kubeflow setup instructions are very broken.
The numbers I used above were not sufficient to solve the "too many open files" problem. I just ripped them from this issue, and at one point a couple months ago they did work for me. I would be very interested to know what changed in the packages that made the instances/watches requirements increase, but I'm very much too busy to formally care. Here's the issue I got these numbers from:
https://github.com/kubeflow/manifests/issues/2087
I have no idea how to calculate what to raise the numbers to or the consequences of raising the limit too high. I just upped the largest digit by one.
sudo sysctl fs.inotify.max_user_instances=2280
sudo sysctl fs.inotify.max_user_watches=755360
And I totally forgot about this other issue I've had. GVFS backends! There is a client that runs on a default Ubuntu Desktop install for each user that's logged in. It is monitoring the disk and when you fire up Kubeflow it flips out and takes up 120-200% of the cpu for each user. I do not need this tool for my deployments. I'm unsure if this is a problem with a barebones server install but this is critical to solving the problem of launching kubeflow on a fresh installed Ubuntu Desktop 20.04/22.04 system.
sudo apt-get remove gvfs-backends -y
Hooray. I think this solves a reproducibility problem I've had for over a month. I haven't quite figured out if there are licensing issues with distributing this, but I've built an ubuntu LiveCD with CUDA, docker, and kubeflow (microk8s). I'll ping folks on slack and see if there's any interest in it. I've got a solid pipeline for doing the CI/CD for releases and man, the little things to get going are really a big barrier.
It is very possible that the maxuser* values don't need to be raised so high if gvfs-backends is removed from the start, but I will not be trying to figure that out explicitly in the near term.
Also, tensorboard-controller was failing because the ingress wasn't coming up correctly. That has been covered in a ton of issues.
https://discourse.charmhub.io/t/charmed-kubeflow-upgrade-error/7007/7
juju run --unit istio-pilot/0 -- "export JUJU_DISPATCH_PATH=hooks/config-changed; ./dispatch"
Solves that. I've tested this on multiple machines now and raising the maxuser* limits, uninstalling gvfs-backends, and fixing the ingress with the above command solves all of the problems consistently. I'm working on a Zero to Kubeflow tutorial, but I'll submit a PR for the Charmed Kubeflow instructions that covers these things if someone can point me at where to submit it.
I am realizing after some review that in this situation, and in the other issues I've read relating to a similar failure, most people are running in the cloud and not on bare metal. I do think the gist of what I've pointed out is still valid, but on this thread what I've posted is just not directly useful. It's squishy. These problems are usually related to disk throughput, but that gets weird to sort out in the cloud. Anyway... all of what I said above has nothing to do with this issue, I am realizing. Sorry for posting all this in the wrong place. I don't know where else to put all this info, so I'm going to leave it here for now.
I'll collect it and put it into its own issue tomorrow and remove it from this thread. Sorry if this confused anyone. It's been a long day.
If you are using an external MySQL database (especially if it's MySQL 8.0), you are likely experiencing this issue around support for caching_sha2_password authentication; see here for more:
FYI, Kubeflow Pipelines itself fixed MySQL 8.0 and caching_sha2_password support in 2.0.0-alpha.7 (ships with Kubeflow 1.7), but there is still an issue with the upstream ml-metadata project:
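Until that is resolved, one commonly used workaround (my assumption, not something confirmed in this thread for every component) is to switch the affected database account back to the older authentication plugin on the external MySQL 8.0 server. The user name, host pattern, and password below are placeholders.

```shell
# Run against the external MySQL 8.0 server; replace <mysql-host>, the
# 'kubeflow' account, and the password with your real values. This makes
# the account authenticate with mysql_native_password, which ml-metadata's
# MySQL client library can handle.
mysql -h <mysql-host> -u root -p -e \
  "ALTER USER 'kubeflow'@'%' IDENTIFIED WITH mysql_native_password BY 'your-password';"
```

Alternatively, setting default_authentication_plugin = mysql_native_password under [mysqld] in the server's my.cnf makes this the default for newly created accounts.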
I had the same problem. For me it was a cilium networking provider compatibility issue. I had to move to kubenet and it worked.
Can you elaborate the compatibility issue please? As I'm using cilium as well.
I had the same problem. For me it was a cilium networking provider compatibility issue. I had to move to kubenet and it worked.
You may have enabled Cilium's kubeProxyReplacement feature; you can disable that feature or set --config bpf-lb-sock-hostns-only=true. Ref: https://docs.cilium.io/en/latest/network/servicemesh/istio/#cilium-configuration
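A sketch of applying that flag when Cilium is configured through its ConfigMap (the install method and namespace are assumptions; Helm-based installs would set the equivalent value through Helm instead):

```shell
# Set the flag from the Cilium Istio guide in the cilium-config ConfigMap,
# then restart the agents so the change takes effect.
kubectl -n kube-system patch configmap cilium-config \
  --type merge -p '{"data":{"bpf-lb-sock-hostns-only":"true"}}'
kubectl -n kube-system rollout restart daemonset/cilium
```

This keeps Cilium's socket-level load balancing out of pod network namespaces so Istio's sidecar traffic redirection can work.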
This issue looks like it covers more than one problem and resolution, so it's better to close it; if there are still pending issues, let's open a new one to have a cleaner discussion thread.
/close
@rimolive: Closing this issue.
What steps did you take
I deployed Kubeflow 1.3 by using the manifests approach. I then repaired an issue with dex running on K8s v1.21
What happened:
The installation succeeded. All processes started up except for two: both metadata-writer and ml-pipeline crash constantly and are restarted. ml-pipeline always reports 1 of 2 containers running. metadata-writer sometimes appears to be fully running, then fails. No other Kubeflow pods are having problems like this - even the mysql pod seems stable. I can only assume the failure of metadata-writer is due to a continued failure in the ml-pipeline api-server.
The pod keeps getting terminated by something with a reason code of 137. See last image provided for details on the cycle time.
What did you expect to happen:
I expect the pipeline tools to install and operate normally. This has been a consistent problem going back to KF 1.1 with no adequate resolution.
Environment:
I use the Kubeflow 1.3 manifests deployment approach.
This install is via the Kubeflow 1.3 manifests.
NOT APPLICABLE
Anything else you would like to add:
kubectl logs ml-pipeline-9b68d49cb-x67mp ml-pipeline-api-server
kubectl describe pod ml-pipeline-9b68d49cb-x67mp
Metadata-Writer Logs:
Labels
/area backend
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.