kserve / modelmesh-serving

Controller for ModelMesh
Apache License 2.0

etcd fails to start on version 0.9.0 on OpenShift #210

Closed crobby closed 1 year ago

crobby commented 2 years ago

The etcd deployment/pod fails to start on OpenShift

{"level":"warn","ts":"2022-08-11T14:03:05.108Z","caller":"etcdmain/etcd.go:146","msg":"failed to start etcd","error":"cannot access data directory: mkdir default.etcd: permission denied"}
{"level":"fatal","ts":"2022-08-11T14:03:05.108Z","caller":"etcdmain/etcd.go:204","msg":"discovery failed","error":"cannot access data directory: mkdir default.etcd: permission denied","stacktrace":"go.etcd.io/etcd/server/v3/etcdmain.startEtcdOrProxyV2\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/etcd.go:204\ngo.etcd.io/etcd/server/v3/etcdmain.Main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/etcdmain/main.go:40\nmain.main\n\t/go/src/go.etcd.io/etcd/release/etcd/server/main.go:32\nruntime.main\n\t/go/gos/go1.16.15/src/runtime/proc.go:225"}

To Reproduce

Steps to reproduce the behavior:

  1. Attempt to install v0.9.0 on OpenShift

Expected behavior

All pods come up without errors.

Environment (please complete the following information):

Client Version: 4.11.0-202207191902.p0.g7075089.assembly.stream-7075089
Kustomize Version: v4.5.4
Server Version: 4.11.0-rc.6
Kubernetes Version: v1.24.0+9546431

Additional context

I'm guessing this is likely due to OpenShift pods being run as a user other than root, but I'm not sure why the old version 0.8.0, which just ran a pod instead of a deployment, did not have this problem.

pvaneck commented 2 years ago

Another change was https://github.com/kserve/modelmesh-serving/pull/151, where the quickstart etcd version was updated to v3.5.4 from the latest tag, which actually corresponds to quite an old version (~v3.3.8). If you swap to an older version, or even just the latest tag again, does the deployment come up properly? I am wondering if some change in the newer version of the etcd image caused this.

crobby commented 2 years ago

I do see the same problem when using the latest tag of the etcd image. It looks like the non-deployment (naked pod) version in 0.8.0 was running as root, but when running as part of a deployment, it runs as non-root. When I look in the container, it seems to be trying to create the data directory in /, which will always fail for non-root.

njhill commented 2 years ago

@crobby you could try adding workingDir: $HOME to the container spec. Alternatively you could add a --data-dir $HOME/etcd cmd line flag.

crobby commented 2 years ago

That doesn't seem to do the trick. At runtime, it is running as the (randomized) user: 1000670000

Here is the entry from /etc/passwd for that user inside the container:

1000670000:x:1000670000:0:1000670000 user:/:/sbin/nologin

Looks like / is the $HOME, which doesn't seem to have much of a chance of being writable by anything other than root.

njhill commented 2 years ago

OK how about adding

- --data-dir
- /tmp/etcd.data

to the container args? The standalone container is only intended for dev/temporary use anyhow.

crobby commented 2 years ago

I will give that a try on Monday, thanks. What sort of setup would you recommend for production use?

crobby commented 2 years ago

Using /tmp/"whatever" as the data directory does appear to work. Thanks.
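
For readers applying the same workaround, the change discussed above can be sketched roughly as the Deployment fragment below. This is only an illustration, not the project's actual quickstart manifest: the resource name, labels, image location, and listen/advertise URLs are assumptions. Only the `--data-dir /tmp/etcd.data` arguments and the v3.5.4 version come from this thread.

```yaml
# Sketch only: name, labels, image repository, and URLs are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd                      # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd                   # hypothetical label
  template:
    metadata:
      labels:
        app: etcd
    spec:
      containers:
        - name: etcd
          # Assumed image location; the thread only mentions the v3.5.4 version.
          image: quay.io/coreos/etcd:v3.5.4
          command:
            - etcd
          args:
            - --listen-client-urls
            - http://0.0.0.0:2379
            - --advertise-client-urls
            - http://0.0.0.0:2379
            # Store data under /tmp, which the randomized non-root UID
            # assigned by OpenShift was reported to be able to write to.
            - --data-dir
            - /tmp/etcd.data
          ports:
            - containerPort: 2379
```

Because the data lives under /tmp, it is lost whenever the pod is replaced, which fits the dev/temporary use described earlier in the thread.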

njhill commented 2 years ago

What sort of setup would you recommend for production use?

@crobby a small multi-member etcd cluster with TLS configured. I'm not sure whether there's an OpenShift operator for this apart from the one managing the Kube-backing etcd. There is a public operator, https://github.com/improbable-eng/etcd-cluster-operator, but it does not appear to be actively maintained. I think there are also example helm charts out there that could be used.

Note that the system will recover fine if data in etcd is lost so persistence isn't critical. The recommendation here is for stability/scalability/security.
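
As a rough illustration of the recommendation above, one member of a three-member, TLS-enabled cluster might be configured with an etcd configuration file along the following lines, passed to etcd via `--config-file`. The member names, hostnames, cluster token, and certificate paths are all hypothetical placeholders; this is a sketch under those assumptions, not a vetted production setup.

```yaml
# etcd config file for member "etcd-0" of a hypothetical three-member cluster.
# All hostnames, the token, and the certificate paths are placeholders.
name: etcd-0
data-dir: /var/lib/etcd
listen-client-urls: https://0.0.0.0:2379
advertise-client-urls: https://etcd-0.etcd.example.svc:2379
listen-peer-urls: https://0.0.0.0:2380
initial-advertise-peer-urls: https://etcd-0.etcd.example.svc:2380
initial-cluster: etcd-0=https://etcd-0.etcd.example.svc:2380,etcd-1=https://etcd-1.etcd.example.svc:2380,etcd-2=https://etcd-2.etcd.example.svc:2380
initial-cluster-state: new
initial-cluster-token: example-etcd-cluster
# TLS for client connections to this member.
client-transport-security:
  cert-file: /etc/etcd/tls/server.crt
  key-file: /etc/etcd/tls/server.key
  trusted-ca-file: /etc/etcd/tls/ca.crt
  client-cert-auth: true
# TLS between cluster members.
peer-transport-security:
  cert-file: /etc/etcd/tls/peer.crt
  key-file: /etc/etcd/tls/peer.key
  trusted-ca-file: /etc/etcd/tls/ca.crt
  client-cert-auth: true
```

The other two members would use the same file with their own `name` and advertise URLs.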

njhill commented 1 year ago

I'm not sure why I didn't think of this originally, but @deleeuwblue suggested we should just change this in our default quickstart manifest for etcd. Let's do that :)

ckadner commented 1 year ago

The suggestion is implemented in #321.