rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Upgrading rke2 cluster (single node) fails #3328

Closed · echo-devnull closed this issue 2 years ago

echo-devnull commented 2 years ago

While trying to upgrade from v1.23.9+rke2r1 to v1.24.3+rke2r1, I used the automated upgrade procedure described at https://docs.rke2.io/upgrade/automated_upgrade/.
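The plan followed the documented example; here is a rough sketch of what I applied (the plan name and node selector are illustrative rather than copied verbatim from my manifest):

kubectl apply -f - <<'EOF'
# Sketch of a system-upgrade-controller Plan for the server node,
# modeled on the automated-upgrade docs; values here are illustrative.
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: rke2-server-upgrade
  namespace: system-upgrade
spec:
  concurrency: 1
  cordon: true
  nodeSelector:
    matchExpressions:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
  serviceAccountName: system-upgrade
  upgrade:
    image: rancher/rke2-upgrade
  version: v1.24.3+rke2r1
EOF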

But after applying the plan, the upgrade got stuck restarting the server. Logs: https://pastebin.com/raw/Ankj2hdt

{"level":"warn","ts":"2022-09-11T11:59:57.717Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc000cdb340/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}

Environmental Info: RKE2 Version: from v1.23.9+rke2r1 to v1.24.3+rke2r1

Node(s) CPU architecture, OS, and Version: Linux excelsior 5.10.0-18-amd64 #1 SMP Debian 5.10.140-1 (2022-09-02) x86_64 GNU/Linux

Cluster Configuration: Single node server

Describe the bug: Upgrading seems to fail

Expected behavior: Using the automated upgrade path, I expected the server to come back up cleanly after restarting.

Actual behavior: After an rke2-server restart, the service does not actually start up.

My guess is that this is because of the single-node nature of my setup. Is it trying to reach other etcd nodes to join?

brandond commented 2 years ago

It looks like etcd is failing to start. Can you attach the rke2-server logs from journald, the etcd pod logs from /var/log/pods/kube-system_etcd-*/*, and the output of CONTAINER_RUNTIME_SOCKET=/var/run/k3s/containerd/containerd.sock /var/lib/rancher/rke2/bin/crictl ps?

echo-devnull commented 2 years ago

Good morning! Thanks for reading and looking into this with me.

Output of the command:

root@excelsior /var/log/pods # CONTAINER_RUNTIME_SOCKET=/var/run/k3s/containerd/containerd.sock /var/lib/rancher/rke2/bin/crictl ps
WARN[0000] runtime connect using default endpoints: [unix:///var/run/dockershim.sock unix:///run/containerd/containerd.sock unix:///run/crio/crio.sock unix:///var/run/cri-dockerd.sock]. As the default settings are now deprecated, you should set the endpoint instead. 
ERRO[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/dockershim.sock: connect: no such file or directory" 
ERRO[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/containerd/containerd.sock: connect: no such file or directory" 
ERRO[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /run/crio/crio.sock: connect: no such file or directory" 
ERRO[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory" 
FATA[0000] unable to determine runtime API version: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial unix /var/run/cri-dockerd.sock: connect: no such file or directory"

Those pod logs do not seem to exist (the directories are there, but they are empty):

root@excelsior /var/log/pods # ls -ltrR kube-system_etcd-excelsior_*
kube-system_etcd-excelsior_f3a30bd4b4f10029e180c53180190a88:
total 4
drwxr-xr-x 2 root root 4096 Sep 11 08:08 etcd

kube-system_etcd-excelsior_f3a30bd4b4f10029e180c53180190a88/etcd:
total 0

kube-system_etcd-excelsior_add8d6fd3b8f7f02d9525a5bfd28943d:
total 4
drwxr-xr-x 2 root root 4096 Sep 11 11:59 etcd

kube-system_etcd-excelsior_add8d6fd3b8f7f02d9525a5bfd28943d/etcd:
total 0

And the logs are here: https://nextcloud.maas-martin.nl/s/QqzPgaWDoy2GQGH

brandond commented 2 years ago

Sorry, I was typing that command from memory - try CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/k3s/containerd/containerd.sock /var/lib/rancher/rke2/bin/crictl ps

Can you also grab the logs at /var/lib/rancher/rke2/agent/containerd/containerd.log and /var/lib/rancher/rke2/agent/logs/kubelet.log ?
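In one go, that is roughly the following (the log paths are the ones above; the journald unit name assumes the stock rke2-server service):

# rke2-server service logs from journald
journalctl -u rke2-server --no-pager > rke2-server-journal.log

# running containers via RKE2's bundled containerd
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/k3s/containerd/containerd.sock \
  /var/lib/rancher/rke2/bin/crictl ps

# containerd and kubelet logs written by the rke2 agent
cat /var/lib/rancher/rke2/agent/containerd/containerd.log
cat /var/lib/rancher/rke2/agent/logs/kubelet.log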

echo-devnull commented 2 years ago

I was wondering about that command ;-) Silly of me not to realize what you were actually asking ;-)

root@excelsior ~ # CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/k3s/containerd/containerd.sock /var/lib/rancher/rke2/bin/crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID              POD
788b20c6acc62       2d41fbfc20342       43 hours ago        Running             kube-proxy          9                   1ae71c7500b80       kube-proxy-excelsior

And the requested logs: https://nextcloud.maas-martin.nl/s/dZ8GnZimoci8sM4

brandond commented 2 years ago

From kubelet.log: E0913 07:12:52.147100 169039 remote_runtime.go:421] "CreateContainer in sandbox from runtime service failed" err="rpc error: code = Unknown desc = failed to create containerd container: get apparmor_parser version: exec: \"apparmor_parser\": executable file not found in $PATH" podSandboxID="c9b5003de37a3913f20fde2f738524fa6474fc8881e7dfffa6acf172bdee78e2"

This appears to be a duplicate of https://github.com/rancher/rke2/issues/1806 - you need to install the package that provides apparmor_parser, which is required by newer releases of containerd when AppArmor is enabled.
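A quick way to confirm you are in the same situation (a sketch; the sysfs path is the standard kernel AppArmor flag):

# Y means AppArmor is enabled in the running kernel
cat /sys/module/apparmor/parameters/enabled

# containerd shells out to apparmor_parser; this should print a path once the package is installed
command -v apparmor_parser || echo "apparmor_parser not found in PATH"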

echo-devnull commented 2 years ago

Goodmorning!

Welp, that indeed fixed the issue! Thank you so much! Did I miss that in the documentation? This was/is a clean Debian 11 install, and it did not come with that package by default.

Solved by:

sudo apt install apparmor
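
For completeness, a restart plus a quick check along these lines confirms the node comes back up (same crictl invocation as above):

sudo systemctl restart rke2-server

# etcd and the other control-plane containers should show up again
CONTAINER_RUNTIME_ENDPOINT=unix:///var/run/k3s/containerd/containerd.sock \
  /var/lib/rancher/rke2/bin/crictl ps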

Thanks!

brandond commented 2 years ago

We did add it to the docs a while back: https://docs.rke2.io/install/quickstart/#prerequisites