rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Kine is not working on RHEL based OS (RPM based installs and SELinux is enabled) #5924

Open mdrahman-suse opened 4 months ago

mdrahman-suse commented 4 months ago

Environmental Info:

RKE2 Version:

rke2 version v1.30.0+rke2r1 (60e06c4dbccff996f717af8f4c532971f57264b4)
go version go1.22.2 X:boringcrypto

Also with v1.29.4+rke2r1

Node(s) CPU architecture, OS, and Version:

cat /etc/os-release
NAME="Red Hat Enterprise Linux"
VERSION="8.7 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.7"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.7 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.7
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.7"
[ec2-user@ip-172-31-3-155 ~]$ uname -a
Linux  4.18.0-425.3.1.el8.x86_64 #1 SMP Fri Sep 30 11:45:06 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server, 1 external db

Describe the bug:

The rke2-server service fails to start when a datastore-endpoint is added to its configuration on a RHEL-based OS installed via the default RPM method.

Steps To Reproduce:

Expected behavior:

Cluster comes up successfully

Actual behavior:

rke2-server fails to start and logs the errors shown under "Additional context / logs".

Additional context / logs:

May 17 21:35:36 server1 rke2[18663]: time="2024-05-17T21:35:36Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/cloud-controller-manager-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/cloud-controller-manager-image.txt: image \"index.docker.io/rancher/rke2-cloud-provider:v1.29.3-build20240412\": not found"
May 17 21:35:51 server1 rke2[18663]: time="2024-05-17T21:35:51Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/kube-apiserver-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/kube-apiserver-image.txt: image \"index.docker.io/rancher/hardened-kubernetes:v1.30.0-rke2r1-build20240506\": not found"
May 17 21:35:58 server1 rke2[18663]: time="2024-05-17T21:35:58Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image \"index.docker.io/rancher/rke2-runtime:v1.30.0-rke2r1\": not found"
May 17 21:50:17 server1 rke2[18663]: time="2024-05-17T21:50:17Z" level=error msg="Failed to save TLS secret after controller init: timed out waiting for the condition"
May 17 21:50:24 server1 rke2[19581]: time="2024-05-17T21:50:24Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
May 17 22:05:31 server1 rke2[20423]: time="2024-05-17T22:05:31Z" level=info msg="Waiting for cri connection: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial unix /run/k3s/containerd/containerd.sock: connect: connection refused\""
mdrahman-suse commented 3 months ago

It looks like the SELinux policies are not compatible with Kine; thanks @vitorsavian for identifying the root cause. Once I disabled SELinux on the RHEL host, I was able to create a cluster:

$ sestatus
SELinux status:                 disabled

$ rke2 -v
rke2 version v1.30.0+rke2r1 (60e06c4dbccff996f717af8f4c532971f57264b4)
go version go1.22.2 X:boringcrypto

$ kga
NAME              STATUS   ROLES                  AGE     VERSION          INTERNAL-IP   EXTERNAL-IP    OS-IMAGE                               KERNEL-VERSION              CONTAINER-RUNTIME
node/server1      Ready    control-plane,master   7m31s   v1.3x.0+rke2r1   xxx.xx.x.15   x.xxx.x.190    Red Hat Enterprise Linux 8.7 (Ootpa)   4.18.0-425.3.1.el8.x86_64   containerd://1.7.11-k3s2

NAMESPACE     NAME                                                                     READY   STATUS      RESTARTS   AGE     IP            NODE      NOMINATED NODE   READINESS GATES
kube-system   pod/kube-scheduler-server1                                               1/1     Running     0          7m29s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/kube-apiserver-server1                                               1/1     Running     0          7m27s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/kube-controller-manager-server1                                      1/1     Running     0          7m29s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/cloud-controller-manager-server1                                     1/1     Running     0          7m27s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/kube-proxy-server1                                                   1/1     Running     0          7m22s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/helm-install-rke2-coredns-q5rrx                                      0/1     Completed   0          7m12s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/helm-install-rke2-canal-mglvd                                        0/1     Completed   0          7m12s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/rke2-canal-knvld                                                     2/2     Running     0          6m54s   xxx.xx.x.15   server1   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-crd-msml2                      0/1     Completed   0          7m12s   xx.xx.x.2     server1   <none>           <none>
kube-system   pod/helm-install-rke2-metrics-server-84566                               0/1     Completed   0          7m12s   xx.xx.x.3     server1   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-5749cd7b8b-r58f9                1/1     Running     0          6m55s   xx.xx.x.5     server1   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-mfncj                          0/1     Completed   0          7m12s   xx.xx.x.7     server1   <none>           <none>
kube-system   pod/rke2-snapshot-controller-7dcf5d5b46-992x7                            1/1     Running     0          5m58s   xx.xx.x.10    server1   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-ngpcq                  0/1     Completed   0          7m12s   xx.xx.x.8     server1   <none>           <none>
kube-system   pod/rke2-snapshot-validation-webhook-bf7bbd6fc-k52zs                     1/1     Running     0          5m55s   xx.xx.x.11    server1   <none>           <none>
kube-system   pod/rke2-metrics-server-868fc8795f-49h2q                                 1/1     Running     0          6m4s    xx.xx.x.9     server1   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-64dcf4f58b-v9m7d                           1/1     Running     0          6m55s   xx.xx.x.4     server1   <none>           <none>
kube-system   pod/helm-install-rke2-ingress-nginx-f6fkx                                0/1     Completed   0          7m12s   xx.xx.x.6     server1   <none>           <none>
kube-system   pod/rke2-ingress-nginx-controller-pd6g6                                  1/1     Running     0          5m38s   xx.xx.x.13    server1   <none>           <none>

I still think this is an issue, since RHEL-based OSes have SELinux enabled by default.
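
For anyone who wants to narrow this down without disabling SELinux entirely, a rough sketch of how the denials could be collected follows. It assumes auditd is running and writing to its usual log location; the helper name `avc_denials_for` is made up for illustration and is not part of RKE2 or its tooling.

```shell
# Sketch only: list SELinux AVC denials logged for a given process name.
# The helper name and default log path are assumptions, not RKE2 tooling.
# $1 = process name as it appears in comm="..." (e.g. rke2)
# $2 = audit log to read (defaults to the usual auditd location)
avc_denials_for() {
    comm="$1"
    log="${2:-/var/log/audit/audit.log}"
    # auditd records denials as 'avc:  denied  { ... }'; tolerate varying spaces
    grep -E 'avc: +denied' "$log" | grep -F "comm=\"$comm\""
}

# Example, from a root shell (the audit log is not world-readable):
#   setenforce 0                      # permissive until reboot: denials are logged, not enforced
#   systemctl restart rke2-server
#   avc_denials_for rke2
```

Running in permissive mode rather than fully disabled should keep the denials visible in the audit log, which is probably what a policy fix would need as input.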

brandond commented 3 months ago

Sounds like we'll need changes to rke2-selinux?
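
Before touching the policy it may be worth confirming which policy packages are actually installed on the affected node. A minimal sketch, assuming the RPM install is expected to pull in `rke2-selinux` and `container-selinux` (the second package name is my assumption about the dependency, and the function name is hypothetical):

```shell
# Sketch only: report whether the SELinux policy packages are installed.
# Package list is an assumption; adjust to whatever the RPM install requires.
selinux_pkg_report() {
    if ! command -v rpm >/dev/null 2>&1; then
        echo "rpm not found; this does not look like an RPM-based host"
        return 0
    fi
    for pkg in rke2-selinux container-selinux; do
        # rpm -q prints name-version-release if installed,
        # or "package <name> is not installed" otherwise
        rpm -q "$pkg" || true
    done
}

selinux_pkg_report
```

The installed versions would be useful to attach to this issue alongside any AVC denials.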