rancher / rke2-selinux

RKE2 selinux + RPM packaging for selinux
Apache License 2.0

[BUG] upstream container-selinux change causing no start #36

Closed: clemenko closed this issue 1 year ago

clemenko commented 1 year ago

An update to container-selinux (https://rockylinux.pkgs.org/9/rockylinux-appstream-x86_64/container-selinux-2.205.0-1.el9_2.noarch.rpm.html) appears to be causing etcd to not start on a new Rocky Linux box. There is currently no way to tell the installer to load an older version of container-selinux for testing.

Installed with:

[root@flux ~]# curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.24 sh -
[INFO]  using stable RPM repositories
[INFO]  using 1.24 series from channel stable
DigitalOcean Droplet Agent                                                                                                                           20 kB/s | 3.3 kB     00:00    
Rancher RKE2 Common (v1.24)                                                                                                                         3.7 kB/s | 1.8 kB     00:00    
Rancher RKE2 1.24 (v1.24)                                                                                                                           9.9 kB/s | 6.3 kB     00:00    
Dependencies resolved.
====================================================================================================================================================================================
 Package                                    Architecture                    Version                                       Repository                                           Size
====================================================================================================================================================================================
Installing:
 rke2-server                                x86_64                          1.24.13~rke2r1-0.el8                          rancher-rke2-1.24-stable                            8.8 k
Installing dependencies:
 container-selinux                          noarch                          3:2.205.0-1.el9_2                             appstream                                            50 k
 rke2-common                                x86_64                          1.24.13~rke2r1-0.el8                          rancher-rke2-1.24-stable                             19 M
 rke2-selinux                               noarch                          0.11-1.el8                                    rancher-rke2-common-stable                           21 k

Transaction Summary
====================================================================================================================================================================================
Install  4 Packages

Total download size: 19 M
Installed size: 76 M
Downloading Packages:
(1/4): rke2-server-1.24.13~rke2r1-0.el8.x86_64.rpm                                                                                                   17 kB/s | 8.8 kB     00:00    
(2/4): rke2-selinux-0.11-1.el8.noarch.rpm                                                                                                            37 kB/s |  21 kB     00:00    
(3/4): container-selinux-2.205.0-1.el9_2.noarch.rpm                                                                                                 662 kB/s |  50 kB     00:00    
(4/4): rke2-common-1.24.13~rke2r1-0.el8.x86_64.rpm                                                                                                   11 MB/s |  19 MB     00:01    
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Total                                                                                                                                                10 MB/s |  19 MB     00:01     
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                                                            1/1 
  Running scriptlet: container-selinux-3:2.205.0-1.el9_2.noarch                                                                                                                 1/4 
  Installing       : container-selinux-3:2.205.0-1.el9_2.noarch                                                                                                                 1/4 
  Running scriptlet: container-selinux-3:2.205.0-1.el9_2.noarch                                                                                                                 1/4 
  Running scriptlet: rke2-selinux-0.11-1.el8.noarch                                                                                                                             2/4 
  Installing       : rke2-selinux-0.11-1.el8.noarch                                                                                                                             2/4 
  Running scriptlet: rke2-selinux-0.11-1.el8.noarch                                                                                                                             2/4 
Conflicting name type transition rules
Binary policy creation failed at /var/lib/selinux/targeted/tmp/modules/400/rke2/cil:324
Failed to generate binary
semodule:  Failed!

  Installing       : rke2-common-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    3/4 
  Installing       : rke2-server-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    4/4 
  Running scriptlet: rke2-server-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    4/4 
  Running scriptlet: container-selinux-3:2.205.0-1.el9_2.noarch                                                                                                                 4/4 
  Running scriptlet: rke2-selinux-0.11-1.el8.noarch                                                                                                                             4/4 
  Running scriptlet: rke2-server-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    4/4 
  Verifying        : rke2-selinux-0.11-1.el8.noarch                                                                                                                             1/4 
  Verifying        : rke2-common-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    2/4 
  Verifying        : rke2-server-1.24.13~rke2r1-0.el8.x86_64                                                                                                                    3/4 
  Verifying        : container-selinux-3:2.205.0-1.el9_2.noarch                                                                                                                 4/4 

Installed:
  container-selinux-3:2.205.0-1.el9_2.noarch       rke2-common-1.24.13~rke2r1-0.el8.x86_64       rke2-selinux-0.11-1.el8.noarch       rke2-server-1.24.13~rke2r1-0.el8.x86_64      

Complete!

and the error:


May 16 21:15:23 flux rke2[5039]: {"level":"warn","ts":"2023-05-16T21:15:23.399Z","logger":"etcd-client","caller":"v3@v3.5.4-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007ea540/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:2379: connect: connection refused\""}
May 16 21:15:23 flux rke2[5039]: time="2023-05-16T21:15:23Z" level=info msg="Failed to test data store connection: context deadline exceeded"
May 16 21:15:23 flux rke2[5039]: time="2023-05-16T21:15:23Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
May 16 21:15:27 flux rke2[5039]: time="2023-05-16T21:15:27Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
May 16 21:15:28 flux rke2[5039]: time="2023-05-16T21:15:28Z" level=info msg="Tunnel server egress proxy waiting for runtime core to become available"
May 16 21:15:32 flux rke2[5039]: time="2023-05-16T21:15:32Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

It looks like rke2-selinux needs to be updated for the upstream changes.
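
For anyone triaging the same failure: the "semodule: Failed!" in the transcript above means the rke2 policy module never actually loaded, so RKE2 is likely running without its expected labeling. A minimal way to confirm the module state after an install like this (standard SELinux tooling, nothing rke2-specific):

# check whether the rke2 module actually made it into the loaded policy
semodule -l | grep -E 'rke2|container'
# look for recent AVC denials around the failed etcd start
ausearch -m avc -ts recent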

The workaround is to run:

yum install -y http://dl.rockylinux.org/pub/rocky/9.1/AppStream/x86_64/os/Packages/c/container-selinux-2.189.0-1.el9.noarch.rpm

before running the install.sh script.

zackbradys commented 1 year ago

I'm seeing the same issue. It's happened on multiple OSes, including Rocky Linux and RHEL. I'll stick to the workaround until the upstream fixes are released.

Also, for the workaround, please ensure you don't run yum update, since that would pull the newer, broken container-selinux back in.
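
If the box cannot avoid updates entirely, one possible way to keep the pinned package in place is the dnf versionlock plugin (a sketch, assuming the plugin package is available on your distro):

# install the versionlock plugin, then pin container-selinux at the working version
dnf install -y python3-dnf-plugin-versionlock
dnf versionlock add container-selinux
# once fixed RPMs ship, remove the lock and update normally
dnf versionlock delete container-selinux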

galal-hussein commented 1 year ago

I have released new RPMs for rke2-selinux, rke2-server, rke2-agent, and rke2-common in the testing channel. They appear to work fine on EL9 distros with the recent changes to container-selinux.
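
For anyone who wants to try the testing RPMs before they reach stable, the install script accepts a channel override via the same INSTALL_RKE2_CHANNEL variable used above (a sketch; the testing channel is not meant for production):

curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=testing sh -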

clemenko commented 1 year ago

Confirmed working. Will install.sh pull it automatically? How soon until QA is complete and the RPM is published?

vctrstrm commented 1 year ago

+1. Running into this across all of the latest AMIs for Rocky 8 and RHEL 8 on AWS and GCP.

kkch commented 1 year ago

+1. Running into this on a RHEL 8 AMI on AWS; I can also confirm that the testing channel seems to fix the issue.

rancher-max commented 1 year ago

Testing has concluded; details are provided in https://github.com/rancher/rke2/issues/4285#issuecomment-1562086153 and https://github.com/rancher/rke2-selinux/issues/33#issuecomment-1562089891.

This should be fixed now via the install script at https://get.rke2.io and the testing channel. The latest and stable channels will get the fix in line with the May patch releases (very soon).

clemenko commented 1 year ago

When will we see the rke2-selinux package updated?

Also:

[root@rke1 ~]# curl -sfL https://get.rke2.io | INSTALL_RKE2_CHANNEL=v1.24 sh -
[INFO]  using stable RPM repositories
[INFO]  using 1.24 series from channel stable
Rancher RKE2 Common (v1.24)                                                                                                                         666  B/s | 389  B     00:00    
Errors during downloading metadata for repository 'rancher-rke2-common-stable':
  - Status code: 404 for https://rpm.rancher.io/rke2/stable/common/centos/9/noarch/repodata/repomd.xml (IP: 172.67.129.95)
Error: Failed to download metadata for repo 'rancher-rke2-common-stable': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried
[root@rke1 ~]# 

Did this break too?
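
A quick way to check whether the repo metadata is actually published, independent of the install script, is to request the repomd.xml from the error message directly (plain curl, nothing rke2-specific):

curl -sI https://rpm.rancher.io/rke2/stable/common/centos/9/noarch/repodata/repomd.xml | head -n 1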

zackbradys commented 1 year ago

^^ I'm curious as well! We definitely need a seamless experience for customers; SELinux is a hard requirement across the board.

rancher-max commented 1 year ago

Ah, we went ahead and reverted the install script changes for now. We'll update everything at the same time. Currently the fix is only in the testing channel, so we'll have to wait to update everything until we release the May patches (later this week or early next).

clemenko commented 1 year ago

Can we re-open this issue until the rke2-selinux package is fixed?

I just tested with the latest testing.9 release and I am still seeing issues. RKE2 starts, but zero Longhorn PVCs are able to attach.

May 26 14:40:48 localhost setroubleshoot[34188]: SELinux is preventing /usr/sbin/iscsiadm from using the dac_override capability.

*****  Plugin dac_override (91.4 confidence) suggests   **********************

If you want to help identify if domain needs this access or you have a file with the wrong permissions on your system
Then turn on full auditing to get path information about the offending file and generate the error again.
Do

Turn on full auditing
# auditctl -w /etc/shadow -p w
Try to recreate AVC. Then execute
# ausearch -m avc -ts recent
If you see PATH record check ownership/permissions on file, and fix it,
otherwise report as a bugzilla.

*****  Plugin catchall (9.59 confidence) suggests   **************************

If you believe that iscsiadm should have the dac_override capability by default.
Then you should report this as a bug.
You can generate a local policy module to allow this access.
Do
allow this access for now by executing:
# ausearch -c 'iscsiadm' --raw | audit2allow -M my-iscsiadm
# semodule -X 300 -i my-iscsiadm.pp

Results from ausearch -m avc -ts recent:

time->Fri May 26 14:45:50 2023
type=PROCTITLE msg=audit(1685112350.374:4306): proctitle=697363736961646D002D6D00646973636F76657279002D740073656E6474617267657473002D700031302E34322E302E38
type=PATH msg=audit(1685112350.374:4306): item=1 name="/var/lib/iscsi/nodes/iqn.2019-10.io.longhorn:pvc-7ec0aeb4-5d3a-4996-be4c-1bb9cab001a4/10.42.0.8,3260,1" nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=PATH msg=audit(1685112350.374:4306): item=0 name="/var/lib/iscsi/nodes/iqn.2019-10.io.longhorn:pvc-7ec0aeb4-5d3a-4996-be4c-1bb9cab001a4/" inode=100878733 dev=fc:01 mode=040600 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:iscsi_var_lib_t:s0 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0
type=CWD msg=audit(1685112350.374:4306): cwd="/"
type=SYSCALL msg=audit(1685112350.374:4306): arch=c000003e syscall=83 success=no exit=-13 a0=557b96f3e690 a1=1f8 a2=ffffffffffffff00 a3=0 items=2 ppid=132762 pid=134791 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
type=AVC msg=audit(1685112350.374:4306): avc:  denied  { dac_override } for  pid=134791 comm="iscsiadm" capability=1  scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
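
For reference, the local module that the audit2allow command in the setroubleshoot output would generate looks roughly like this (a sketch; my-iscsiadm is just a placeholder name, and granting dac_override is a diagnostic workaround, not a fix for the underlying labeling problem). First the policy source, my-iscsiadm.te:

module my-iscsiadm 1.0;

require {
        type iscsid_t;
        class capability dac_override;
}

allow iscsid_t self:capability dac_override;

Then build and load it with the standard toolchain:

checkmodule -M -m -o my-iscsiadm.mod my-iscsiadm.te
semodule_package -o my-iscsiadm.pp -m my-iscsiadm.mod
semodule -i my-iscsiadm.pp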

galal-hussein commented 1 year ago

@clemenko As far as I can see, we have no iSCSI-specific or Longhorn-related policies in our policy set, nor can I see any iSCSI-related policies in container-selinux.

clemenko commented 1 year ago

Then why does it work with the older container-selinux package? The two issues are related; it has to do with how the containers are labeled.
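
One way to see what actually changed between the two container-selinux versions is to inspect the contexts directly (a sketch with standard tools):

# what context is iscsiadm running under?
ps -eZ | grep iscsiadm
# how are the iSCSI state files labeled?
ls -lZ /var/lib/iscsi/
# what capability rules does the loaded policy grant iscsid_t? (sesearch is in setools-console)
sesearch -A -s iscsid_t -c capability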

zackbradys commented 1 year ago

Are we able to re-open this issue until all of the issues around it are resolved? If we don't have a complete fix, especially with supported and prospective customers still reaching out to us about it, it should not be closed. @galal-hussein

rancher-max commented 1 year ago

Our general practice is to close issues after fixes have been validated, but before they are fully available in all channels. This will be available with the upcoming releases, which will be out and in all channels later this week.

clemenko commented 1 year ago

FYI this appears to still be broken.

zackbradys commented 1 year ago

+1 to Andy's comment. Tested with RKE2 v1.24.14, and Longhorn is still failing due to SELinux.
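
When reporting these, a quick sanity check to pin down exactly which policy packages are in play (the same query rancher-max runs further down):

rpm -qa container-selinux rke2-selinux rke2-server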

p-kimberley commented 1 year ago

@galal-hussein I'm also still experiencing this issue.

With a clean Rocky Linux 9.2 VM, attempting to install RKE2 using the install script fails because etcd fails to start.

Works with Rocky 8.8.


rancher-max commented 1 year ago

Hmm, what are your settings? I see that same scenario working:

$ uname -a
Linux ip-172-31-42-128.us-east-2.compute.internal 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/os-release 
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"
$ rpm -qa container-selinux rke2-server rke2-selinux
container-selinux-2.205.0-1.el9_2.noarch
rke2-selinux-0.13-1.el9.noarch
rke2-server-1.25.10~rke2r1-0.el9.x86_64
$ kubectl get nodes,pods -A -o wide
NAME                                               STATUS   ROLES                       AGE     VERSION           INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                      KERNEL-VERSION                 CONTAINER-RUNTIME
node/ip-xxx-xx-xx-xxx.us-east-2.compute.internal   Ready    control-plane,etcd,master   9m43s   v1.25.10+rke2r1   xxx.xx.xx.xxx   <none>        Rocky Linux 9.2 (Blue Onyx)   5.14.0-284.11.1.el9_2.x86_64   containerd://1.7.1-k3s1

NAMESPACE     NAME                                                                       READY   STATUS      RESTARTS   AGE     IP              NODE                                          NOMINATED NODE   READINESS GATES
kube-system   pod/cloud-controller-manager-ip-xxx-xx-xx-xxx.us-east-2.compute.internal   1/1     Running     0          9m32s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/etcd-ip-xxx-xx-xx-xxx.us-east-2.compute.internal                       1/1     Running     0          9m      xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-canal-2p4t5                                          0/1     Completed   0          9m13s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-coredns-7b9zv                                        0/1     Completed   0          9m13s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-ingress-nginx-4bzh2                                  0/1     Completed   0          9m13s   10.42.0.2       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-metrics-server-8t4gm                                 0/1     Completed   0          9m13s   10.42.0.7       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-crd-gjxg6                        0/1     Completed   0          9m13s   10.42.0.4       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-l5gnt                            0/1     Completed   1          9m13s   10.42.0.6       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-rb2p7                    0/1     Completed   0          9m13s   10.42.0.3       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-apiserver-ip-xxx-xx-xx-xxx.us-east-2.compute.internal             1/1     Running     0          9m40s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-controller-manager-ip-xxx-xx-xx-xxx.us-east-2.compute.internal    1/1     Running     0          9m34s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-xx-xxx.us-east-2.compute.internal                 1/1     Running     0          9m28s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-scheduler-ip-xxx-xx-xx-xxx.us-east-2.compute.internal             1/1     Running     0          9m34s   xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-canal-gdnq7                                                       2/2     Running     0          9m4s    xxx.xx.xx.xxx   ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-6b9548f79f-hfcrx                             1/1     Running     0          9m5s    10.42.0.5       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-57647bc7cf-229f7                  1/1     Running     0          9m5s    10.42.0.8       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-ingress-nginx-controller-wgblh                                    1/1     Running     0          8m3s    10.42.0.13      ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-metrics-server-78b84fff48-j827c                                   1/1     Running     0          8m22s   10.42.0.9       ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-snapshot-controller-849d69c748-8gnst                              1/1     Running     0          8m9s    10.42.0.12      ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-snapshot-validation-webhook-654f6677b-n95tx                       1/1     Running     0          8m19s   10.42.0.11      ip-xxx-xx-xx-xxx.us-east-2.compute.internal   <none>           <none>
p-kimberley commented 1 year ago

@rancher-max Thanks for the reply. It turns out this was a missing firewalld rule (port 2380/tcp for etcd). After adding the rule, the Rocky 9.2 node joined the existing cluster.
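
For anyone hitting the same symptom, the rule in question looks like this (a sketch; run it on each server node that needs to accept etcd peer traffic):

firewall-cmd --permanent --add-port=2380/tcp
firewall-cmd --reload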

zackbradys commented 1 year ago

@p-kimberley Glad to hear it. The most recent release on 30 May 2023 fixed this issue.