okd-project / okd

The self-managing, auto-upgrading, Kubernetes distribution for everyone
https://okd.io
Apache License 2.0

iscsiadm blocked by SELinux from mounting OpenEBS PVs #1438

Closed ceagan closed 1 month ago

ceagan commented 1 year ago

Describe the bug During the upgrade from 4.11.0-0.okd-2022-11-19-050030 to 4.11.0-0.okd-2022-12-02-145640, OpenEBS PVs started failing to mount. This blocked the upgrade from completing because it affected image-registry. We traced the problem to SELinux denying iscsiadm the dac_override capability. Disabling SELinux on the host node allowed the mount and the upgrade to complete. We had to do this on every node that had a PV, including nodes unrelated to the upgrade, in order to mount all the OpenEBS PVs used by worker pods. We then re-enabled SELinux on each node.

Version 4.11.0-0.okd-2022-12-02-145640

How reproducible Unknown

Log bundle https://drive.google.com/file/d/1PgUlirAJMVFmbdim9QdMXq-HpEkHB-4i/view?usp=share_link

Relevant Logs

Dec 09 18:05:00.451300 okd-node-02.okd.example.com hyperkube[1888]: I1209 18:05:00.451136 1888 reconciler.go:254] "operationExecutor.MountVolume started for volume \"pvc-1ad97794-f713-453e-8044-3b6605abd75c\" (UniqueName: \"kubernetes.io/csi/cstor.csi.openebs.io^pvc-1ad97794-f713-453e-8044-3b6605abd75c\") pod \"example-fcos-moderate-infra-rs-76b58ff799-ntwg5\" (UID: \"bbca7b2b-f3ec-4ffb-9616-fb675357e935\") " pod="openshift-compliance/example-fcos-moderate-infra-rs-76b58ff799-ntwg5"
Dec 09 18:05:01.789000 okd-node-02.okd.example.com audit[216728]: AVC avc: denied { dac_override } for pid=216728 comm="iscsiadm" capability=1 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
Dec 09 18:05:01.791499 okd-node-02.okd.example.com kernel: audit: type=1400 audit(1670609101.789:6001): avc: denied { dac_override } for pid=216728 comm="iscsiadm" capability=1 scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
Dec 09 18:05:01.791818 okd-node-02.okd.example.com kernel: audit: type=1300 audit(1670609101.789:6001): arch=c000003e syscall=83 success=no exit=-13 a0=559752146390 a1=1f8 a2=ffffffffffffff00 a3=0 items=0 ppid=216727 pid=216728 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
Dec 09 18:05:01.792035 okd-node-02.okd.example.com kernel: audit: type=1327 audit(1670609101.789:6001): proctitle=2F7362696E2F697363736961646D002D6D00646973636F766572796462002D740073656E6474617267657473002D70003137322E33302E3136302E3232350033323630002D490064656661756C74002D2D646973636F766572
Dec 09 18:05:01.789000 okd-node-02.okd.example.com audit[216728]: SYSCALL arch=c000003e syscall=83 success=no exit=-13 a0=559752146390 a1=1f8 a2=ffffffffffffff00 a3=0 items=0 ppid=216727 pid=216728 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
Dec 09 18:05:01.789000 okd-node-02.okd.example.com audit: PROCTITLE proctitle=2F7362696E2F697363736961646D002D6D00646973636F766572796462002D740073656E6474617267657473002D70003137322E33302E3136302E3232350033323630002D490064656661756C74002D2D646973636F766572
Dec 09 18:05:01.875494 okd-node-02.okd.example.com hyperkube[1888]: E1209 18:05:01.875357 1888 csi_attacher.go:344] kubernetes.io/csi: attacher.MountDevice failed: rpc error: code = Internal desc = failed to find device path: [], last error seen: failed to sendtargets to portal 172.30.160.225:3260, err: iscsiadm error: iscsiadm: No records found (exit status 21)
Dec 09 18:05:01.877426 okd-node-02.okd.example.com hyperkube[1888]: E1209 18:05:01.876036 1888 nestedpendingoperations.go:348] Operation for "{volumeName:kubernetes.io/csi/cstor.csi.openebs.io^pvc-1ad97794-f713-453e-8044-3b6605abd75c podName: nodeName:}" failed. No retries permitted until 2022-12-09 18:07:03.875972643 +0000 UTC m=+9091.515017073 (durationBeforeRetry 2m2s). Error: MountVolume.MountDevice failed for volume "pvc-1ad97794-f713-453e-8044-3b6605abd75c" (UniqueName: "kubernetes.io/csi/cstor.csi.openebs.io^pvc-1ad97794-f713-453e-8044-3b6605abd75c") pod "example-fcos-moderate-infra-rs-76b58ff799-ntwg5" (UID: "bbca7b2b-f3ec-4ffb-9616-fb675357e935") : rpc error: code = Internal desc = failed to find device path: [], last error seen: failed to sendtargets to portal 172.30.160.225:3260, err: iscsiadm error: iscsiadm: No records found (exit status 21)
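
For readers following the log: the PROCTITLE value is the hex-encoded, NUL-separated command line of the denied process. A minimal Python sketch decodes it (the function name decode_proctitle is illustrative and not part of any audit tooling):

```python
def decode_proctitle(hex_title: str) -> list[str]:
    """Decode an audit PROCTITLE value into its argument list."""
    raw = bytes.fromhex(hex_title)
    # Arguments are separated by NUL bytes in the audit record.
    return [arg.decode("utf-8", errors="replace") for arg in raw.split(b"\x00")]


# Value copied verbatim from the PROCTITLE record above.
PROCTITLE = "2F7362696E2F697363736961646D002D6D00646973636F766572796462002D740073656E6474617267657473002D70003137322E33302E3136302E3232350033323630002D490064656661756C74002D2D646973636F766572"

print(" ".join(decode_proctitle(PROCTITLE)))
# prints: /sbin/iscsiadm -m discoverydb -t sendtargets -p 172.30.160.225 3260 -I default --discover
```

The decoded command is the same sendtargets discovery that the kubelet error reports as failed, which ties the AVC denial to the mount failure.
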
vrutkovs commented 1 year ago

Package diff:

Upgraded:

  aardvark-dns 1.2.0-6.fc36 -> 1.3.0-1.fc36
  amd-gpu-firmware 20221012-141.fc36 -> 20221109-144.fc36
  avahi-libs 0.8-15.fc36 -> 0.8-16.fc36
  bash 5.2.2-2.fc36 -> 5.2.9-2.fc36
  btrfs-progs 6.0-1.fc36 -> 6.0.2-1.fc36
  conmon 2:2.1.4-3.fc36 -> 2:2.1.5-1.fc36
  container-selinux 2:2.191.0-1.fc36 -> 2:2.193.0-1.fc36
  curl 7.82.0-9.fc36 -> 7.82.0-11.fc36
  gnutls 3.7.8-2.fc36 -> 3.7.8-3.fc36
  grub2-common 1:2.06-54.fc36 -> 1:2.06-57.fc36
  grub2-efi-x64 1:2.06-54.fc36 -> 1:2.06-57.fc36
  grub2-pc 1:2.06-54.fc36 -> 1:2.06-57.fc36
  grub2-pc-modules 1:2.06-54.fc36 -> 1:2.06-57.fc36
  grub2-tools 1:2.06-54.fc36 -> 1:2.06-57.fc36
  grub2-tools-minimal 1:2.06-54.fc36 -> 1:2.06-57.fc36
  intel-gpu-firmware 20221012-141.fc36 -> 20221109-144.fc36
  kernel 6.0.8-200.fc36 -> 6.0.10-200.fc36
  kernel-core 6.0.8-200.fc36 -> 6.0.10-200.fc36
  kernel-modules 6.0.8-200.fc36 -> 6.0.10-200.fc36
  krb5-libs 1.19.2-11.fc36 -> 1.19.2-12.fc36
  libatomic 12.2.1-2.fc36 -> 12.2.1-4.fc36
  libcurl 7.82.0-9.fc36 -> 7.82.0-11.fc36
  libgcc 12.2.1-2.fc36 -> 12.2.1-4.fc36
  libgomp 12.2.1-2.fc36 -> 12.2.1-4.fc36
  libnghttp2 1.46.0-2.fc36 -> 1.51.0-1.fc36
  libsmbclient 2:4.16.6-0.fc36 -> 2:4.16.7-0.fc36
  libstdc++ 12.2.1-2.fc36 -> 12.2.1-4.fc36
  libwbclient 2:4.16.6-0.fc36 -> 2:4.16.7-0.fc36
  libxcrypt 4.4.30-1.fc36 -> 4.4.33-1.fc36
  linux-firmware 20221012-141.fc36 -> 20221109-144.fc36
  linux-firmware-whence 20221012-141.fc36 -> 20221109-144.fc36
  netavark 1.2.0-5.fc36 -> 1.3.0-1.fc36
  nvidia-gpu-firmware 20221012-141.fc36 -> 20221109-144.fc36
  podman 4:4.3.0-2.fc36 -> 4:4.3.1-1.fc36
  podman-plugins 4:4.3.0-2.fc36 -> 4:4.3.1-1.fc36
  python-pip-wheel 21.3.1-3.fc36 -> 21.3.1-4.fc36
  python-setuptools-wheel 59.6.0-2.fc36 -> 59.6.0-3.fc36
  python3-libs 3.10.8-1.fc36 -> 3.10.8-3.fc36
  rpm-ostree 2022.15-3.fc36 -> 2022.16-1.fc36
  rpm-ostree-libs 2022.15-3.fc36 -> 2022.16-1.fc36
  samba-client-libs 2:4.16.6-0.fc36 -> 2:4.16.7-0.fc36
  samba-common 2:4.16.6-0.fc36 -> 2:4.16.7-0.fc36
  samba-common-libs 2:4.16.6-0.fc36 -> 2:4.16.7-0.fc36
  vim-data 2:9.0.828-1.fc36 -> 2:9.0.963-1.fc36
  vim-minimal 2:9.0.828-1.fc36 -> 2:9.0.963-1.fc36

Most likely it's either

  container-selinux 2:2.191.0-1.fc36 -> 2:2.193.0-1.fc36

or

  rpm-ostree 2022.15-3.fc36 -> 2022.16-1.fc36

@cgwalters could you check whether this is an rpm-ostree regression?

ArthurVardevanyan commented 1 year ago

Similar behavior using Longhorn: https://github.com/longhorn/longhorn/issues/4988

AlexanderWurz commented 1 year ago

Similar issue when using Istio: https://github.com/istio/istio/issues/42485 - for some reason SELinux now behaves differently.

netwarex commented 1 year ago

Does 4.11.0-0.okd-2023-01-14-152430 fix that?

vrutkovs commented 1 year ago

Workaround from the Longhorn bug: https://github.com/longhorn/longhorn/issues/4988#issuecomment-1345676772 (apparently it's applicable to iSCSI too).

Not sure if it's due to the app not requesting dac_override or a genuine Fedora bug - let's report it against the container-selinux package in Fedora?

sfritze commented 1 year ago

We are experiencing the same issue on 4.11.0-0.okd-2022-12-02-145640 using NetApp Trident v22.10 as the storage backend. The event message from a pod trying to use an iSCSI-backed PVC shows "iSCSI Login failed".

What I don't understand: the security context of a working directory in /var/lib/iscsi/nodes is the same as that of a non-working directory. The filesystem looks like this on the node:

sudo ls -al -Z /var/lib/iscsi/nodes/
total 4
drwxr-xr-x. 6 root root system_u:object_r:iscsi_var_lib_t:s0 4096 Jan 18 15:28 .
drwxr-xr-x. 8 root root system_u:object_r:iscsi_var_lib_t:s0   90 Nov 14 14:07 ..
drw-------. 6 root root system_u:object_r:iscsi_var_lib_t:s0  130 Jan 18 15:38 iqn.1992-08.com.netapp:sn.6f75e51c7a2411ed9b05d039ea43322c:vs.73
drw-------. 2 root root system_u:object_r:iscsi_var_lib_t:s0    6 Jan 18 15:24 iqn.1992-08.com.netapp:sn.70b0eecd967c11ed9b05d039ea43322c:vs.74
drw-------. 2 root root system_u:object_r:iscsi_var_lib_t:s0    6 Jan 18 15:12 iqn.1992-08.com.netapp:sn.c978523d674c11ed9b05d039ea43322c:vs.71

The working directory is the one ending in vs.73; it was created manually via

iscsiadm -m discoverydb -t st -p 10.32.148.206:3260 -I default -D

After creating the directory via the mentioned command, everything works fine. Error messages regarding iscsiadm from SELinux:

[ 1452.470079] audit: type=1400 audit(1674055739.214:3822): avc:  denied  { dac_override } for  pid=44518 comm="iscsiadm" capability=1  scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 tclass=capability permissive=0
[ 1452.470082] audit: type=1300 audit(1674055739.214:3822): arch=c000003e syscall=83 success=no exit=-13 a0=55c4daf2b400 a1=1f8 a2=ffffffffffffff00 a3=0 items=0 ppid=3030 pid=44518 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" subj=system_u:system_r:iscsid_t:s0 key=(null)
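
The type=1300 record carries enough detail to see what iscsiadm was attempting. A rough Python sketch, assuming the x86_64 syscall table (arch=c000003e), where syscall 83 is mkdir; the helper name and the one-entry lookup table are illustrative:

```python
import errno
import re

# Assumption: x86_64 syscall numbering; only the syscall seen in this record is listed.
X86_64_SYSCALLS = {83: "mkdir"}


def parse_audit_fields(record: str) -> dict:
    """Split an audit record into its key=value fields."""
    return dict(re.findall(r'(\w+)=("[^"]*"|\S+)', record))


# Fields copied from the type=1300 record above.
RECORD = (
    "arch=c000003e syscall=83 success=no exit=-13 a0=55c4daf2b400 a1=1f8 "
    "a2=ffffffffffffff00 a3=0 items=0 ppid=3030 pid=44518 auid=4294967295 "
    "uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) "
    'ses=4294967295 comm="iscsiadm" exe="/usr/sbin/iscsiadm" '
    "subj=system_u:system_r:iscsid_t:s0 key=(null)"
)

fields = parse_audit_fields(RECORD)
syscall = X86_64_SYSCALLS.get(int(fields["syscall"]), "unknown")
error = errno.errorcode[-int(fields["exit"])]  # exit=-13 is the negated errno
mode = int(fields["a1"], 16)                   # second mkdir argument is the mode

print(f"{syscall} failed with {error}, requested mode {mode:o}")
# prints: mkdir failed with EACCES, requested mode 770
```

So the denied dac_override surfaced as mkdir returning EACCES. One plausible reading, given the drw------- directories listed above, is that iscsiadm needs to bypass the permission bits to create its records there, but that is an inference from the logs rather than something confirmed in this thread.
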
ArthurVardevanyan commented 1 year ago

Here is more testing information. The 4.12 CI branch was working up until the release was cut for 4-stable:

I didn't test CI 4.12.0-0.okd-2023-01-20-161603, but it looks like the same build as stable 4.12.0-0.okd-2023-01-21-055900.

REF: https://amd64.origin.releases.ci.openshift.org

vrutkovs commented 1 year ago

Thanks! Right before the release we switched from FCOS next-devel as a base to FCOS stable (see 4.12.0-0.okd-2023-01-21-055900 -> 4.12.0-0.okd-2023-01-20-101927 changelog). Most likely it's container-selinux 2:2.193.0-1.fc37.noarch → 2:2.198.0-1.fc37.noarch. That means the fix should be coming in the next FCOS stable bump.

Also, in 4.12 you can now create your own OS image and include FCOS testing fixes sooner

AlexanderWurz commented 1 year ago

But will there be a fix for OKD 4.11 that does not require setting SELinux to permissive? Or will this only be tackled in 4.12?

vrutkovs commented 1 year ago

I can build another machine-os-content for OKD 4.11, but we can't push it to stable channel anymore

sfritze commented 1 year ago

But will there be a fix for OKD 4.11 that does not require setting SELinux to permissive? Or will this only be tackled in 4.12?

This may only help for external iSCSI targets, but if you know the portal IP you can run a discovery on all relevant nodes via:

iscsiadm -m discoverydb -t st -p <portal-ip>:3260 -I default -D

This creates the folder correctly, and you do not need to set SELinux to permissive.
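
If there are several portals, the same command can be generated per portal. A small Python sketch, assuming a hand-maintained portal list (the PORTALS list and the function name are hypothetical; the IP is the one from my earlier comment):

```python
import shlex


def discovery_command(portal_ip: str, port: int = 3260) -> list[str]:
    """Build the iscsiadm sendtargets discovery command for one portal."""
    return [
        "iscsiadm", "-m", "discoverydb", "-t", "st",
        "-p", f"{portal_ip}:{port}", "-I", "default", "-D",
    ]


# Hypothetical list of portals to pre-discover on each node.
PORTALS = ["10.32.148.206"]

for ip in PORTALS:
    print(shlex.join(discovery_command(ip)))
# prints: iscsiadm -m discoverydb -t st -p 10.32.148.206:3260 -I default -D
```

The sketch only prints the commands; they still have to be run as root on each node that mounts the volumes.
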

AlexanderWurz commented 1 year ago

I can build another machine-os-content for OKD 4.11, but we can't push it to stable channel anymore

Thanks. In that case we will move to a 4.12 release in the stable channel once a fixed one is out. We tested the first 4.12 stable release, which still has the SELinux issue, so I guess it will be solved in one of the upcoming ones.

ceagan commented 1 year ago

This issue is still present for us in Fedora CoreOS 37.20230110.3.1, which is packaged with OKD 4.12.0-0.okd-2023-02-04-212953.

netwarex commented 1 year ago

As a temporary fix, I have written an article. The fix is specifically for iscsiadm, where dac_override is not allowed, but with a small change it can be used to fix other permissions without disabling SELinux:

https://ioflair.com/blog/fix-longhorn-volumes-stuck-in-attach-detach-loop-on-openshift-okd/
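
The article itself is not reproduced here. As a rough illustration of the kind of rule a targeted policy module needs (an assumption based on the AVC records in this thread, not a copy of the article's steps), the Python sketch below turns a denial line into the matching type-enforcement rule, in the spirit of what audit2allow does. The function name is hypothetical.

```python
import re


def avc_to_allow_rule(avc: str) -> str:
    """Build a type-enforcement allow rule from one AVC denial line."""
    perms = re.search(r"\{([^}]*)\}", avc).group(1).split()
    # The SELinux type is the third colon-separated field of each context.
    source = re.search(r"scontext=[^:\s]+:[^:\s]+:([^:\s]+)", avc).group(1)
    target = re.search(r"tcontext=[^:\s]+:[^:\s]+:([^:\s]+)", avc).group(1)
    tclass = re.search(r"tclass=(\w+)", avc).group(1)
    if target == source:
        target = "self"  # policy syntax for a domain acting on itself
    perm_text = perms[0] if len(perms) == 1 else "{ " + " ".join(perms) + " }"
    return f"allow {source} {target}:{tclass} {perm_text};"


# Denial copied from the audit log earlier in this thread.
AVC = (
    'avc:  denied  { dac_override } for  pid=44518 comm="iscsiadm" capability=1  '
    "scontext=system_u:system_r:iscsid_t:s0 tcontext=system_u:system_r:iscsid_t:s0 "
    "tclass=capability permissive=0"
)

print(avc_to_allow_rule(AVC))
# prints: allow iscsid_t self:capability dac_override;
```

A rule like this goes into a custom policy module loaded on the node; packaging and loading it is what the article covers and is not shown here.
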

vrutkovs commented 1 year ago

Merged @netwarex's fix (https://github.com/openshift/okd-machine-os/pull/541), should be available in the next 4.12 release

netwarex commented 1 year ago

@vrutkovs will this not be fixed in 4.11, or are no more 4.11 OKD releases coming?

vrutkovs commented 1 year ago

No more 4.11 stables are coming (nightlies will still be released, of course). I don't mind cherry-picking it to 4.11, but we'd need confirmation that it's fixed in 4.12 first.

vrutkovs commented 1 year ago

Fix available in amd64.origin.releases.ci.openshift.org/releasestream/4-stable/release/4.12.0-0.okd-2023-03-05-022504

Keeping this open to confirm it's fixed before cherry-picking to 4.11 nightlies.

AlexanderWurz commented 1 year ago

This fix may only resolve the volume issue, not the network-related issues seen with the Istio service mesh, as described here: https://github.com/istio/istio/issues/42485

Unfortunately we still cannot test 4.12, as we first need to migrate our APIs for Kubernetes 1.25.

vrutkovs commented 1 year ago

Reopened #1450 to track the Istio exception; let's continue there.

JaimeMagiera commented 1 month ago

Hi,

We are not working on FCOS builds of OKD any more. Please see these documents...

https://okd.io/blog/2024/06/01/okd-future-statement
https://okd.io/blog/2024/07/30/okd-pre-release-testing

Please test with the OKD SCOS nightlies and file a new issue as needed.

Many thanks,

Jaime