intel / intel-device-plugins-for-kubernetes

Collection of Intel device plugins for Kubernetes
Apache License 2.0

OCP: SELinux issue on OpenShift 4.9 running SGX with the intel-device-plugins framework #762

Closed Walnux closed 2 years ago

Walnux commented 2 years ago


The issue is:

If I enable SELinux as below on my worker node

sh-4.4# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   enforcing
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

then my init container runs into a "permission denied" error on all volumes mounted in the pod. If I instead set SELinux to permissive, as below,

sh-4.4# sestatus
SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             targeted
Current mode:                   permissive
Mode from config file:          enforcing
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Memory protection checking:     actual (secure)
Max kernel policy version:      33

the operator comes up and runs properly. You can reproduce the issue with the steps below.

Reproduce Steps

First, I have to apply the patches below to set up the SCC according to these documents: SCC in OCP-4.9, Guide to UID, GID

Author: MartinXu <martin.xu@intel.com>
Date:   Thu Nov 18 22:29:51 2021 -0500

    Add SCC hostaccess to manager-role on OpenShift

    So the default SA (Service Account) has the privilege to create pods that access all
    host namespaces, while still requiring pods to run with a UID and SELinux context that
    are allocated to the namespace.

    For detail
    see https://docs.openshift.com/container-platform/4.9/authentication/managing-security-context-constraints.html

diff --git a/deployments/operator/rbac/role.yaml b/deployments/operator/rbac/role.yaml
index 8d19b7a..dd93674 100644
--- a/deployments/operator/rbac/role.yaml
+++ b/deployments/operator/rbac/role.yaml
@@ -176,3 +176,11 @@ rules:
   - get
   - list
   - watch
+- apiGroups:
+  - security.openshift.io
+  resources:
+  - securitycontextconstraints
+  resourceNames:
+  -  hostmount-anyuid
+  verbs:
+  - use
commit 9e3106cef687a7f83ed7daed90575f7e16b16993
Author: Xu <jxu36@jfz1r09h07.otcdcslab.com>
Date:   Thu Nov 18 19:27:25 2021 -0500

    Dropoff securityContext from manager deployment

    OpenShift SCC (Security Context Constraints) is used to manage security
    context. See
    https://cloud.redhat.com/blog/a-guide-to-openshift-and-uids
    https://docs.openshift.com/container-platform/4.9/authentication/managing-security-context-constraints.html

    By default the restricted SCC is used to ensure that pods cannot run as privileged.
    So this commit drops the securityContext settings for running as a non-root user.

diff --git a/deployments/operator/default/manager_auth_proxy_patch.yaml b/deployments/operator/default/manager_auth_proxy_patch.yaml
index 8ba668c..082782f 100644
--- a/deployments/operator/default/manager_auth_proxy_patch.yaml
+++ b/deployments/operator/default/manager_auth_proxy_patch.yaml
@@ -19,11 +19,11 @@ spec:
         ports:
         - containerPort: 8443
           name: https
-        securityContext:
-          runAsNonRoot: true
-          runAsUser: 1000
-          runAsGroup: 1000
-          readOnlyRootFilesystem: true
+          #securityContext:
+          #runAsNonRoot: true
+          #runAsUser: 1000
+          #runAsGroup: 1000
+          #readOnlyRootFilesystem: true
       - name: manager
         args:
         - "--metrics-addr=127.0.0.1:8080"
diff --git a/deployments/operator/manager/manager.yaml b/deployments/operator/manager/manager.yaml
index db335d3..9ee0a94 100644
--- a/deployments/operator/manager/manager.yaml
+++ b/deployments/operator/manager/manager.yaml
@@ -33,11 +33,11 @@ spec:
           requests:
             cpu: 100m
             memory: 20Mi
-        securityContext:
-          runAsNonRoot: true
-          runAsUser: 65532
-          runAsGroup: 65532
-          readOnlyRootFilesystem: true
+        #securityContext:
+        #runAsNonRoot: true
+        #runAsUser: 65532
+        #runAsGroup: 65532
+        #readOnlyRootFilesystem: true
         env:
           - name: DEVICEPLUGIN_NAMESPACE
             valueFrom:
commit fbf8bd8b120ab65fc456d4778fb156214230ffac
Author: MartinXu <martin.xu@intel.com>
Date:   Thu Nov 18 20:45:51 2021 -0500

    Backport https://github.com/intel/intel-device-plugins-for-kubernetes/pull/756

diff --git a/deployments/operator/rbac/role.yaml b/deployments/operator/rbac/role.yaml
index 3e490e5..8d19b7a 100644
--- a/deployments/operator/rbac/role.yaml
+++ b/deployments/operator/rbac/role.yaml
@@ -143,6 +143,12 @@ rules:
   - patch
   - update
   - watch
+- apiGroups:
+  - deviceplugin.intel.com
+  resources:
+  - sgxdeviceplugins/finalizers
+  verbs:
+  - update
 - apiGroups:
   - deviceplugin.intel.com
   resources:

Run the operator manually

Then start the Intel device plugins framework:

$ oc apply -k intel-device-plugins-for-kubernetes/deployments/operator/default/

and start the SGX plugin DaemonSet:

$ oc apply -f intel-device-plugins-for-kubernetes/deployments/operator/samples/deviceplugin_v1_sgxdeviceplugin.yaml

The Intel device plugins framework comes up and runs, and the SGX plugin DaemonSet is also up and running. But the init container in the pod runs into the "permission denied" error when it tries to access the directory /etc/kubernetes/node-feature-discovery/source.d/

Run the operator through OLM

You can also run the operator through OLM:

$ operator-sdk run bundle docker.io/walnuxdocker/intel-device-plugins-operator-bundle:0.22.0

The result is the same as running it manually. This is the volume mount in the pod:

 nodeSelector:
    feature.node.kubernetes.io/custom-intel.sgx: 'true'
    kubernetes.io/arch: amd64
  restartPolicy: Always
  initContainers:
    - name: intel-sgx-initcontainer
      image: 'intel/intel-sgx-initcontainer:0.22.0'
      resources: {}
      volumeMounts:
        - name: nfd-source-hooks
          mountPath: /etc/kubernetes/node-feature-discovery/source.d/
        - name: kube-api-access-nkpq6
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        capabilities:
          drop:
            - MKNOD
        readOnlyRootFilesystem: true

Analysis:

You can see that I assigned the SCC hostmount-anyuid. After I disabled SELinux on worker node 1 with $ sudo setenforce 0, the operator came up and ran on that node. But I left SELinux enabled on worker node 0, and the "permission denied" issue is still there.

After I set the SCC to hostaccess, the "permission denied" issue happens regardless of whether SELinux is enabled or disabled.

The proper way to access a shared directory in a pod

With mountPath: '/etc/kubernetes/node-feature-discovery/source.d/:z' and the SCC hostmount-anyuid, the issue above appears to be resolved: the init container works with SELinux in enforcing mode. According to https://www.redhat.com/sysadmin/user-namespaces-selinux-rootless-containers the root cause might be:

The container engine, Podman, launches each container with a unique process SELinux label (usually container_t) and labels all of the container content with a single label (usually container_file_t). We have rules that state that container_t can read and write all content labeled container_file_t. This simple idea has blocked major file system exploits.

Everything works perfectly until the user attempts a volume mount. The problem with volumes is that they usually only bind mounts on the host. They bring in the labels from the host, which the SELinux policy does not allow the process label to interact with, and the container blows up.
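As an aside, stock Kubernetes exposes SELinux labeling through securityContext.seLinuxOptions rather than a Podman-style ':z' suffix on the mount path; a minimal sketch (the pod name, image, and the spc_t choice are illustrative assumptions, not taken from this issue):

```yaml
# Sketch only: requests an explicit SELinux context for the pod's processes.
# spc_t is the "super privileged container" type that container-selinux
# leaves unconfined, so hostPath labels no longer block access; an SCC must
# permit setting it.
apiVersion: v1
kind: Pod
metadata:
  name: selinux-options-example    # hypothetical name
spec:
  securityContext:
    seLinuxOptions:
      type: spc_t                  # illustrative choice, not from this issue
  containers:
    - name: example
      image: registry.example.com/example:latest   # placeholder image
      volumeMounts:
        - name: nfd-source-hooks
          mountPath: /etc/kubernetes/node-feature-discovery/source.d/
  volumes:
    - name: nfd-source-hooks
      hostPath:
        path: /etc/kubernetes/node-feature-discovery/source.d/
```

With an unconfined type like spc_t the hostPath labels no longer matter, which is essentially what a privileged container achieves; the trade-off is the same loss of SELinux confinement.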

However, the sgxplugin container runs into a permission denied issue:

 initContainers:
    - name: intel-sgx-initcontainer
      image: 'intel/intel-sgx-initcontainer:0.22.0'
      resources: {}
      volumeMounts:
        - name: nfd-source-hooks
          mountPath: '/etc/kubernetes/node-feature-discovery/source.d/:z'
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
      securityContext:
        readOnlyRootFilesystem: false

The error is:

E1130 05:11:07.898395 1 sgx_plugin.go:75] No SGX enclave file available: stat /dev/sgx_enclave: permission denied

I tried to resolve this issue the same way, mounting /dev/sgx_enclave with :z:

  containers:
    - resources: {}
      terminationMessagePath: /dev/termination-log
      name: intel-sgx-plugin
      securityContext:
        readOnlyRootFilesystem: false
      imagePullPolicy: IfNotPresent
      volumeMounts:
        - name: sgxdevices
          mountPath: /dev/sgx
        - name: sgx-enclave
          mountPath: '/dev/sgx_enclave:z'

It runs into the error below, likely because Kubernetes treats the ':z' suffix as part of the literal mount path rather than as a relabel option:

sgx_plugin.go:75] No SGX enclave file available: stat /dev/sgx_enclave: no such file or directory

The proper way to access host devices from the container

After I use the privileged SCC and set privileged: true,

 containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: intel-sgx-plugin
          securityContext:
            privileged: true

the issue above is resolved.

According to https://kubernetes.io/docs/concepts/policy/pod-security-policy/ a "privileged" container is given access to all devices on the host. This allows the container nearly all the same access as processes running on the host. This is useful for containers that want to use Linux capabilities like manipulating the network stack and accessing devices.

I am concerned about granting this privilege. Others have the same concern and have requested a new feature in Kubernetes; see https://github.com/kubernetes/kubernetes/issues/60748

However, since the SGX device plugin has to access the host's SGX devices, it looks like we can only use a privileged container. @mythi what are your comments? :)

Reference to similar projects like SRO

In the Special Resource Operator, it looks like a similar security policy is applied:

https://github.com/openshift/special-resource-operator/blob/master/charts/xilinx/fpga-xrt-driver-4.7.11/templates/1000-driver-container.yaml#L70

mythi commented 2 years ago

the sgx operator could create said machineconfig and trigger a reboot, then the device would be available on the next reboot

@haircommander OK, this sounds like a reasonable workaround until the problem gets fixed in the next/future release. However, I think we should try to leverage SRO+MCO operators for this and not add the functionality into the device plugins operator.

haircommander commented 2 years ago

Gotcha, then maybe SRO would be a good fit for this. Is there a registration of the SGX plugin in the SRO? Ideally this unit would only run when the SGX device is enabled and installing.

Walnux commented 2 years ago

@haircommander if we use SRO, should we install the policy from a container? If we can package the policy into a container and install it through a container, we can use the standard way to release and install policy on OCP. I am trying to do that. Do you know of anyone who has tried it before? It looks like most people suggest installing the policy from an RPM package for now; in that case, we can leverage the MCO.

haircommander commented 2 years ago

Not for policy, but I do know privileged containers are used to configure things on the node. However, it's usually on startup from what I know. Something to think about: if we're using a privileged container to create a file on the host, is that much different from having the SGX plugin container be privileged?

mythi commented 2 years ago

Something to think about: if we're using a privileged container to create a file on the host, is that much different from having the SGX plugin container being privileged?

@haircommander we currently have 6 plugins supported by the operator, so I guess having one centralized one run as privileged is better than having to run all those 6 as privileged. AFAIU, this would also be a stop gap until it's possible to deploy plugins without having to configure these labels separately.

haircommander commented 2 years ago

Good points, makes sense to me. @rhatdan if an SELinux policy is configured, does the node need to be rebooted for it to take effect? (It's possible RHCOS also behaves differently in this case, in which case we may need the reboot anyway.)

rhatdan commented 2 years ago

No, SELinux does not require a reboot, as long as it was enabled in the first place. Policy is instantly applied, and labels are placed on disk by restorecon.

mregmi commented 2 years ago

@rhatdan One other issue we encountered: it looks like socket communication is not allowed between containers. Our plugins use this to communicate, and we had to manually create an SELinux policy to allow it. Is there a way to allow this without deploying a custom SELinux policy? We used a policy something like:

#============= container_t ==============
allow container_t container_runtime_t:unix_stream_socket connectto;

rhatdan commented 2 years ago

What is running as container_runtime_t? The intel-device-plugin?

mregmi commented 2 years ago

The SGX plugin is running as container_t. We got that policy from audit2allow.

rhatdan commented 2 years ago

The allow rule above shows a container attempting to connectto a process running as container_runtime_t, which is the label of the container engine like Podman or CRI-O.

mregmi commented 2 years ago

That's strange. We saw the entry below in the audit log, ran audit2allow, and it gave that rule.

/var/log/audit/audit.log.1:type=AVC msg=audit(1648502191.123:87396): avc: denied { connectto } for pid=1514382 comm="intel_sgx_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_t:s0:c149,c701 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0

I just checked: the plugin is container_t. It's strange the rule came out with container_runtime_t.

sh-4.4# ps -AZ | grep intel_sgx
system_u:system_r:container_t:s0:c612,c793 3927534 ? 00:00:39 intel_sgx_devic
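For what it's worth, the rule follows directly from the record: audit2allow keys the rule on the scontext/tcontext pair, and in the denial above the tcontext (the other end of kubelet.sock) is container_runtime_t even though the plugin process itself runs as container_t. A small sketch pulling the two types out of that exact record:

```shell
# Sketch: extract the SELinux *types* from the AVC record quoted above.
# The denial is decided on the (scontext, tcontext) pair: scontext is the
# plugin process, tcontext is whatever owns the other end of kubelet.sock.
avc='type=AVC msg=audit(1648502191.123:87396): avc: denied { connectto } for pid=1514382 comm="intel_sgx_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:container_t:s0:c149,c701 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=0'

# A context string is user:role:type:level, so the type is the third field.
src=$(printf '%s\n' "$avc" | grep -o 'scontext=[^ ]*' | cut -d: -f3)
tgt=$(printf '%s\n' "$avc" | grep -o 'tcontext=[^ ]*' | cut -d: -f3)
printf 'source type: %s\ntarget type: %s\n' "$src" "$tgt"
# source type: container_t
# target type: container_runtime_t
```

So a generated rule naming container_runtime_t is expected: it reflects the label on the socket's owner, not the label of the plugin process.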

rhatdan commented 2 years ago

See if you can create the AVC again. It might have been an older test.

mregmi commented 2 years ago

I tried it several times, but the policy audit2allow gives is the same. We have modified the policy a bit: we created a new domain/process label for our plugin and gave permission to that label.

sh-4.4# ps -AZ | grep intel
system_u:system_r:container_t:s0:c19,c27 706721 ? 00:08:45 intel_deviceplu
system_u:system_r:intelplugins_t:s0:c545,c815 3769827 ? 00:00:00 intel_sgx_devic

type=AVC msg=audit(1649881114.712:151954): avc: denied { connectto } for pid=3736904 comm="intel_sgx_devic" path="/var/lib/kubelet/device-plugins/kubelet.sock" scontext=system_u:system_r:intelplugins_t:s0:c131,c171 tcontext=system_u:system_r:container_runtime_t:s0 tclass=unix_stream_socket permissive=1

the new policy looks something like this.

policy_module(intelplugins, 1.0)

gen_require(`
        type container_file_t;
        type device_t;
')

container_domain_template(intelplugins)

#============= intelplugins_t ==============
allow intelplugins_t container_runtime_t:unix_stream_socket connectto;
allow intelplugins_t device_t:chr_file getattr;
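For completeness, a sketch of how a module like this is typically built and loaded on the node (the heredoc just reproduces the policy above; the commented build/install steps assume the standard selinux-policy-devel Makefile and require root on an SELinux-enabled host):

```shell
# Write the module source shown above to intelplugins.te.
cat > intelplugins.te <<'EOF'
policy_module(intelplugins, 1.0)

gen_require(`
        type container_file_t;
        type device_t;
')

container_domain_template(intelplugins)

#============= intelplugins_t ==============
allow intelplugins_t container_runtime_t:unix_stream_socket connectto;
allow intelplugins_t device_t:chr_file getattr;
EOF

# On the node (requires the selinux-policy-devel package and root):
#   make -f /usr/share/selinux/devel/Makefile intelplugins.pp
#   semodule -i intelplugins.pp

# Sanity check: the module carries the two allow rules.
grep -c '^allow intelplugins_t' intelplugins.te   # prints 2
```

Per the earlier comment in this thread, loading the module takes effect immediately; no reboot is needed as long as SELinux was already enabled.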
Walnux commented 2 years ago

This issue has been fixed in https://github.com/containers/container-selinux/pull/178, so I am closing it. :)