intel / pmem-csi

Persistent Memory Container Storage Interface Driver
Apache License 2.0

Directory permission issue when using DaemonSet and PMEM-CSI on OpenShift 4.6.9 #912

Open Tianyang-Zhang opened 3 years ago

Tianyang-Zhang commented 3 years ago

I created a local PV and PVC with a local storage class (no provisioner) and ReadWriteMany access mode to share storage between pods:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-volume
spec:
  capacity:
    storage: 8Gi
  accessModes:
  - ReadWriteMany
  storageClassName: local-storage
  local:
    path: /tmp
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: storage
          operator: In
          values:
          - pmem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-volume-claim
spec:
  storageClassName: local-storage
  volumeName: shared-volume
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 8Gi

Then I created a DaemonSet that mounts this volume (path /tmp/memverge). This DaemonSet uses PMEM-CSI to provision PMEM through a CSI ephemeral volume (I'm using OpenShift 4.6, and generic ephemeral volumes are somehow not supported). Everything works fine and I can attach to my pods (say pod A) and access the mounted directory. But if I create another pod (say pod B, running on the same node as pod A) that mounts the same local PV, I am no longer able to access /tmp/memverge in pod A and get this error:

[root@memory-machine-mcz4z /]# ls /tmp/memverge/
ls: cannot open directory '/tmp/memverge/': Permission denied

The permissions inside the container look correct:

[root@memory-machine-mcz4z /]# ls -l /tmp/
total 8
-rwx------.  1 root root 701 Dec  4 17:37 ks-script-esd4my7v
-rwx------.  1 root root 671 Dec  4 17:37 ks-script-eusq_sc5
drwxrwsrwt. 11 root root 520 Mar  5 23:12 memverge

If I create more pods mounting the same local PV, all of these pods work fine and I am able to access the mounted directory. But not pod A.
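A side note on the mode string above: drwxrwsrwt is fully open at the plain Unix permission level (rwx for everyone, plus the setgid and sticky bits), so a "Permission denied" despite it points at something beyond file modes. The string can be decoded with Python's stat module (an illustrative sketch, not part of the original report):

```python
import stat

# "drwxrwsrwt" from the ls output above corresponds to a directory with
# mode 0o3777: setgid (0o2000) + sticky (0o1000) + rwx for user/group/other.
mode = stat.S_IFDIR | 0o3777
print(stat.filemode(mode))  # -> drwxrwsrwt
```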

If I remove the CSI ephemeral volume part of the DaemonSet and redo everything, the issue is gone. The volume spec for PMEM-CSI is as follows:

volumes:
- name: pmem-csi-ephemeral-volume
  csi:
    driver: pmem-csi.intel.com
    fsType: "xfs"
    volumeAttributes:
      size: "20Gi"
This issue only seems to happen when a DaemonSet is involved. I haven't do
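For comparison, the generic ephemeral volume that was not available on OpenShift 4.6 (Kubernetes 1.19, where the feature was still alpha) would look roughly like the sketch below; the storage class name is a placeholder, not something from this report:

```yaml
volumes:
- name: pmem-csi-ephemeral-volume
  ephemeral:
    volumeClaimTemplate:
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: some-pmem-storage-class   # placeholder name
        resources:
          requests:
            storage: 20Gi
```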

pohly commented 3 years ago

This smells like an issue in the container runtime, potentially related to SELinux.

Can you reproduce it with SELinux disabled?

Can you reproduce it when replacing PMEM-CSI with some other CSI driver, for example https://github.com/kubernetes-csi/csi-driver-host-path?

pohly commented 3 years ago

I tried to reproduce this on our QEMU cluster, but without success; it worked:

pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl get pods -o wide
NAME          READY   STATUS    RESTARTS   AGE     IP                NODE                         NOMINATED NODE   READINESS GATES
pod-b         1/1     Running   0          13s     192.168.200.68    pmem-csi-pmem-govm-worker3   <none>           <none>
sleep-qkzxr   1/1     Running   0          4m13s   192.168.200.67    pmem-csi-pmem-govm-worker3   <none>           <none>
sleep-rj7qx   1/1     Running   0          4m13s   192.168.133.132   pmem-csi-pmem-govm-worker1   <none>           <none>
sleep-ssrs7   1/1     Running   0          4m13s   192.168.220.67    pmem-csi-pmem-govm-worker2   <none>           <none>
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec pod-b -- ls /tmp/memverge
runc-process670585825
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- ls /tmp/memverge
runc-process561773745
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- touch /tmp/memverge/foo
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec pod-b -- ls /tmp/memverge
foo
runc-process553107404
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- mount
overlay on / type overlay (rw,seclabel,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/55/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/56/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/56/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,seclabel,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,seclabel,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpuset)
/dev/ndbus0region0fsdax/csi-66-bcbfbd4fad181ad3a7f1eb7d641e996ba48246a2f8e0ec39bc54b489 on /pmem type xfs (rw,seclabel,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
tmpfs on /tmp/memverge type tmpfs (rw,seclabel,nr_inodes=409600)
/dev/vda1 on /etc/hosts type ext4 (rw,seclabel,relatime)
/dev/vda1 on /dev/termination-log type ext4 (rw,seclabel,relatime)
/dev/vda1 on /etc/hostname type ext4 (rw,seclabel,relatime)
/dev/vda1 on /etc/resolv.conf type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /var/run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,seclabel,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
tmpfs on /sys/firmware type tmpfs (ro,seclabel,relatime)
pohly commented 3 years ago

Here are the objects that I used. Local volume (same as in description):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-volume
spec:
  capacity:
    storage: 8Gi
  accessModes:
  - ReadWriteMany
  storageClassName: local-storage
  local:
    path: /tmp
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: storage
          operator: In
          values:
          - pmem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-volume-claim
spec:
  storageClassName: local-storage
  volumeName: shared-volume
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 8Gi

DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sleep
spec:
  selector:
    matchLabels:
      name: sleep
  template:
    metadata:
      labels:
        name: sleep
    spec:
      containers:
      - name: sleep
        image: busybox
        command:
          - sleep
          - "1000000"
        volumeMounts:
        - name: memverge
          mountPath: /tmp/memverge
        - name: pmem-csi-ephemeral-volume
          mountPath: /pmem
      volumes:
      - name: memverge
        persistentVolumeClaim:
          claimName: shared-volume-claim
      - name: pmem-csi-ephemeral-volume
        csi:
          driver: pmem-csi.intel.com
          fsType: "xfs"
          volumeAttributes:
            size: "100Mi"

Pod:

apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  containers:
  - name: sleep
    image: busybox
    command:
      - sleep
      - "1000000"
    volumeMounts:
    - name: memverge
      mountPath: /tmp/memverge
  volumes:
    - name: memverge
      persistentVolumeClaim:
        claimName: shared-volume-claim
pohly commented 3 years ago

Does it perhaps matter where volumes are mounted inside the containers?

I would avoid mounting volumes on top of each other where possible. I don't have a particular reason; it just seems unnecessarily complicated.

pohly commented 3 years ago

For hostpath, distributed provisioning from v1.6.0 would be needed to get all pods of the DaemonSet running. But it looks like I broke CSI ephemeral volume support in that driver when adding capacity simulation in that release. Somehow that didn't show up in tests... because CSI ephemeral volume support is not tested with that driver. Will fix both.

To use hostpath:

If you don't want to build the image yourself, you can also use the image that I pushed and deploy directly.

Then use this DaemonSet:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sleep
spec:
  selector:
    matchLabels:
      name: sleep
  template:
    metadata:
      labels:
        name: sleep
    spec:
      containers:
      - name: sleep
        image: busybox
        command:
          - sleep
          - "1000000"
        volumeMounts:
        - name: memverge
          mountPath: /tmp/memverge
        - name: pmem-csi-ephemeral-volume
          mountPath: /pmem
      volumes:
      - name: memverge
        persistentVolumeClaim:
          claimName: shared-volume-claim
      - name: pmem-csi-ephemeral-volume
        csi:
          driver: hostpath.csi.k8s.io
Tianyang-Zhang commented 3 years ago

Thanks for the information. I forgot to mention that the issue was found in the OpenShift environment. I'm not sure if this is caused by OpenShift.

I will try not mounting on the same path.

pohly commented 3 years ago

@Tianyang-Zhang Did using different paths help?

Tianyang-Zhang commented 3 years ago

Sorry about the late update. I tried using a different path (/home/shared) but still hit this issue. SELinux was disabled.

[root@memory-machine-28ql5 /]# ls -l /home/shared/
ls: cannot open directory '/home/shared/': Permission denied
[root@memory-machine-28ql5 /]# ls -l /home/
total 0
drwxr-xr-x. 3 root root       22 Apr  9 00:08 etc
drwxr-xr-x. 1 root root       29 Apr  9 00:08 memverge
drwxr-xr-x. 3 root root       22 Apr  9 00:08 opt
drwxrwsr-x. 3 root 1000960000 81 Apr  9 23:15 shared
pohly commented 3 years ago

Can you reproduce it with the CSI hostpath driver instead of PMEM-CSI? v1.6.2 should work out of the box, i.e. no image building needed.

If yes, then this is something that can be reported to Red Hat.

Tianyang-Zhang commented 3 years ago

When I tried to create your DaemonSet example, I got this error:

Normal   Scheduled         29s                default-scheduler  Successfully assigned injection/sleep-lvqhv to osc-5k68w-worker-9c42p
Warning  FailedMount       13s (x6 over 29s)  kubelet            MountVolume.NewMounter initialization failed for volume "pmem-csi-ephemeral-volume" : volume mode "Ephemeral" not supported by driver hostpath.csi.k8s.io (no CSIDriver object)

Should I build an image from source?

pohly commented 3 years ago

How did you install the CSI hostpath driver? If you install via https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/deploy/kubernetes-distributed/deploy.sh, it should create a CSIDriver object from https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml
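The "no CSIDriver object" error above means exactly that object is missing. Its shape is sketched below; the authoritative version is the csi-hostpath-driverinfo.yaml linked above:

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: hostpath.csi.k8s.io
spec:
  volumeLifecycleModes:
  - Persistent
  - Ephemeral   # required for CSI ephemeral (inline) volumes
```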

Tianyang-Zhang commented 3 years ago

I rechecked the whole cluster and found that SELinux had been re-enabled. The issue is gone after disabling SELinux. Sorry about the confusion and the extra time you spent!

pohly commented 3 years ago

But the solution can't be "disable SELinux", right?

It might require some extra work, but ideally it should also work with SELinux enabled - whatever "it" is that was failing.
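One known way to keep SELinux enabled while sharing a volume between pods is to pin every pod that mounts it to the same SELinux context, so the runtime does not relabel the directory with per-pod MCS categories (relabeling by a newly started pod would lock out the earlier one, matching the symptom above). A sketch; the level value is an example, not taken from this cluster:

```yaml
# In each pod spec that mounts the shared volume:
securityContext:
  seLinuxOptions:
    level: "s0:c123,c456"   # example categories; use one common value everywhere
```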

Tianyang-Zhang commented 3 years ago

> But the solution can't be "disable SELinux", right?
>
> It might require some extra work, but ideally it should also work with SELinux enabled - whatever "it" is that was failing.

You are right. It might be related to how DaemonSets work? I have only hit this issue when using a DaemonSet.

Tianyang-Zhang commented 3 years ago

FYI, we also reproduced this issue without any CSI driver on a Diamanti cluster (Kubernetes). Disabling SELinux also fixed it.