Tianyang-Zhang opened 3 years ago
This smells like an issue in the container runtime, potentially related to SELinux.
Can you reproduce it with SELinux disabled?
Can you reproduce it when replacing PMEM-CSI with some other CSI driver, for example https://github.com/kubernetes-csi/csi-driver-host-path?
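If disabling SELinux node-wide is not an option, another way to narrow it down is to pin the pods that share the volume to the same SELinux context, so the runtime does not relabel the shared directory differently per pod. A minimal sketch of a pod-level securityContext; the MCS level shown is an arbitrary example, not taken from any real cluster:

```yaml
# Hypothetical: force a fixed SELinux MCS level on the pod so that all pods
# sharing the volume use the same label. "s0:c123,c456" is a made-up example.
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c123,c456"
```

If the problem disappears with identical levels on all sharing pods, that points at per-pod volume relabeling rather than at the CSI driver.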
I tried to reproduce this on our QEMU cluster, but without success; it worked:
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod-b 1/1 Running 0 13s 192.168.200.68 pmem-csi-pmem-govm-worker3 <none> <none>
sleep-qkzxr 1/1 Running 0 4m13s 192.168.200.67 pmem-csi-pmem-govm-worker3 <none> <none>
sleep-rj7qx 1/1 Running 0 4m13s 192.168.133.132 pmem-csi-pmem-govm-worker1 <none> <none>
sleep-ssrs7 1/1 Running 0 4m13s 192.168.220.67 pmem-csi-pmem-govm-worker2 <none> <none>
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec pod-b -- ls /tmp/memverge
runc-process670585825
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- ls /tmp/memverge
runc-process561773745
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- touch /tmp/memverge/foo
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec pod-b -- ls /tmp/memverge
foo
runc-process553107404
systemd-private-b887230389c949ce9a1d9e64bdcec54b-chronyd.service-xJQLMi
systemd-private-b887230389c949ce9a1d9e64bdcec54b-dbus-broker.service-vVaVSe
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-logind.service-ko8uti
systemd-private-b887230389c949ce9a1d9e64bdcec54b-systemd-resolved.service-jsPHjj
pohly@pohly-desktop:/nvme/gopath/src/github.com/intel/pmem-csi$ kubectl exec sleep-qkzxr -- mount
overlay on / type overlay (rw,seclabel,relatime,lowerdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/55/fs,upperdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/56/fs,workdir=/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/56/work)
proc on /proc type proc (rw,nosuid,nodev,noexec,relatime)
tmpfs on /dev type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
devpts on /dev/pts type devpts (rw,seclabel,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=666)
mqueue on /dev/mqueue type mqueue (rw,seclabel,nosuid,nodev,noexec,relatime)
sysfs on /sys type sysfs (ro,seclabel,nosuid,nodev,noexec,relatime)
tmpfs on /sys/fs/cgroup type tmpfs (ro,seclabel,nosuid,nodev,noexec,relatime,mode=755)
cgroup on /sys/fs/cgroup/systemd type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/freezer type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/memory type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/perf_event type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/devices type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/pids type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (ro,seclabel,nosuid,nodev,noexec,relatime,cpuset)
/dev/ndbus0region0fsdax/csi-66-bcbfbd4fad181ad3a7f1eb7d641e996ba48246a2f8e0ec39bc54b489 on /pmem type xfs (rw,seclabel,relatime,attr2,dax=always,inode64,logbufs=8,logbsize=32k,noquota)
tmpfs on /tmp/memverge type tmpfs (rw,seclabel,nr_inodes=409600)
/dev/vda1 on /etc/hosts type ext4 (rw,seclabel,relatime)
/dev/vda1 on /dev/termination-log type ext4 (rw,seclabel,relatime)
/dev/vda1 on /etc/hostname type ext4 (rw,seclabel,relatime)
/dev/vda1 on /etc/resolv.conf type ext4 (rw,seclabel,relatime)
shm on /dev/shm type tmpfs (rw,seclabel,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /var/run/secrets/kubernetes.io/serviceaccount type tmpfs (ro,seclabel,relatime)
proc on /proc/bus type proc (ro,relatime)
proc on /proc/fs type proc (ro,relatime)
proc on /proc/irq type proc (ro,relatime)
proc on /proc/sys type proc (ro,relatime)
proc on /proc/sysrq-trigger type proc (ro,relatime)
tmpfs on /proc/acpi type tmpfs (ro,seclabel,relatime)
tmpfs on /proc/kcore type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/keys type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/latency_stats type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/timer_list type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/sched_debug type tmpfs (rw,seclabel,nosuid,size=65536k,mode=755)
tmpfs on /proc/scsi type tmpfs (ro,seclabel,relatime)
tmpfs on /sys/firmware type tmpfs (ro,seclabel,relatime)
Here are the objects that I used. Local volume (same as in description):
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: shared-volume
spec:
  capacity:
    storage: 8Gi
  accessModes:
  - ReadWriteMany
  storageClassName: local-storage
  local:
    path: /tmp
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: storage
          operator: In
          values:
          - pmem
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-volume-claim
spec:
  storageClassName: local-storage
  volumeName: shared-volume
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 8Gi
DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sleep
spec:
  selector:
    matchLabels:
      name: sleep
  template:
    metadata:
      labels:
        name: sleep
    spec:
      containers:
      - name: sleep
        image: busybox
        command:
        - sleep
        - "1000000"
        volumeMounts:
        - name: memverge
          mountPath: /tmp/memverge
        - name: pmem-csi-ephemeral-volume
          mountPath: /pmem
      volumes:
      - name: memverge
        persistentVolumeClaim:
          claimName: shared-volume-claim
      - name: pmem-csi-ephemeral-volume
        csi:
          driver: pmem-csi.intel.com
          fsType: "xfs"
          volumeAttributes:
            size: "100Mi"
Pod:
apiVersion: v1
kind: Pod
metadata:
  name: pod-b
spec:
  containers:
  - name: sleep
    image: busybox
    command:
    - sleep
    - "1000000"
    volumeMounts:
    - name: memverge
      mountPath: /tmp/memverge
  volumes:
  - name: memverge
    persistentVolumeClaim:
      claimName: shared-volume-claim
Does it perhaps matter where volumes are mounted inside the containers?
I would avoid mounting volumes on top of each other, if it can be avoided. I don't have a particular reason, it just seems unnecessarily complicated.
For hostpath, distributed provisioning from v1.6.0 would be needed to get all pods of the DaemonSet running. But it looks like I broke CSI ephemeral volume support in that driver when adding capacity simulation in that release. Somehow that didn't show up in tests... because CSI ephemeral volume support is not tested with that driver. Will fix both.
To use hostpath:
make push REGISTRY_NAME=pohly IMAGE_TAGS=2021-03-09-2
HOSTPATHPLUGIN_REGISTRY=pohly HOSTPATHPLUGIN_TAG=2021-03-09-2 /nvme/gopath/src/github.com/kubernetes-csi/csi-driver-host-path/deploy/kubernetes-distributed/deploy.sh
If you don't want to build it yourself, you can also use the image that I pushed and deploy directly.
Then use this DaemonSet:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: sleep
spec:
  selector:
    matchLabels:
      name: sleep
  template:
    metadata:
      labels:
        name: sleep
    spec:
      containers:
      - name: sleep
        image: busybox
        command:
        - sleep
        - "1000000"
        volumeMounts:
        - name: memverge
          mountPath: /tmp/memverge
        - name: pmem-csi-ephemeral-volume
          mountPath: /pmem
      volumes:
      - name: memverge
        persistentVolumeClaim:
          claimName: shared-volume-claim
      - name: pmem-csi-ephemeral-volume
        csi:
          driver: hostpath.csi.k8s.io
Thanks for the information. I forgot to mention that the issue was found in the OpenShift environment. I'm not sure if this is caused by OpenShift.
I will try not mounting on the same path.
@Tianyang-Zhang Did using different paths help?
Sorry about the late update. I tried using a different path (/home/shared), but am still having this issue. SELinux was disabled.
[root@memory-machine-28ql5 /]# ls -l /home/shared/
ls: cannot open directory '/home/shared/': Permission denied
[root@memory-machine-28ql5 /]# ls -l /home/
total 0
drwxr-xr-x. 3 root root 22 Apr 9 00:08 etc
drwxr-xr-x. 1 root root 29 Apr 9 00:08 memverge
drwxr-xr-x. 3 root root 22 Apr 9 00:08 opt
drwxrwsr-x. 3 root 1000960000 81 Apr 9 23:15 shared
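The group ID 1000960000 and the setgid bit (the `s` in drwxrwsr-x) are what OpenShift's fsGroup handling normally produces, so those permission bits themselves look fine; with classic permissions like these, `ls` should succeed, which again points at the SELinux label (visible with `ls -Z`, which `ls -l` does not show). A local sketch, using only temporary paths, just to confirm that drwxrwsr-x is the expected rendering of a setgid, group-writable directory:

```shell
# Recreate the permission pattern seen on /home/shared locally:
# mode 2775 = setgid (2) + rwxrwxr-x (775), shown by ls/stat as drwxrwsr-x.
d=$(mktemp -d)
mkdir "$d/shared"
chmod 2775 "$d/shared"
stat -c '%A' "$d/shared"   # prints: drwxrwsr-x
rm -rf "$d"
```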
Can you reproduce it with the CSI hostpath driver instead of PMEM-CSI? v1.6.2 should work out of the box, i.e. no image building needed.
If yes, then this is something that can be reported to Red Hat.
When trying to create your DaemonSet example, I got this error:
Normal Scheduled 29s default-scheduler Successfully assigned injection/sleep-lvqhv to osc-5k68w-worker-9c42p
Warning FailedMount 13s (x6 over 29s) kubelet MountVolume.NewMounter initialization failed for volume "pmem-csi-ephemeral-volume" : volume mode "Ephemeral" not supported by driver hostpath.csi.k8s.io (no CSIDriver object)
Should I build an image from source?
How did you install the CSI hostpath driver? If you installed via https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/deploy/kubernetes-distributed/deploy.sh, it should have installed a CSIDriver object from https://github.com/kubernetes-csi/csi-driver-host-path/blob/master/deploy/kubernetes-distributed/hostpath/csi-hostpath-driverinfo.yaml
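The "no CSIDriver object" error above means exactly that object is missing. Its relevant part looks roughly like this (check the linked csi-hostpath-driverinfo.yaml for the authoritative version; older deployments may still use storage.k8s.io/v1beta1):

```yaml
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: hostpath.csi.k8s.io
spec:
  volumeLifecycleModes:
  - Persistent
  - Ephemeral
```

You can verify whether it exists with `kubectl get csidriver hostpath.csi.k8s.io`; without `Ephemeral` in volumeLifecycleModes, the kubelet refuses to set up inline CSI ephemeral volumes for that driver.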
I rechecked the whole cluster and found that SELinux had been re-enabled. The issue is gone after disabling SELinux. Sorry about the confusion and the extra time you spent!
But the solution can't be "disable SELinux", right?
It might require some extra work, but ideally it should also work with SELinux enabled - whatever "it" is that was failing.
You are right. It might be related to how a DaemonSet works? I only hit this issue when using a DaemonSet.
FYI, we also reproduced this issue without using any CSI driver on a Diamanti cluster (k8s). Disabling SELinux also fixed it.
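When SELinux is the suspect, the node's audit log usually contains AVC denials, and a plausible mechanism for this symptom is MCS category mismatch: the runtime assigns each pod its own category pair and relabels volumes accordingly, so the last pod to mount a shared volume can relabel it away from the pods that mounted it earlier. On a node you would typically run `ausearch -m avc -ts recent` (or `dmesg | grep -i avc`); the sketch below uses a fabricated denial line (pids, names, and contexts are made up) only to show what to grep for:

```shell
# Fabricated AVC line in the format real denials use; on a real node, collect
# actual ones with: ausearch -m avc -ts recent   (or: dmesg | grep -i avc)
sample='type=AVC msg=audit(1617923700.123:456): avc: denied { read } for pid=4242 comm="ls" name="memverge" scontext=system_u:system_r:container_t:s0:c17,c25 tcontext=system_u:object_r:container_file_t:s0:c31,c42 tclass=dir'

# What was denied:
echo "$sample" | grep -o 'denied { [a-z]* }'
# The tell-tale sign: source context (the pod) and target context (the files)
# carry different MCS category pairs (here c17,c25 vs c31,c42).
echo "$sample" | grep -o 's0:c[0-9]*,c[0-9]*'
```

If your real denials show this pattern between the two pods and the shared directory, that confirms per-pod relabeling rather than a driver bug.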
I created a local PV and PVC with a local storage class (no provisioner) and ReadWriteMany access mode for storage sharing between pods. Then I created a DaemonSet that mounts this volume (path /tmp/memverge). This DaemonSet uses PMEM-CSI to provision PMEM as a CSI ephemeral volume (I'm using OpenShift 4.6 and generic ephemeral volumes somehow are not supported). Everything works fine and I can attach to my pods (say pod A) and access the mounted directory. But if I create another pod (say pod B, running on the same node as pod A) mounting the same local PV, I am no longer able to access /tmp/memverge in pod A and get an error. The permissions in the container are correct.
If I create more pods mounting the same local PV, all of these pods work fine and I am able to access the mounted dir. But not pod A.
If I remove the CSI ephemeral volume part of the DaemonSet and re-do everything, the issue is gone. The volume spec for PMEM-CSI is as follows:
This issue seems to only happen when a DaemonSet is involved. I haven't do