hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

fsGroup sometimes works sometimes breaks #107

Open dns2utf8 opened 4 years ago

dns2utf8 commented 4 years ago

Hi all

I am using this CSI driver to access HPE Nimble storage over Fibre Channel. Lately I noticed that sometimes the fsGroup is not applied to the storage.

Currently, there are three applications on the cluster running on the same node.

  1. Gitlab with working fsGroup
  2. Gitlab without working fsGroup
  3. mfw where the fsGroup works in ~30% of deployments.

The relevant yaml:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: multi-file-writer
  namespace: snapshot-test
spec:
  replicas: 1
  selector:
    matchLabels:
      app: multi-file-writer
  minReadySeconds: 5
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: multi-file-writer
    spec:
      securityContext:
        runAsUser: 65534
        fsGroup: 65534
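For context, when a pod sets fsGroup, the kubelet recursively changes the group ownership of the volume, makes it group-writable, and sets the setgid bit on directories so new files inherit the group. A minimal local sketch of that behavior (a scratch directory standing in for the volume; this is not the driver's actual code, and the paths are made up):

```shell
# Simulate what fsGroup handling effectively does to a mounted volume.
# Uses the current user's primary group so it runs unprivileged.
mkdir -p /tmp/fsgroup-demo/vol
echo data > /tmp/fsgroup-demo/vol/file.txt

# Recursive chgrp + group rwX, as the kubelet applies for fsGroup
chgrp -R "$(id -g)" /tmp/fsgroup-demo/vol
chmod -R g+rwX /tmp/fsgroup-demo/vol

# setgid on the directory so files created later inherit the group
chmod g+s /tmp/fsgroup-demo/vol

# Verify: numeric group and permission string of the directory
stat -c '%g %A' /tmp/fsgroup-demo/vol
```

If this step silently fails or is skipped for a volume, processes running as the fsGroup's GID lose write access, which matches the symptom described above.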

Debugging

The logs did not contain any hints regarding these applications:

grep -ri fsGroup /var/log/nimble* /var/log/syslog

Other containers emitted logs containing FSGroup:nil. Since those containers did not request an fsGroup, that appears to be expected.

Cheers, Stefan

raunakkumar commented 4 years ago

Hi @dns2utf8, could you please upload the logs for us to review? You should be able to collect them using https://github.com/hpe-storage/csi-driver#log-collector. Also, is your issue related to https://github.com/kubernetes/examples/issues/260?

dns2utf8 commented 4 years ago

Hi

Our issue is not related. This setup uses a SAN via Fibre Channel with xfs on the LUNs. The logs from the three nodes are 1.2 GB in total; uploading them will take a while.

dns2utf8 commented 4 years ago

Uploaded the logs here

raunakkumar commented 4 years ago

Thanks, but I am unable to reach https://gitlab.gyselroth.net/stefan.schindler/hpe-nimble-logs.

Did you apply the following parameters in the storage class backing the PVC?

fsOwner (userId:groupId) — the user ID and group ID that should own the root directory of the filesystem.
fsMode (octal digits) — 1 to 4 octal digits representing the file mode to be applied to the root directory of the filesystem.

https://github.com/hpe-storage/csi-driver/tree/master/examples/kubernetes/hpe-nimble-storage#provisioning-parameters
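For reference, a StorageClass carrying those parameters might look like the sketch below (the class name, fstype, and values are illustrative, not taken from this cluster; the fsOwner/fsMode keys are the documented provisioning parameters linked above):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hpe-nimble-fsowner   # illustrative name
provisioner: csi.hpe.com
parameters:
  csi.storage.k8s.io/fstype: xfs
  fsOwner: "65534:65534"     # userId:groupId owning the filesystem root
  fsMode: "0770"             # octal mode applied to the filesystem root
reclaimPolicy: Delete
```

With fsOwner/fsMode set at provisioning time, the ownership of the filesystem root does not depend on the pod's fsGroup being applied at mount time.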

dns2utf8 commented 4 years ago

There appears to be some sort of configuration error. Please use the public instance for now: https://gitlab.com/dns2utf8/hpe-nimble-logs

Since I am on a different project for now, I hope @raffis can answer the pvc question.

raunakkumar commented 4 years ago

Hi @dns2utf8, thanks for the logs. We didn't find anything suspicious in them with respect to fsGroup and runAsUser. We tried some experiments on our cluster and verified that runAsUser and fsGroup are honored. Could you please elaborate on what you meant by only 30% of the cases working? Did the pods never reach the Running state, or were fsGroup and runAsUser not honored? If it's the latter, could you share the output of the commands listed below?

Below is an example of my test

Pod running with user ID 2157:

 kubectl exec -it fsgroup-pod-1 -c pod-datelog-1 -- sh
/ $ ps
PID   USER     TIME  COMMAND
    1 2157      0:10 /bin/sh 
   75 2157      0:00 sh
  681 2157      0:00 sh
  689 2157      0:00 sleep 1
  690 2157      0:00 ps

Volume is mounted with group 1001:

/ $ cd /data
/data $ ls -ltr
total 2048
-rw-r--r--    1 2157     1001       1902168 Feb 13 16:52 mydata.txt

shivamerla commented 4 years ago

@dns2utf8 Can you respond to the comment above if you are still seeing this issue?

dns2utf8 commented 4 years ago

Hi

So the 30% means this: while testing the deployment of the application, I deleted and recreated the resources every now and then. In roughly 1 out of 3 runs the storage would not attach correctly and the software would crash.

raunakkumar commented 4 years ago

Hi @dns2utf8, do you still face the issue with fsGroup? Is the behavior the same without fsGroup?