Open benethon opened 1 year ago
Some other things we've tried today:
Hi @benethon, thank you for bringing this to our attention, and apologies for the delayed response. I have reproduced this behavior on v1.7.1 of the driver, with k8s v1.28. We will have to push out a PR to address this so that the securityContext is not ignored.
postgres-0:/$ ls -lah /var/lib/postgresql/
total 4K
drwxr-xr-x 1 postgres postgres 18 Oct 6 01:04 .
drwxr-xr-x 1 root root 24 Oct 6 01:04 ..
drwx------ 2 2000 2000 6.0K Dec 8 20:32 data
postgres-0:/$ exit
exit
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 csi-driver-1202]$ cat storageclass.yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  name: efs-sc
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: "700"
  fileSystemId: fs-redacted
  gidRangeEnd: "2000"
  gidRangeStart: "1000"
  provisionerID: efs.csi.aws.com
  provisioningMode: efs-ap
provisioner: efs.csi.aws.com
reclaimPolicy: Retain
volumeBindingMode: Immediate
[zatzsea@dev-dsk-zatzsea-1a-5f552df4 csi-driver-1202]$ cat statefulset.yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  minReadySeconds: 10
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  podManagementPolicy: OrderedReady
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: postgres
  serviceName: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - command:
        - sleep
        - infinity
        env:
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        image: timescale/timescaledb:2.12.1-pg14
        imagePullPolicy: IfNotPresent
        name: postgres
        ports:
        - containerPort: 5432
          protocol: TCP
        resources: {}
        volumeMounts:
        - mountPath: /var/lib/postgresql/data
          name: postgresdb
          subPath: pgdata
      restartPolicy: Always
      securityContext:
        fsGroup: 1035
        fsGroupChangePolicy: Always
        runAsNonRoot: true
        runAsUser: 1035
      terminationGracePeriodSeconds: 30
  updateStrategy:
    rollingUpdate:
      partition: 0
    type: RollingUpdate
  volumeClaimTemplates:
  - apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: postgresdb
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
      storageClassName: efs-sc
      volumeMode: Filesystem
@seanzatzdev-amazon Is there any progress on this? I am facing this issue on k8s 1.29 too.
@benethon Did you face the issue on k8s versions < 1.27? If yes, how did you work around it (if you managed to, by any chance)?
@snowmanstark No, we didn't try anything earlier than 1.27. We worked around it temporarily by using an EBS volume rather than EFS.
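For reference, a minimal sketch of that kind of EBS-backed stopgap class (not taken from this thread; the class name and the gp3 type are placeholder choices, and it assumes the aws-ebs-csi-driver is installed):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ebs-gp3          # hypothetical name, only for illustration
provisioner: ebs.csi.aws.com
parameters:
  type: gp3              # general-purpose SSD volume type
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer

With a block-backed volume like EBS, kubelet applies the fsGroup ownership change on the mounted filesystem itself, which is presumably why the securityContext behaves as expected there.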
@nishant221 Does your PR #1152 fix this issue? If yes, did the 1.7.6 release ship #1152?
@seanzatzdev-amazon Is the fix for this issue in the 1.7.6 release?
@seanzatzdev-amazon Is there any update on this fix? My StatefulSet is completely useless without this being fixed.
Same problem here.
Mounted as 1002
❯ kubectl exec -it pod/atlantis-0 -n atlantis -c atlantis -- bash
atlantis@atlantis-0:/$ ls -lah
total 4.0K
drwxr-xr-x. 1 root root 72 May 15 13:23 .
drwxr-xr-x. 1 root root 72 May 15 13:23 ..
drwxrwxr-x. 4 1002 1002 6.0K May 15 13:23 atlantis-data
The securityContext is 1000:
securityContext:
  fsGroup: 1000
  fsGroupChangePolicy: Always
  runAsUser: 1000
PVC:
❯ kubectl get pvc -n atlantis atlantis-data -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    meta.helm.sh/release-name: atlantis
    meta.helm.sh/release-namespace: atlantis
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: efs.csi.aws.com
    volume.kubernetes.io/storage-provisioner: efs.csi.aws.com
  creationTimestamp: "2024-05-15T13:13:15Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: atlantis
    app.kubernetes.io/managed-by: Helm
    chart: atlantis-5.0.2
    helm.sh/chart: atlantis-5.0.2
    heritage: Helm
    release: atlantis
  name: atlantis-data
  namespace: atlantis
  resourceVersion: "565062419"
  uid: 058bb4a6-8151-4eed-bff6-94e1e3de065c
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: efs-sc
  volumeMode: Filesystem
  volumeName: pvc-058bb4a6-8151-4eed-bff6-94e1e3de065c
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
SC
❯ kubectl get storageclass efs-sc -o yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    meta.helm.sh/release-name: aws-efs-csi-driver
    meta.helm.sh/release-namespace: kube-system
  creationTimestamp: "2022-03-30T18:58:11Z"
  labels:
    app.kubernetes.io/managed-by: Helm
  name: efs-sc
  resourceVersion: "72583228"
  uid: 481c4ac5-b2e5-4c25-bbfb-a07c5104532b
mountOptions:
- tls
parameters:
  basePath: /dynamic_provisioning
  directoryPerms: "775"
  fileSystemId: fs-00a080564b9d87ff4
  gidRangeEnd: "2000"
  gidRangeStart: "1000"
  provisioningMode: efs-ap
provisioner: efs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
I had to use a workaround and create a new storage class on which I set the gid and uid to what I need, and then reference this new storage class in my StatefulSet. Hope that helps. If you would like to see the manifest, let me know and I can share it when I get home.
@gbsingh1993
That's what I had to do as well: create a specific SC just for this app. Which is fine, but it's definitely not ideal.
Creating a storage class with a fixed group ID of 1000 works for me:
> cat storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    meta.helm.sh/release-name: efs-csi
    meta.helm.sh/release-namespace: core-efs-csi
  labels:
    app.kubernetes.io/managed-by: Helm
  name: efs-sc-gid1000
parameters:
  directoryPerms: "700"
  fileSystemId: fs-[YOUR_FILE_SYSTEM_ID]
  gid: "1000"
  uid: "1000"
  provisioningMode: efs-ap
provisioner: efs.csi.aws.com
reclaimPolicy: Delete
volumeBindingMode: Immediate
> kubectl apply -f storageclass.yaml
Then, in the PVC / volumeClaimTemplate, use:
storageClassName: efs-sc-gid1000
And in the pod spec, use:
securityContext:
  fsGroup: 1000
  runAsUser: 1000
  runAsGroup: 1000
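To make it concrete, here is a minimal sketch of where those two fragments land in a StatefulSet. This is not from the thread: the demo-app name, busybox image, and 5Gi size are placeholders; only the efs-sc-gid1000 class name and the 1000 IDs come from the comment above.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: demo-app
spec:
  serviceName: demo-app
  replicas: 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      securityContext:          # pod-level IDs matching the uid/gid baked into efs-sc-gid1000
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
      - name: demo-app
        image: busybox:1.36     # placeholder image
        command: ["sleep", "infinity"]
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: efs-sc-gid1000   # the dedicated fixed-GID class from above
      resources:
        requests:
          storage: 5Gi          # required by the API; EFS does not enforce a size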
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
- After a period of inactivity, lifecycle/stale is applied
- After a further period of inactivity once lifecycle/stale was applied, lifecycle/rotten is applied
- After a further period of inactivity once lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
/kind bug
What happened?
I have two clusters, one on v1.27 named eks-poc and another new one created recently on v1.28 called eks-dev. On eks-dev, I noticed that despite setting a securityContext to force the fsGroup of the mounted volume to be 1035 [1], the driver doesn't respect this and instead sets it to 1999 (one off the upper limit set in the storage class [2]). We didn't have this problem on eks-poc, but we updated it this morning to v1.28 and the problem appeared, so to me it seems the issue is related to Kubernetes 1.28.

I installed the EFS driver manually on eks-poc and via the EKS add-on on eks-dev. The image of the efs-plugin container in the efs-csi-controller pod on eks-poc is 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/aws-efs-csi-driver:v1.5.7 and the image on eks-dev is 602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/aws-efs-csi-driver:v1.7.1 - so different driver versions, but the common factor that makes it stop working is Kubernetes 1.28.

Some other things we've tried: rolling back the eks-dev EFS driver version to 1.5.7 - the problem still happens, but the POSIX user is now 1000 rather than 199x. I haven't checked the changelog, but I assume the allocator was switched around to count down from the maximum GID, as evidenced by the log line "Allocator found GID which is already in use: -1 - trying next one."

What you expected to happen?

The mounted volume to be owned by user 1035 from the securityContext, not the one set by the provisioner.
How to reproduce it (as minimally and precisely as possible)?
[1] Statefulset YAML (also happening on other deployments, also note the command to override the entrypoint):
[2] StorageClass:
This outputs (with 1.5.7) - note the 1000 user, not 1035
Environment
Kubernetes version (use kubectl version): see above

Please also attach debug logs to help us better diagnose
Attached: csi-provisioner.txt, efs-plugin.txt