kubernetes-sigs/aws-efs-csi-driver

CSI Driver for Amazon EFS https://aws.amazon.com/efs/
Apache License 2.0

Deleting a PVC does not delete its PV and access point with dynamic provisioning #1382

Closed: raulgdUOC closed this 6 days ago

raulgdUOC commented 1 week ago

/kind bug

What happened? When I delete the PVC, its PV does not get deleted (even with reclaimPolicy: Delete), and its access point remains. The controller then starts generating logs trying to mount that access point, but the mount fails.

What you expected to happen? When I delete the PVC, the content inside that PV should be deleted along with its access point.

How to reproduce it (as minimally and precisely as possible)? I have been using this to deploy the storage class, pod, and PVC -> https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/dynamic_provisioning/specs
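For reference, the StorageClass from that example looks roughly like this (a sketch; the fileSystemId is a placeholder, and reclaimPolicy: Delete is written out explicitly here even though Delete is already the default for dynamically provisioned volumes):

```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
provisioner: efs.csi.aws.com
reclaimPolicy: Delete
parameters:
  provisioningMode: efs-ap
  fileSystemId: fs-XXXXXXXXXXXXXXX   # placeholder: your EFS file system ID
  directoryPerms: "700"
  gidRangeStart: "1000"              # optional
  gidRangeEnd: "2000"                # optional
  basePath: "/dynamic_provisioning"  # optional
```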

I used Helm to deploy it. In the values, the most important change that I made is: controller.deleteAccessPointRootDir=true
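In values.yaml form, that change is just (a minimal sketch; all other chart values left at their defaults):

```yaml
controller:
  deleteAccessPointRootDir: true
```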

Anything else we need to know?:

Environment

Please also attach debug logs to help us better diagnose

These are the controller logs from when I create a new PV with the storage class:

$ k logs -n kube-system efs-csi-controller-59c98bffc-zhz4c
I0621 16:19:07.507607       1 controller.go:314] Using /dynamic_provisioning/default/efs-claim-6a0453fa-404c-4a5f-aeaa-35e22da6a143 as the access point directory.
I0621 16:19:07.507656       1 cloud.go:205] Calling Create AP with input: {
  ClientToken: "pvc-f24961cc-4663-413c-87a8-ad4dd4f31128",
  FileSystemId: "fs-XXXXXXXXXXXXXXX",
  PosixUser: {
    Gid: 1002,
    Uid: 1002
  },
  RootDirectory: {
    CreationInfo: {
      OwnerGid: 1002,
      OwnerUid: 1002,
      Permissions: "700"
    },
    Path: "/dynamic_provisioning/default/efs-claim-6a0453fa-404c-4a5f-aeaa-35e22da6a143"
  },
  Tags: [{
      Key: "efs.csi.aws.com/cluster",
      Value: "true"
    }]
}
I0621 16:19:07.591270       1 cloud.go:213] Create AP response : {
  AccessPointArn: "arn:aws:elasticfilesystem:eu-west-1:123123123123:access-point/fsap-YYYYYYYYYYYYYYYY",
  AccessPointId: "fsap-YYYYYYYYYYYYYYYY",
  ClientToken: "pvc-f24961cc-4663-413c-87a8-ad4dd4f31128",
  FileSystemId: "fs-XXXXXXXXXXXXXXX",
  LifeCycleState: "creating",
  OwnerId: "123123123123",
  PosixUser: {
    Gid: 1002,
    Uid: 1002
  },
  RootDirectory: {
    CreationInfo: {
      OwnerGid: 1002,
      OwnerUid: 1002,
      Permissions: "700"
    },
    Path: "/dynamic_provisioning/default/efs-claim-6a0453fa-404c-4a5f-aeaa-35e22da6a143"
  },
  Tags: [{
      Key: "efs.csi.aws.com/cluster",
      Value: "true"
    }]
}

After deleting the Pod and the PVC, I start to see these logs:

$ k logs -n kube-system efs-csi-controller-59c98bffc-zhz4c
Mounting arguments: -t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].
E0621 16:27:19.739399       1 mount_linux.go:231] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].
E0621 16:27:19.739482       1 driver.go:106] GRPC error: rpc error: code = Internal desc = Could not mount "fs-XXXXXXXXXXXXXXX" at "/var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].

This is the description of a PV from another test I ran, so some details may not match the previous logs. This is the state after deleting the PVC:

Name:            pvc-b9219e15-e3ce-4e2a-859f-2b09f487f9ab
Labels:          <none>
Annotations:     pv.kubernetes.io/provisioned-by: efs.csi.aws.com
                 volume.kubernetes.io/provisioner-deletion-secret-name:
                 volume.kubernetes.io/provisioner-deletion-secret-namespace:
Finalizers:      [kubernetes.io/pv-protection]
StorageClass:    efs-sc
Status:          Released
Claim:           default/efs-claim
Reclaim Policy:  Delete
Access Modes:    RWX
VolumeMode:      Filesystem
Capacity:        5Gi
Node Affinity:   <none>
Message:
Source:
    Type:              CSI (a Container Storage Interface (CSI) volume source)
    Driver:            efs.csi.aws.com
    FSType:
    VolumeHandle:      fs-XXXXXXXXXX::fsap-YYYYYYYYYY
    ReadOnly:          false
    VolumeAttributes:      storage.kubernetes.io/csiProvisionerIdentity=1324534236656-1234-efs.csi.aws.com
Events:
  Type     Reason              Age   From                                                                                     Message
  ----     ------              ----  ----                                                                                     -------
  Warning  VolumeFailedDelete  0s    efs.csi.aws.com_efs-csi-controller-59c98bffc-zhz4c_77d10c4c-a479-4f57-897a-32a89103112f  rpc error: code = DeadlineExceeded desc = context deadline exceeded
mskanth972 commented 1 week ago

Hi @raulgdUOC, thank you for bringing this issue to our attention. We investigated the problem where persistent volumes were getting stuck during deletion and identified the cause. To address it, we recently updated the CSI provisioner sidecars and adjusted the RBAC permissions specifically for managing persistent volumes. This should fix the issue.
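For reference, the shape of the RBAC change on the provisioner's ClusterRole is roughly the following (a sketch; the ClusterRole name and exact verb list are assumptions, so check the chart's current RBAC template):

```yaml
# Sketch of the relevant rule for the external-provisioner sidecar.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: efs-csi-external-provisioner-role  # name as assumed from the chart
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    # "patch" is the part that matters for deletion: it lets the sidecar
    # remove its finalizer so the PV is not stuck in Released/Terminating.
    verbs: ["get", "list", "watch", "create", "delete", "patch"]
```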

raulgdUOC commented 1 week ago

Hi @mskanth972, thanks for your response,

I have changed the version of the images and updated the ClusterRole, but the problem still persists with the same logs:

Show logs controller:

```
I0625 16:11:28.137400 1 gid_allocator.go:32] Received getNextGid for fsId: fs-XXXXXXXXXXXXXXX, min: 1000, max: 2000
I0625 16:11:28.137425 1 gid_allocator.go:95] Allocator found unused GID: 1000
I0625 16:11:28.137441 1 controller.go:307] Using PV name for access point directory.
I0625 16:11:28.137458 1 controller.go:314] Using /dynamic_provisioning/pvc-2fd8f853-1fc7-4496-af45-836dc3af9b7b as the access point directory.
I0625 16:11:28.137471 1 cloud.go:205] Calling Create AP with input: {
  ClientToken: "pvc-2fd8f853-1fc7-4496-af45-836dc3af9b7b",
  FileSystemId: "fs-XXXXXXXXXXXXXXX",
  PosixUser: {
    Gid: 1000,
    Uid: 1000
  },
  RootDirectory: {
    CreationInfo: {
      OwnerGid: 1000,
      OwnerUid: 1000,
      Permissions: "700"
    },
    Path: "/dynamic_provisioning/pvc-2fd8f853-1fc7-4496-af45-836dc3af9b7b"
  },
  Tags: [{
      Key: "efs.csi.aws.com/cluster",
      Value: "true"
    }]
}
I0625 16:11:28.198865 1 cloud.go:213] Create AP response : {
  AccessPointArn: "arn:aws:elasticfilesystem:eu-west-1:123456789000:access-point/fsap-YYYYYYYYYYYYYYYY",
  AccessPointId: "fsap-YYYYYYYYYYYYYYYY",
  ClientToken: "pvc-2fd8f853-1fc7-4496-af45-836dc3af9b7b",
  FileSystemId: "fs-XXXXXXXXXXXXXXX",
  LifeCycleState: "creating",
  OwnerId: "123456789000",
  PosixUser: {
    Gid: 1000,
    Uid: 1000
  },
  RootDirectory: {
    CreationInfo: {
      OwnerGid: 1000,
      OwnerUid: 1000,
      Permissions: "700"
    },
    Path: "/dynamic_provisioning/pvc-2fd8f853-1fc7-4496-af45-836dc3af9b7b"
  },
  Tags: [{
      Key: "efs.csi.aws.com/cluster",
      Value: "true"
    }]
}
I0625 16:14:27.508814 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:14:27.564180 1 mount_linux.go:244] Detected OS without systemd
I0625 16:14:27.564208 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:14:38.510765 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:14:38.537314 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:14:50.512496 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:14:50.539637 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:15:04.513412 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:15:04.541630 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:15:22.514911 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:15:22.545158 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:15:48.516383 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:15:48.554568 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:16:30.517815 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:16:30.576784 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
I0625 16:17:44.519414 1 controller.go:374] DeleteVolume: called with args {VolumeId:fs-XXXXXXXXXXXXXXX::fsap-YYYYYYYYYYYYYYYY Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}
I0625 16:17:44.577076 1 mount_linux.go:219] Mounting cmd (mount) with arguments (-t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY)
E0625 16:17:59.082977 1 mount_linux.go:231] Mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].
E0625 16:17:59.083086 1 driver.go:106] GRPC error: rpc error: code = Internal desc = Could not mount "fs-XXXXXXXXXXXXXXX" at "/var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY": mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t efs -o tls,iam fs-XXXXXXXXXXXXXXX /var/lib/csi/pv/fsap-YYYYYYYYYYYYYYYY
Output: Could not start amazon-efs-mount-watchdog, unrecognized init system "aws-efs-csi-dri"
Mount attempt 1/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
Mount attempt 2/3 failed due to timeout after 15 sec, wait 0 sec before next attempt.
b'mount.nfs4: Connection timed out'
Warning: config file does not have fips_mode_enabled item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [fips_mode_enabled = False].Warning: config file does not have retry_nfs_mount_command item in section mount.. You should be able to find a new config file in the same folder as current config file /etc/amazon/efs/efs-utils.conf. Consider update the new config file to latest config file. Use the default value [retry_nfs_mount_command = True].
E0625 16:18:00.178484 1 mount_linux.go:231] Mount failed: exit status 32
```

It looks like the controller can't mount the EFS file system at the access point after the PVC is deleted, so it gets stuck with the PV in RECLAIM POLICY: Delete and STATUS: Released, with the error shown in the first message and the logs above.

The procedure I used to test the deleteAccessPointRootDir: true feature is the following:

  1. Create the EFS file system with all the necessary configuration.
  2. Deploy the Helm chart with deleteAccessPointRootDir: true, the updated CSI provisioner sidecars, and the adjusted RBAC permissions.
  3. Deploy the example provided in the Dynamic Provisioning docs.
  4. Check that everything is working by inspecting the logs, executing commands in the pod, and verifying that the pod is writing.
  5. Delete the pod.
  6. Delete the PVC.

Am I doing something wrong, or is this procedure not the intended way to use deleteAccessPointRootDir: true?

mskanth972 commented 1 week ago

> When I delete the PVC, the content inside that PV should be deleted along with its access point.

For the above, the parameter deleteAccessPointRootDir: true definitely needs to be enabled. Can you share the command you are running to delete the PVC, and also where you are assigning the reclaimPolicy: Delete parameter?

raulgdUOC commented 1 week ago

> For the above, the parameter deleteAccessPointRootDir: true definitely needs to be enabled. Can you share the command you are running to delete the PVC, and also where you are assigning the reclaimPolicy: Delete parameter?

I am using kubectl delete pvc efs-claim to delete the PVC.
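For completeness, the PVC being deleted is along these lines (a sketch consistent with the example specs and the PV description above: RWX, 5Gi, StorageClass efs-sc):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: efs-claim
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 5Gi
```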

I am assigning the parameter reclaimPolicy: Delete in the StorageClass, but I also tested it by patching the reclaim policy to Retain.
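For context, that patch is the standard procedure from the Kubernetes docs, i.e. kubectl patch pv <pv-name> -p with a body like this (sketch):

```yaml
# Merge-patch body for kubectl patch pv <pv-name> -p '...'
spec:
  persistentVolumeReclaimPolicy: Retain
```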

raulgdUOC commented 6 days ago

Hello @mskanth972,

The root problem was that when the controller tried to mount the EFS file system, it couldn't reach it, because the Security Group only allowed traffic from the CIDR of my worker nodes. The pod network is different from the node network, so the controller couldn't mount the EFS file system or delete the access point and its contents, and it got stuck in the process. To solve this, I set hostNetwork: true on the controller, and it worked.
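For anyone hitting the same thing, the change amounts to putting the controller pod on the node's network (a sketch; whether the chart exposes this as a value depends on the chart version in use, otherwise the controller Deployment's pod spec can be patched directly):

```yaml
# Overlay/patch for the efs-csi-controller Deployment: with hostNetwork the
# controller uses the node's network namespace, so its traffic matches the
# worker-node CIDR allowed by the EFS Security Group.
spec:
  template:
    spec:
      hostNetwork: true
```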

Additionally, the ClusterRole information you provided also fixed PV deletion. Without the ClusterRole patch, the PV got stuck in the Terminating state.