hpe-storage / csi-driver

A Container Storage Interface (CSI) driver from HPE
https://scod.hpedev.io
Apache License 2.0

Fix scrubber going too deep into PV data #327

Closed: gbarazer closed this 1 year ago

gbarazer commented 1 year ago

Fix the csi-driver scrubber task walking entire PV filesystems ($KUBELETDIR/pods) when it only needs to find ephemeral_data.json files at a specific depth ($KUBELETDIR/pods/$POD/volumes/kubernetes.io~csi/$PVC/ephemeral_data.json). Walking the whole tree causes lots of problems when pods use PVs containing lots of files/dirs, and is even worse when the PV is provisioned by another CSI driver such as NFS.
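For illustration, here is a minimal sketch of a depth-limited walk under the kubelet pods directory (an illustration only, not the actual patch; the kubelet root and the depth constant are assumptions derived from the layout above):

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func main() {
	root := "/var/lib/kubelet/pods" // hypothetical $KUBELETDIR/pods

	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return nil // skip unreadable entries, keep walking
		}
		rel, relErr := filepath.Rel(root, path)
		if relErr != nil {
			return nil
		}
		// Expected layout: <pod-uid>/volumes/kubernetes.io~csi/<pvc>/ephemeral_data.json,
		// i.e. the target file sits at depth 5 relative to the pods directory.
		depth := len(strings.Split(rel, string(filepath.Separator)))

		if info.IsDir() && depth >= 5 {
			// Never descend into the published volume data itself (e.g. <pvc>/mount).
			return filepath.SkipDir
		}
		if !info.IsDir() && depth == 5 && info.Name() == "ephemeral_data.json" {
			fmt.Println("found:", path)
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk error:", err)
	}
}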

gbarazer commented 1 year ago

Added comments to clarify the path-depth logic used for skipping files, which is suboptimal due to the use of filepath.Walk.
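As an aside, because the file lives at a fixed depth, a hypothetical alternative (illustration only, not part of this PR) could avoid walking altogether by expanding a fixed-depth pattern:

package main

import (
	"fmt"
	"path/filepath"
)

func main() {
	// Assumed kubelet root; only the two wildcarded path components are ever listed,
	// so the PV contents themselves are never touched.
	pattern := "/var/lib/kubelet/pods/*/volumes/kubernetes.io~csi/*/ephemeral_data.json"

	matches, err := filepath.Glob(pattern)
	if err != nil {
		// Glob only fails on a malformed pattern (filepath.ErrBadPattern).
		panic(err)
	}
	for _, m := range matches {
		fmt.Println("ephemeral volume metadata:", m)
	}
}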

Regarding numbers, I don't have benchmark reports, but we have several clusters using the driver that also use volumes provisioned by the NFS CSI driver. Some of these volumes hold millions of files stored in thousands of sharded directories.

We first noticed:

  1. our NFS array processing tens of thousands of metadata IOPS from the moment we deployed the HPE CSI driver
  2. the metadata traffic originated from the nodes on which some pods were stuck in the terminating state
  3. the pods stuck in terminating were the ones using PVCs with lots of files/dirs
  4. other nodes and pods were behaving normally
  5. the Scrubber logs in the csi-driver container indicated oddly long scrubbing times (we are not using ephemeral volumes)

As a first mitigation, we increased the delay between scrub tasks to 24 hours and confirmed that the heavy traffic came from the scrubbing process: each run produced metadata-heavy IOPS for 2-3 hours.

With this fix, the scrubber's metadata traffic overhead is no longer even visible in 10s-resolution metrics, because the task completes in less than a second.

To reproduce/benchmark this, use an NFS volume filled with lots of files/dirs (~200k, even empty files will do); the scrubber task completion time should increase very noticeably. To measure the impact across several nodes, use the same volume in ReadWriteMany mode and deploy a test pod mounting it with a few replicas. Due to the nature of the bug (useless stat() calls across whole filesystems), it is mostly noticeable with shared filesystems such as NFS, which have per-file metadata latency. It can also show up when using the HPE CSI driver exclusively with block/iSCSI volumes: if the volumes contain lots of files/dirs, the scrubber task pollutes the OS page cache with metadata, which can impact overall performance as well.
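One rough way to populate such a volume for the benchmark (a sketch only; the mount path, shard count, and files-per-shard values are arbitrary assumptions, not from this PR):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	mount := "/mnt/nfs-test" // hypothetical mount path of the test PVC
	const shards = 1000      // 1000 dirs x 200 files = 200k empty files
	const filesPerShard = 200

	for s := 0; s < shards; s++ {
		dir := filepath.Join(mount, fmt.Sprintf("shard-%04d", s))
		if err := os.MkdirAll(dir, 0o755); err != nil {
			panic(err)
		}
		for f := 0; f < filesPerShard; f++ {
			// Empty files are enough: only per-file metadata operations matter here.
			fh, err := os.Create(filepath.Join(dir, fmt.Sprintf("file-%04d", f)))
			if err != nil {
				panic(err)
			}
			fh.Close()
		}
	}
	fmt.Println("created", shards*filesPerShard, "empty files under", mount)
}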

raunakkumar commented 1 year ago

Thanks for the details. We do need to ensure scrubbing for ephemeral volumes is not broken, but we will take that up internally. We should be good to merge this PR.

datamattsson commented 1 year ago

The driver passes CSI e2e tests with this patch on Rocky 8 with K8s 1.26 and the scrubber reaps volumes as intended.

Ran 69 of 7345 Specs in 12635.309 seconds
SUCCESS! -- 69 Passed | 0 Failed | 0 Pending | 7276 Skipped
PASS