kubernetes-sigs / vsphere-csi-driver

vSphere storage Container Storage Interface (CSI) plugin
https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/index.html
Apache License 2.0

For Windows mount points, check the directory mode to detect symlinks instead of using the CSI proxy isSymLink() function, to handle mount points corrupted by a node reboot #2868

Closed akankshapanse closed 5 months ago

akankshapanse commented 5 months ago

What this PR does / why we need it: If a Windows worker node crashes or reboots abruptly, the mount point directory that the CO creates for published CSI volumes is sometimes corrupted, and the symlink/directory can no longer be read after the reboot or crash. This causes volume attach to fail once the nodes are back up. This PR allows the mount point on such a Windows worker node to be unmounted and then reformatted/remounted after the reboot or crash, when the directory exists but cannot be read because of a problem that occurred during the crash.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #

Testing done: The fix was tested on the ST setup where the issue was originally seen. There, after a node reboot, pods running on that Windows worker node entered the "Unknown" state and stayed in it until they were either rescheduled to another node or restarted with the mount point cleaned up. With the fix, all pods on nodes that crashed or rebooted returned to the Running state some time after the node came back up.

Special notes for your reviewer:

Release note:

For Windows mount points, check the directory mode to detect symlinks instead of using the CSI proxy isSymLink() function, to handle mount points corrupted by a node reboot

divyenpatel commented 5 months ago

/ok-to-test

akankshapanse commented 5 months ago

block vanilla windows precheckin pipeline result:

Ran 83 of 848 Specs in 15245.334 seconds
FAIL! -- 44 Passed | 39 Failed | 0 Pending | 765 Skipped
--- FAIL: TestE2E (15245.66s)
FAIL
Ginkgo ran 1 suite in 4h15m11.714112564s
Test Suite Failed

Ran 39 of 848 Specs in 20026.357 seconds
FAIL! -- 32 Passed | 7 Failed | 0 Pending | 809 Skipped
--- FAIL: TestE2E (20026.66s)
FAIL
Ginkgo ran 1 suite in 5h35m1.340395727s
Test Suite Failed

akankshapanse commented 5 months ago

another block vanilla windows precheckin pipeline result:

PR 2868
Ran 79 of 848 Specs in 30970.815 seconds
FAIL! -- 78 Passed | 1 Failed | 3 Flaked | 0 Pending | 769 Skipped
--- FAIL: TestE2E (30970.99s)
FAIL
Ginkgo ran 1 suite in 8h37m23.052645747s
Test Suite Failed

xing-yang commented 5 months ago

/approve

k8s-ci-robot commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akankshapanse, divyenpatel, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubernetes-sigs/vsphere-csi-driver/blob/master/OWNERS)~~ [divyenpatel,xing-yang]

Approvers can indicate their approval by writing `/approve` in a comment

Approvers can cancel approval by writing `/approve cancel` in a comment