harvester / harvester

Open source hyperconverged infrastructure (HCI) software
https://harvesterhci.io/
Apache License 2.0
[BUG] Harvester CSI can no longer mount PVCs, stuck "volume attachment is being deleted" #6048

Open Daemonslayer2048 opened 1 week ago

Daemonslayer2048 commented 1 week ago

Describe the bug

Harvester is no longer able to mount any PVCs, causing all workloads to wait indefinitely. The Harvester CSI driver emits the following relevant logs:

csi-attacher I0620 17:23:21.202979       1 controller.go:208] Started VA processing "csi-0ba6c8dc618582bf2ec444222752944fa0b6b9139adecbb62d82b69191a4612e"                                                                                                                              
csi-attacher I0620 17:23:21.202986       1 csi_handler.go:218] CSIHandler: processing VA "csi-0ba6c8dc618582bf2ec444222752944fa0b6b9139adecbb62d82b69191a4612e"
csi-attacher I0620 17:23:21.202989       1 csi_handler.go:269] Starting detach operation for  "csi-0ba6c8dc618582bf2ec444222752944fa0b6b9139adecbb62d82b69191a4612e"                                                                                                                     
csi-attacher I0620 17:23:21.203007       1 csi_handler.go:276] Detaching "csi-0ba6c8dc618582bf2ec444222752944fa0b6b9139adecbb62d82b69191a4612e"                                                                                                                                         
csi-attacher I0620 17:23:21.203041       1 csi_handler.go:742] Found NodeID home-workers-1cb93b3e-k6559 in CSINode home-workers-1cb93b3e-k6559 
csi-attacher I0620 17:23:21.203061       1 connection.go:182] GRPC call: /csi.v1.Controller/ControllerUnpublishVolume                                                                                                                                                                   
csi-attacher I0620 17:23:21.203065       1 connection.go:183] GRPC request: {"node_id":"home-workers-1cb93b3e-k6559","volume_id":"pvc-f969d272-d61d-4091-a1df-f484cc680753"}                                                                                                            
csi-attacher I0620 17:23:21.217166       1 csi_handler.go:620] Saved detach error to "csi-0f6065cdc387a9f529c68ab92477210efa2880a7d340fae4858dc40a362adfcd"                                                                                                                             
csi-attacher I0620 17:23:21.217198       1 csi_handler.go:228] Error processing "csi-0f6065cdc387a9f529c68ab92477210efa2880a7d340fae4858dc40a362adfcd": failed to detach: rpc error: code = Internal desc = Failed to remove volume pvc-97b512ba-0ff4-48e2-8a65-939e0fbcbc72 from node home-workers-1cb93b3e-s89xk: Operation cannot be fulfilled on virtualmachine.kubevirt.io "home-workers-1cb93b3e-s89xk": Unable to remove volume [pvc-97b512ba-0ff4-48e2-8a65-939e0fbcbc72] because it does not exist 

To Reproduce

Steps to reproduce the behavior:

  1. Go to Harvester and gracefully shut down all VMs of a child cluster.
  2. Manually remove all PVCs mounted to all VMs.
  3. Power on all VMs.

Expected behavior

Harvester CSI detects that certain volumes are no longer attached to their respective nodes and takes the necessary actions to resolve this, instead of retrying the failed detach indefinitely.
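
For illustration only, here is a minimal sketch of the kind of behavior requested. The CSI specification describes ControllerUnpublishVolume as idempotent: if the volume is not attached to the node, the plugin is expected to reply OK. This is not the Harvester CSI driver's code (the driver is written in Go). `VolumeNotFoundError`, `remove_volume_from_vm`, and the in-memory `vm_volumes` mapping are hypothetical stand-ins for the real VM API calls.

```python
class VolumeNotFoundError(Exception):
    """Hypothetical error: the volume is not in the VM's volume list."""


def remove_volume_from_vm(vm_volumes, node_id, volume_id):
    """Hypothetical stand-in for the hotplug-removal call against the VM.

    vm_volumes maps a node (VM) name to the set of volume names attached to it.
    """
    volumes = vm_volumes.get(node_id, set())
    if volume_id not in volumes:
        raise VolumeNotFoundError(
            f"Unable to remove volume [{volume_id}] because it does not exist"
        )
    volumes.remove(volume_id)


def controller_unpublish_volume(vm_volumes, node_id, volume_id):
    """Detach a volume, treating 'already detached' as success.

    Returning normally in both cases would let the attacher finish deleting
    the VolumeAttachment instead of saving a detach error and retrying.
    """
    try:
        remove_volume_from_vm(vm_volumes, node_id, volume_id)
    except VolumeNotFoundError:
        # The volume was removed out of band, so there is nothing left to do.
        return "already-detached"
    return "detached"
```

With this shape, a second detach request for a volume that was already removed by hand returns normally rather than surfacing the "does not exist" error seen in the logs above.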

Support bundle

supportbundle_774ced99-ead3-43ed-b689-7136613f6eb2_2024-06-20T18-59-04Z.zip

Daemonslayer2048 commented 1 week ago

After some testing, it looks like manually adding the PVC volumes back to the correct VMs resolves the issue. However, I think a non-manual solution should still be implemented.