kubevirt / csi-driver

Volume detach race condition #83

Open · mfranczy opened this issue 1 year ago

mfranczy commented 1 year ago

Is this a BUG REPORT or FEATURE REQUEST?: /kind bug

What happened: While draining KubeVirt infrastructure nodes (bare-metal nodes) we evict the virtual machine workload to other VM nodes. Sometimes (not always) the pod re-created on a different VM during the eviction process gets this error:

Warning FailedAttachVolume 3m40s attachdetach-controller Multi-Attach error for volume "pvc-04bf24ee-a755-4bee-bbcb-559aca75d862" Volume is already exclusively attached to one node and can't be attached to another

Further investigation showed that this happens because a VolumeAttachment resource is not being deleted, due to an error coming from the KubeVirt CSI driver:

Detach Error: Message: rpc error: code = NotFound desc = failed to find VM with domain.firmware.uuid e08f36d8-de8d-5365-a683-5f43a5be323a Time: 2023-02-09T15:45:00Z

There is a race condition between VM and volume attachment deletion.

What you expected to happen: The VolumeAttachment resource for non-existing VMs should be deleted.

How to reproduce it (as minimally and precisely as possible):

  1. Create a simple pod with a PVC that uses a storage class backed by the KubeVirt CSI driver.
  2. Drain the KubeVirt infrastructure node.
  3. Immediately after the previous pod has been deleted, create another one with the same PVC.

That's the easiest way to reproduce it (with a Deployment it's harder to spot the bug, as it depends on reconciliation timing).

Anything else we need to know?: I think this is a problematic part of the code: https://github.com/kubevirt/csi-driver/blob/main/pkg/service/controller.go#L320-L323
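A rough sketch of the detach flow those lines sit in, for context. The lookup and hot-unplug steps are passed in as stand-in functions, not the driver's real methods; the point is that a failed VM lookup is surfaced as a gRPC NotFound error, which is what keeps the VolumeAttachment from ever being deleted:

```go
package sketch

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// unpublishSketch approximates the ControllerUnpublishVolume flow discussed in
// this issue. lookupVM and removeVolume are hypothetical stand-ins for the
// driver's VM lookup (by firmware UUID) and volume hot-unplug.
func unpublishSketch(
	ctx context.Context,
	req *csi.ControllerUnpublishVolumeRequest,
	lookupVM func(ctx context.Context, nodeID string) (string, error),
	removeVolume func(ctx context.Context, vmName, volumeID string) error,
) (*csi.ControllerUnpublishVolumeResponse, error) {
	vmName, err := lookupVM(ctx, req.GetNodeId())
	if err != nil {
		// If the VM was evicted and re-created elsewhere, this is the point
		// where the NotFound error shown in the VolumeAttachment status keeps
		// being returned, so the attachment is never cleaned up.
		return nil, status.Errorf(codes.NotFound, "failed to find VM for node %s: %v", req.GetNodeId(), err)
	}
	if err := removeVolume(ctx, vmName, req.GetVolumeId()); err != nil {
		return nil, err
	}
	return &csi.ControllerUnpublishVolumeResponse{}, nil
}
```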

Environment:

mfranczy commented 1 year ago

I think the easiest fix would be to not return an error when the VM is not found. This would allow the CSI controller to remove the volume attachment.

I could work on it if you approve the solution.
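To illustrate the idea (assuming the lookup error carries gRPC code NotFound, as the VolumeAttachment status suggests), something along these lines, not an agreed-upon patch:

```go
package sketch

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isNotFound reports whether err carries gRPC code NotFound. The proposal
// would turn the "VM not found" branch of ControllerUnpublishVolume into a
// successful return (an empty ControllerUnpublishVolumeResponse), treating the
// volume as already detached from a VM that no longer exists.
func isNotFound(err error) bool {
	s, ok := status.FromError(err)
	return ok && s.Code() == codes.NotFound
}
```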

awels commented 1 year ago

So if I understand this correctly, the CSI driver cannot find the VMI because it no longer exists due to the drain on the infra node, and a new VMI exists on a different node, due to the live migration? And because the original VMI no longer exists, the CSI driver is throwing that error, since it is trying to unpublish from a node that is no longer there.

Instead of ignoring that, maybe we should improve the mechanism that finds the right VMI so it would find the VMI on the new infra node?

mfranczy commented 1 year ago

because it no longer exists due to the drain on the infra node, and a new VMI exists on a different node, due to the live migration?

The VMI is not migrated, it is simply re-created. We have the eviction strategy set to External.

And because the original VMI no longer exists, the CSI driver is throwing that error, since it is trying to unpublish from a node that is no longer there.

Exactly.

Instead of ignoring that, maybe we should improve the mechanism that finds the right VMI so it would find the VMI on the new infra node?

Since we don't live migrate the VMI but re-create it, I don't think that finding it again would help. However, maybe I'm missing something. My reasoning is that a new VMI will not have the volume attached anyway. Executing unpublish on a VMI that doesn't have the volume would probably return an error again (that's my guess, I would have to take a closer look at the implementation).

awels commented 1 year ago

Okay, so if you re-create the VMI then simply ignoring the error would be correct, but if someone live migrates the node, I am not sure ignoring the error is the correct course of action. The driver should do the right thing in both scenarios, IMO.

mfranczy commented 1 year ago

Understood. So when working on the fix I will consider the eviction strategy. Thanks.

awels commented 1 year ago

I just checked the getVMNameByCSINodeID function and it looks up the VMI by vmi.Spec.Domain.Firmware.UUID, not by name or anything like that. So with a live migration this should not error at all. And if you re-create the VMI, should it not also have the same Firmware.UUID? I am not 100% sure whether that statement is true.
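For illustration, the lookup amounts to something like this (not the actual getVMNameByCSINodeID; the case-insensitive comparison and the error text are assumptions):

```go
package sketch

import (
	"fmt"
	"strings"

	kubevirtv1 "kubevirt.io/api/core/v1"
)

// lookupVMIByFirmwareUUID scans the VMIs in the infra namespace for one whose
// firmware UUID matches the CSI node ID (the tenant node's SystemUUID).
func lookupVMIByFirmwareUUID(vmis []kubevirtv1.VirtualMachineInstance, nodeID string) (string, error) {
	for _, vmi := range vmis {
		fw := vmi.Spec.Domain.Firmware
		if fw != nil && strings.EqualFold(string(fw.UUID), nodeID) {
			return vmi.Name, nil
		}
	}
	return "", fmt.Errorf("no VMI with firmware UUID %s", nodeID)
}
```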

awels commented 1 year ago

Just digging a little deeper: the nodeID is the node.Status.NodeInfo.SystemUUID. Maybe we should look at node.status.volumesAttached instead? And if the volume is not found there, treat it as already detached.
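Roughly what I have in mind, assuming we have the tenant cluster's Node object; the helper name and the injected match function are made up for the sketch:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// volumeListedOnNode reports whether the tenant node still advertises an
// attached volume whose unique name satisfies matches. If it returns false,
// the detach could be treated as already done.
func volumeListedOnNode(node *corev1.Node, matches func(corev1.UniqueVolumeName) bool) bool {
	for _, av := range node.Status.VolumesAttached {
		if matches(av.Name) {
			return true
		}
	}
	return false
}
```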

mfranczy commented 1 year ago

Maybe we should look at node.status.volumesAttached instead? And if the volume is not found there, treat it as already detached.

Sounds promising, I will dig into it and see if there are problems with it.

mfranczy commented 1 year ago

And if you re-create the VMI, should it not also have the same Firmware.UUID?

That would block the workload eviction, because to make sure there is no volume attached we would have to wait for the new VMI node to join the cluster.

I like your idea with node.status.volumesAttached.

awels commented 1 year ago

The name of the attached volume in the status is part CSI driver name and part volume name, so I would match on both the CSI driver name and the volume name.
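Something like this, with the caveat that the exact format of the unique name (typically something like kubernetes.io/csi/&lt;driver&gt;^&lt;volumeHandle&gt; for CSI volumes) is an assumption here, so the check stays deliberately loose:

```go
package sketch

import (
	"strings"

	corev1 "k8s.io/api/core/v1"
)

// matchesDriverAndVolume checks that the attached-volume name mentions both
// the CSI driver name and the volume name. It can be plugged into
// volumeListedOnNode from the earlier sketch.
func matchesDriverAndVolume(name corev1.UniqueVolumeName, driverName, volumeName string) bool {
	s := string(name)
	return strings.Contains(s, driverName) && strings.Contains(s, volumeName)
}
```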

kubevirt-bot commented 1 year ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

mfranczy commented 1 year ago

/remove-lifecycle stale

kubevirt-bot commented 10 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

mfranczy commented 10 months ago

/remove-lifecycle stale

kubevirt-bot commented 7 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 6 months ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

awels commented 6 months ago

/remove-lifecycle stale

awels commented 6 months ago

/remove-lifecycle rotten

kubevirt-bot commented 3 months ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

kubevirt-bot commented 2 months ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

haoxiaoci commented 2 months ago

Looking forward to it being fixed too...

awels commented 2 months ago

/remove-lifecycle rotten

awels commented 2 months ago

/lifecycle frozen

haoxiaoci commented 2 months ago

My case: the k8s master was re-created. After the master became ready, the attach/detach controller saw the volume as already attached (node.status.volumesAttached still listed that test PVC), then the volume manager called NodeStageVolume, which failed to mount and kept retrying, and the pod became Unknown. It seems like the same problem as this issue (#83).

It seems that this situation cannot be detected using node.status.volumesAttached alone.

Just digging a little deeper: the nodeID is the node.Status.NodeInfo.SystemUUID. Maybe we should look at node.status.volumesAttached instead? And if the volume is not found there, treat it as already detached.