kubermatic / kubermatic


KubeVirt node eviction leaves VolumeAttachment stuck to deleted Node #11873

Open embik opened 1 year ago

embik commented 1 year ago

What happened?

While testing #11736, I created a PVC to make sure that evicting a virt-launcher pod would allow me to reschedule workloads with storage within the KubeVirt user cluster.

However, I noticed that a Pod trying to mount a volume that was attached to a node evicted on the KubeVirt infra side (the node-eviction-controller drains and deletes the VM and Node object) is stuck with:

Warning  FailedAttachVolume  3m40s  attachdetach-controller  Multi-Attach error for volume "pvc-04bf24ee-a755-4bee-bbcb-559aca75d862" Volume is already exclusively attached to one node and can't be attached to another
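
For reference, this warning shows up in the Pod's event list; a minimal way to check it (assuming the app Pod from the manifest further down in this report):

```sh
# Show the Pod's events, including the FailedAttachVolume warning above.
# "app" is the Pod name used in the manifest provided below.
kubectl describe pod app
```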

I looked for volumeattachment resources and found this one:

NAME                                                                   ATTACHER          PV                                         NODE                                        ATTACHED   AGE
csi-6b42e564b2e31809881c86d5385e7711d0c094bb60039095d14178daabc6ecc0   csi.kubevirt.io   pvc-04bf24ee-a755-4bee-bbcb-559aca75d862   zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr   true       10m
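
A listing like the one above can be produced from the cluster-scoped VolumeAttachment API; a minimal sketch:

```sh
# VolumeAttachments are cluster-scoped objects managed by the attach-detach controller;
# the NODE column shows which node the CSI driver believes the volume is attached to.
kubectl get volumeattachments
```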

This references a node that no longer exists. Looking at the VolumeAttachment in detail, it has a deletion timestamp and the following status:

status:
    attachError:
      message: 'rpc error: code = Unknown desc = Operation cannot be fulfilled on
        virtualmachineinstance.kubevirt.io "zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr":
        Unable to add volume [pvc-04bf24ee-a755-4bee-bbcb-559aca75d862] because it
        already exists'
      time: "2023-02-09T13:17:19Z"
    attached: true
    detachError:
      message: 'rpc error: code = NotFound desc = failed to find VM with domain.firmware.uuid
        6d9a9661-0871-5893-9d13-60a352d74d6e'
      time: "2023-02-09T13:27:11Z"

Expected behavior

The volume should be attachable and mountable on another node, since both the original Pod and the Node it ran on have been terminated.
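
As a manual workaround (an assumption on my side, not something verified as part of this report), clearing the stale VolumeAttachment's finalizers should let the already-requested deletion complete, after which the attach-detach controller can attach the volume to the new node:

```sh
# Hypothetical manual cleanup: remove the finalizer from the stale VolumeAttachment so
# its pending deletion completes. Name taken from the listing above.
VA=csi-6b42e564b2e31809881c86d5385e7711d0c094bb60039095d14178daabc6ecc0
kubectl patch volumeattachment "$VA" --type=merge -p '{"metadata":{"finalizers":null}}'
```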

How to reproduce the issue?

  1. Create KubeVirt user cluster on QA.
  2. Create PVC and Pod from manifests provided below ("Provide your KKP manifests").
  3. Wait for PVC and Pod to be created, scheduled and started.
  4. Use kubectl-evict on the KubeVirt infra cluster, targeting the virt-launcher Pod that backs the Node our app Pod was scheduled to (see the sketch after this list).
  5. Wait for the node to be drained and for a new node to join the cluster.
  6. Re-apply the Pod manifest to mount the PVC again; it should be mountable since no other active Pod is using it.
  7. Observe that the Pod does not start.
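
A rough sketch of step 4 on the KubeVirt infra cluster (the namespace and the exact kubectl-evict invocation are assumptions, adjust to your setup):

```sh
# Find the virt-launcher Pod that backs the user-cluster node; virt-launcher Pods carry
# the kubevirt.io=virt-launcher label and their VMI name matches the node name.
NODE=zhtjh9blrt-worker-w8z64w-5f679f4c95-68tvr
NAMESPACE=cluster-zhtjh9blrt   # assumed namespace of the user cluster on the infra side
kubectl -n "$NAMESPACE" get pods -l kubevirt.io=virt-launcher | grep "$NODE"

# Evict the matching Pod with the kubectl-evict plugin (exact flags depend on the
# plugin version; substitute the Pod name found above).
kubectl evict -n "$NAMESPACE" <virt-launcher-pod-name>
```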

How is your environment configured?

Provide your KKP manifest here (if applicable)

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 4Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
    - name: app
      image: centos
      command: ["/bin/sh"]
      args: ["-c", "while true; do echo $(date -u) >> /data/out.txt; sleep 5; done"]
      volumeMounts:
        - name: persistent-storage
          mountPath: /data
  volumes:
    - name: persistent-storage
      persistentVolumeClaim:
        claimName: pvc
```

What cloud provider are you running on?

KubeVirt

What operating system are you running in your user cluster?

Ubuntu 22.04

Additional information

mfranczy commented 1 year ago

I will work on the issue upstream (https://github.com/kubevirt/csi-driver/issues/83); it should not block the KKP 2.22 release.

kubermatic-bot commented 1 year ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

embik commented 1 year ago

/remove-lifecycle stale

kubermatic-bot commented 1 year ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

embik commented 1 year ago

/remove-lifecycle stale

kubermatic-bot commented 9 months ago

Issues go stale after 90d of inactivity. After a further 30 days, they will turn rotten. Mark the issue as fresh with /remove-lifecycle stale.

If this issue is safe to close now please do so with /close.

/lifecycle stale

embik commented 9 months ago

/remove-lifecycle stale

csengerszabo commented 2 months ago

/remove-priority high

csengerszabo commented 2 months ago

/milestone clear