hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes
MIT License

Volume is not attached to the instance, but VolumeAttachment already exists #571

Closed: Preisschild closed this issue 10 months ago

Preisschild commented 10 months ago

TL;DR

We sometimes have Pods with PVCs stuck in the ContainerCreating stage forever: mounting the filesystem fails because the Hetzner Volume isn't attached to the instance in the first place.

Expected behavior

Volumes should always be attached to the instance automatically.

Observed behavior

The VolumeAttachment already exists and kubelet tries to mount the device with LUKS, but it can't find the device since it isn't attached to the instance.
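
For reference, a sketch of how the mismatch can be checked (IDs taken from the events below; this assumes the hcloud CLI is installed):

```
# The VolumeAttachment object exists and reports ATTACHED=true:
kubectl get volumeattachment | grep pvc-6e95d0dd-2bb7-456f-9c91-652ca2c37b19

# ...but on the node, the block device is missing:
ls -l /dev/disk/by-id/ | grep HC_Volume_100094245

# Cross-check against the Hetzner API: the "Server" field shows where the
# volume is actually attached (or that it is detached):
hcloud volume describe 100094245
```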

Minimal working example

I'm not sure how to reproduce this yet; I think it may be a race condition when a pod is evicted from one node and then scheduled onto a new one.

Log output

Pod (with PVC mount) events:

  Normal   Scheduled               22m                  default-scheduler        Successfully assigned logging/loki-backend-1 to production-worker-cpx41-talos-v1-6-3-rhnd2
  Warning  FailedAttachVolume      22m                  attachdetach-controller  Multi-Attach error for volume "pvc-6e95d0dd-2bb7-456f-9c91-652ca2c37b19" Volume is already exclusively attached to one node and can't be attached to another
  Warning  FailedAttachVolume      22m (x2 over 22m)    attachdetach-controller  AttachVolume.Attach failed for volume "pvc-6e95d0dd-2bb7-456f-9c91-652ca2c37b19" : rpc error: code = Unavailable desc = failed to publish volume: server is locked
  Normal   SuccessfulAttachVolume  22m                  attachdetach-controller  AttachVolume.Attach succeeded for volume "pvc-6e95d0dd-2bb7-456f-9c91-652ca2c37b19"
  Warning  FailedMount             110s (x18 over 22m)  kubelet                  MountVolume.SetUp failed for volume "pvc-6e95d0dd-2bb7-456f-9c91-652ca2c37b19" : rpc error: code = Internal desc = failed to publish volume: unable to format device /dev/disk/by-id/scsi-0HC_Volume_100094245 with LUKS: Device /dev/disk/by-id/scsi-0HC_Volume_100094245 does not exist or access denied.

hcloud-csi-controller logs:

level=info ts=2024-02-02T11:26:27.771643845Z component=api-volume-service msg="detaching volume from server" volume-id=100094245 server-id=42662172
level=info ts=2024-02-02T11:26:31.415375267Z component=api-volume-service msg="attaching volume" volume-id=100094245 server-id=42662212
level=info ts=2024-02-02T11:26:32.629127262Z component=api-volume-service msg="failed to attach volume" volume-id=100094245 server-id=42662212 err="cannot perform operation because volume is locked (locked)"
level=info ts=2024-02-02T11:26:32.732865407Z component=api-volume-service msg="attaching volume" volume-id=100094245 server-id=42662212
level=info ts=2024-02-02T11:26:33.590758558Z component=api-volume-service msg="failed to attach volume" volume-id=100094245 server-id=42662212 err="cannot perform operation because volume is locked (locked)"
level=info ts=2024-02-02T11:26:33.737412265Z component=api-volume-service msg="attaching volume" volume-id=100094245 server-id=42662212
level=info ts=2024-02-02T11:26:34.669116169Z component=api-volume-service msg="volume is already attached to this server" volume-id=100094245 server-id=42662212

Additional information

I don't know why it logs "cannot perform operation because volume is locked (locked)". I did not manually lock the disk in the Hetzner Console.


apricote commented 10 months ago

> I don't know why it logs "cannot perform operation because volume is locked (locked)". I did not manually lock the disk in the Hetzner Console.

When you make changes to Volumes (or Servers, ...), the resource is "locked" in the API; the lock is lifted once the action has finished.
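
For illustration, this is roughly what the lock looks like at the API level (a sketch reusing the IDs from the logs above; the exact response body may differ):

```
# An attach request sent while a previous action on the volume is still
# running is rejected with the "locked" error code:
curl -s -X POST \
  -H "Authorization: Bearer $HCLOUD_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"server": 42662212}' \
  https://api.hetzner.cloud/v1/volumes/100094245/actions/attach
# => {"error":{"code":"locked","message":"cannot perform operation because volume is locked"}}
```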


This looks pretty weird to me, so I took a look at the API requests & Actions in our backend. I noticed three things which, in tandem, might explain the symptoms:

Volume Incident

We currently have elevated error rates when attaching volumes, which might be related: https://status.hetzner.com/incident/e50d7b28-d0a7-4014-ba14-be70f46032ff

Multiple detach requests

I saw three detach requests arriving within tens of milliseconds of each other. Are you by any chance running three replicas of the controller without enabling leader election?

If you want to run with multiple replicas, please make sure to enable leader election on all sidecar containers: https://github.com/hetznercloud/csi-driver/blob/604cbb3fed1b4b045c76d1b99f6ef4ec2c2eab01/chart/templates/controller/deployment.yaml#L83-L86
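
For illustration, a minimal sketch of what that looks like in the controller Deployment. The flag is the standard --leader-election option of the upstream CSI sidecars; the image versions below are placeholders:

```yaml
# Controller Deployment excerpt (sketch): each sidecar that issues API
# calls needs leader election so that only one replica acts at a time.
containers:
  - name: csi-attacher
    image: registry.k8s.io/sig-storage/csi-attacher:v4.4.2  # placeholder version
    args:
      - --leader-election
  - name: csi-provisioner
    image: registry.k8s.io/sig-storage/csi-provisioner:v3.6.2  # placeholder version
    args:
      - --leader-election
  - name: csi-resizer
    image: registry.k8s.io/sig-storage/csi-resizer:v1.9.2  # placeholder version
    args:
      - --leader-election
```

With leader election enabled, the replicas coordinate through a Lease object and only the elected leader issues attach/detach calls, so duplicate detach requests like the ones above should disappear.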

Failed attach

It's not indicated in your logs, but I found a single attach action that was marked as failed. This is possibly related to the previous two problems, as one of the other replicas might have processed that attach.

Preisschild commented 10 months ago

Ah, that explains it. We were running multiple replicas, but without leader election enabled. Thank you.

Closing this issue

apricote commented 10 months ago

That at least explains the missing log messages and the locked volumes. Things should be better with a single replica doing the work; the events & logs should then be easier to understand.