hetznercloud / csi-driver

Kubernetes Container Storage Interface driver for Hetzner Cloud Volumes

Failed to recover after node took the drive (volume) offline #615


SidAtQuanos commented 5 months ago

TL;DR

Hi there,

We had a Kubernetes volume autoscaling mechanism (devops-nirvana) active in our cluster. For a while the autoscaling worked just fine. At some point the operating system of the Kubernetes node took the drive offline. The CSI driver failed to recognize the offline state and kept increasing the PVC.

Expected behavior

The CSI driver should be able to react to offline drives by bringing the drive back online, so that the subsequent resize2fs succeeds. Alternatively, the driver should not allow the PVC to be increased while the corresponding drive is offline.
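For illustration, here is a minimal sketch of the kind of pre-check the node plugin could perform by reading the SCSI device state that the kernel exposes under /sys/block/<dev>/device/state. This is not part of the actual csi-driver code; the device name and function names are assumptions.

// Hypothetical pre-check: refuse to expand when the kernel has marked the
// block device offline. The device name ("sdb") and function names are
// illustrative only.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// deviceState reads the SCSI device state exposed by sysfs,
// e.g. /sys/block/sdb/device/state -> "running" or "offline".
func deviceState(dev string) (string, error) {
	raw, err := os.ReadFile(filepath.Join("/sys/block", dev, "device", "state"))
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(raw)), nil
}

func main() {
	state, err := deviceState("sdb")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read device state:", err)
		os.Exit(1)
	}
	if state != "running" {
		fmt.Printf("refusing to resize: device state is %q\n", state)
		os.Exit(1)
	}
	fmt.Println("device is online, safe to resize")
}

Writing "running" back to the same sysfs file should bring a device out of the offline state.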

Observed behavior

The PVC got increased. The PV got increased. resize2fs was triggered but failed. Then the next iteration started: the autoscaler checks the reported usage, which did not change because the filesystem was never actually resized, so the PVC got increased again and again.
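To make that feedback loop explicit, here is a small illustrative sketch (not the actual Kubernetes-Volume-Autoscaler, which is written in Python; the 80%/30% values match the configuration posted further below): because the failed resize2fs leaves the reported filesystem size unchanged, usage stays above the threshold and the PVC request keeps growing every interval.

// Illustrative sketch of the runaway resize loop described above.
package main

import "fmt"

func main() {
	const (
		scaleAbovePercent = 80.0 // autoscaler threshold
		scaleUpPercent    = 30.0 // grow the PVC by 30% each time
	)

	usedBytes := 9.0e9        // data actually on the volume, stays constant
	filesystemBytes := 10.0e9 // what df reports; only grows if resize2fs succeeds
	requestedBytes := 10.0e9  // PVC spec.resources.requests.storage

	for i := 1; i <= 5; i++ {
		usagePercent := usedBytes / filesystemBytes * 100
		if usagePercent > scaleAbovePercent {
			// The PVC (and backing volume) grows, but the filesystem resize
			// fails because the device is offline, so filesystemBytes never
			// changes and usage stays above the threshold.
			requestedBytes *= 1 + scaleUpPercent/100
		}
		fmt.Printf("iteration %d: usage=%.0f%%, PVC request=%.0f GB\n",
			i, usagePercent, requestedBytes/1e9)
	}
}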

The CSI driver seems to be completely unaware of these problems (see the second log snippet).

In the minio pod log we saw these messages:

Error: node(: taking drive /export offline: unable to write+read for 2m0.001s (*errors.errorString)
3: internal/logger/logger.go:248:logger.LogAlwaysIf()
2: cmd/xl-storage-disk-id-check.go:1066:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.1()
1: cmd/xl-storage-disk-id-check.go:1085:cmd.(*xlStorageDiskIDCheck).monitorDiskWritable.func1.2()

Minimal working example

No response

Log output

Looked up in journalctl

Apr 07 20:09:58 node-name kernel: sd 2:0:0:1: Capacity data has changed
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: device reset
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 abort
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: Device offlined - not ready after error recovery
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=20s
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: [sdb] tag#91 CDB: Write(10) 2a 00 12 a7 d7 e0 00 00 20 00
Apr 07 20:10:19 node-name kernel: blk_update_request: I/O error, dev sdb, sector 312989664 op 0x1:(WRITE) flags 0x800 phys_seg 4 prio class 0
[... many more errors like that ... ]
Apr 07 20:10:19 node-name kernel: sd 2:0:0:1: rejecting I/O to offline device

=====
Apr 07 19:30:19 node-name kubelet[919]: I0407 19:30:19.030108     919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<pod-name>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resizing filesystem from 26738688 to 34865152 blocks
Apr 07 19:45:49 node-name kernel: EXT4-fs (sdb): resized filesystem to 34865152
Apr 07 19:45:49 node-name kubelet[919]: I0407 19:45:49.435021     919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvcid>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resizing filesystem from 34865152 to 45350912 blocks
Apr 07 19:57:59 node-name kernel: EXT4-fs (sdb): resized filesystem to 45350912

And after the issue occurred:
Apr 07 19:57:59 node-name kubelet[919]: I0407 19:57:59.488576     919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"
Apr 07 20:10:52 node-name kubelet[919]: I0407 20:10:52.301601     919 operation_generator.go:2236] "MountVolume.NodeExpandVolume succeeded for volume \"pvc-<pvc-id>\" (UniqueName: \"kubernetes.io/csi/csi.hetzner.cloud^<id>\") pod \"<podname>\" (UID: \"<uuid>\") node-name" pod="<podnamespace/podname>"

Additional information

Kubernetes: 1.29.1
CSI driver: 2.6.0

apricote commented 2 months ago

Hey @SidAtQuanos,

sorry for the very late response.

SidAtQuanos commented 2 months ago

Hi @apricote,

thanks for looking into this. Honestly, when this error happened with the VolumeAutoscaler active, one of the first things we did was disable the volume autoscaling. And we now try to avoid frequent autoscales like the ones we had back when those two errors came up.

So ...

Unfortunately I do not have live logs available. Would this information be sufficient?

apricote commented 2 months ago

Is this the VolumeAutoScaler you were using? I can try to reproduce the issue locally with it, but it sounds like this is very flaky and might not happen to me :(

https://github.com/DevOps-Nirvana/Kubernetes-Volume-Autoscaler

SidAtQuanos commented 2 months ago

Yes, I can confirm this is the volume autoscaler.

This is the configuration

verbose: "false" dry_run: "false" prometheus_url: interval_time: "60" #Seconds scale_after_intervals: "3" scale_above_percent: "80" scale_up_percent: "30" scale_up_min_increment: "1000000000" #Bytes -> 1GB scale_cooldown_time: "300" #Seconds resources: limits: cpu: 20m memory: 75Mi requests: cpu: 20m memory: 75Mi

The minio deployment had 6 pods/PVCs; the starting size of each volume was 10 Gi. The size at which the autoscaling stopped was around '241484536217' (bytes) for one volume; another was at '314069483520'.

I copied over 450 Gi of data from another cluster using mc. I tried to stop and restart the copy script in time to give the autoscaler a chance to react (whenever all PVCs reached more than 80% usage according to Grafana); I was not always successful, and on the first attempt a disk ran full before I noticed. After all files had been copied to minio, all but one PVC had reached their final size. One PVC was at around 81% or 83% usage and was stuck in the endless resize loop because the disk was offline.

It took quite a few iterations, and 1 out of the 6 PVCs was affected, in its last iteration.

Maybe this helps to understand the setup/pressure/...

SidAtQuanos commented 2 months ago

Maybe this approach could work: https://www.perplexity.ai/search/apr-07-20-09-58-node-name-kern-SjE1Y2_MTFqvlxTRtybs.A

So possibly do not try too hard to reproduce the exact same scenario (or something very similar), but rather take the disk offline manually once it is above 80% usage. Installing the autoscaler only after taking the disk offline may be sufficient to still have it trigger the resize and therefore the CSI driver. A sketch of forcing the device offline follows below.
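For reference, a minimal sketch of taking the device offline by hand via sysfs, which mimics what the kernel did in the journal above. The device name sdb is an assumption, and this needs to run as root on the affected node; writing "running" to the same file should bring the device back.

// Force a SCSI block device offline via sysfs to simulate the failure.
// Assumes the affected device is /dev/sdb and that this runs as root on the node.
package main

import (
	"fmt"
	"os"
)

func main() {
	statePath := "/sys/block/sdb/device/state"

	// Writing "offline" tells the kernel to reject further I/O to the device,
	// producing the same "rejecting I/O to offline device" errors as in the journal.
	if err := os.WriteFile(statePath, []byte("offline\n"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, "failed to offline device:", err)
		os.Exit(1)
	}
	fmt.Println("sdb marked offline; write \"running\" to the same file to bring it back")
}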