[bug] csi azuredisk node is crashlooping when ~20 pods with PV are scheduled at the same time

kubernetes-sigs / azuredisk-csi-driver

Azure Disk CSI Driver

Apache License 2.0

[bug] csi azuredisk node is crashlooping when ~20 pods with PV are scheduled at the same time #2416

Closed grzesuav closed 4 months ago

grzesuav commented 4 months ago

What happened:

The CSI azuredisk node pod is crashlooping, which prevents volumes from being mounted, so the pods stay in the Pending state forever.

Details in https://github.com/Azure/AKS/issues/4421

What you expected to happen:

It works

How to reproduce it:

Schedule 20 pods with PVs at the same time on one node.

Anything else we need to know?:

N/A

Environment:

CSI Driver version: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.8
Kubernetes version (use kubectl version): 1.27.x

andyzhangx commented 4 months ago

@grzesuav what is your PV size? Please email me your cluster FQDN and we will increase the memory limit right away. By the way, a fix (with a 600Mi memory limit) is already rolling out in the 20240716 release.

grzesuav commented 4 months ago

As you can see, our instances are being killed when they reach 1200 Mb. We have an admission webhook, so I am able to increase the limit myself. I reported the issue because I expected some code changes to address it, such as rate limiting volume operations if they cannot be executed concurrently.
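To illustrate the kind of throttling meant here, below is a minimal sketch of capping how many volume operations run at once with a buffered-channel semaphore. It is not code from the driver: `opLimiter`, `run`, and `simulate` are hypothetical names, the cap of 4 is an arbitrary value, and the "operation" is a stand-in that only counts concurrent callers.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// opLimiter caps how many operations may run at the same time.
// Hypothetical type for illustration; not part of azuredisk-csi-driver.
type opLimiter struct {
	slots chan struct{}
}

func newOpLimiter(max int) *opLimiter {
	return &opLimiter{slots: make(chan struct{}, max)}
}

// run waits for a free slot (or for ctx to be cancelled), then executes op.
func (l *opLimiter) run(ctx context.Context, op func() error) error {
	select {
	case l.slots <- struct{}{}: // acquired a slot
	case <-ctx.Done():
		return ctx.Err()
	}
	defer func() { <-l.slots }() // release the slot
	return op()
}

// simulate pushes n fake operations through a limiter with the given cap
// and returns the highest number of operations observed running at once.
func simulate(n, max int) int64 {
	lim := newOpLimiter(max)
	var inFlight, peak int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = lim.run(context.Background(), func() error {
				cur := atomic.AddInt64(&inFlight, 1)
				// Record a new peak if this is the highest count seen so far.
				for {
					old := atomic.LoadInt64(&peak)
					if cur <= old || atomic.CompareAndSwapInt64(&peak, old, cur) {
						break
					}
				}
				atomic.AddInt64(&inFlight, -1)
				return nil
			})
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	// 20 simultaneous requests, at most 4 (arbitrary cap) running at once.
	fmt.Println("peak concurrent operations:", simulate(20, 4))
}
```

With a cap like this, 20 simultaneous requests queue up instead of all running together, which bounds peak memory use at the cost of longer mount latency.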

Why does the PV size matter?

andyzhangx commented 4 months ago

We found a similar issue when formatting a big PV (e.g. 6TB). Anyway, increasing the memory limit to 600Mi should help. If you increase the CSI driver memory limit yourself on the DaemonSet, it will be reverted. Let me know if you want the AKS team to increase that limit, thanks.

grzesuav commented 4 months ago

Most of our disks were 16GB.
We have a mutating webhook for pods; it works and won't be reverted.
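As a rough sketch of what such a webhook might emit, the snippet below builds an RFC 6902 JSON Patch that sets one container's memory limit. Everything specific here is an assumption for illustration: the helper `memoryLimitPatch`, the container list and the container name `azuredisk`, and the value `1Gi`. It also assumes the container already has a `resources.limits` object, since the `add` operation needs the parent to exist.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// patchOp is a single RFC 6902 JSON Patch operation.
type patchOp struct {
	Op    string `json:"op"`
	Path  string `json:"path"`
	Value string `json:"value"`
}

// memoryLimitPatch returns a JSON Patch that sets the memory limit of the
// container called name. containers lists the pod's container names in spec
// order. The second result is false when the container is not found.
// Hypothetical helper for illustration only.
func memoryLimitPatch(containers []string, name, limit string) ([]byte, bool) {
	for i, c := range containers {
		if c != name {
			continue
		}
		ops := []patchOp{{
			// "add" replaces the value if the member already exists.
			Op:    "add",
			Path:  fmt.Sprintf("/spec/containers/%d/resources/limits/memory", i),
			Value: limit,
		}}
		b, err := json.Marshal(ops)
		if err != nil {
			return nil, false
		}
		return b, true
	}
	return nil, false
}

func main() {
	// Container names and the limit value are assumptions, not taken from the issue.
	containers := []string{"liveness-probe", "node-driver-registrar", "azuredisk"}
	if patch, ok := memoryLimitPatch(containers, "azuredisk", "1Gi"); ok {
		fmt.Println(string(patch))
	}
}
```

Because the patch is applied to each pod at admission time rather than to the DaemonSet object, it matches the point made above that the change is not reverted.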