Persistent volumes get created in 1 availability zone only, causing unwanted pod affinity

w-le commented 2 years ago

AKS clusters are often deployed with 3 nodes in 1 region, with 1 node in each availability zone of that region (e.g. us-west-1, us-west-2, us-west-3). Ideally we want pods to have no affinity to any specific node: pods should be able to run on ANY node (any AZ).

However, the helm charts currently define the persistent volumes to exist in just 1 AZ instead of all AZs in the chosen region. This causes the pod to have node affinity, e.g. core will only ever run on node 1 since its PV exists only in us-west-1.

What's worse is that by default all the PVs get created in the first AZ only, so all pods with any PV will always run on node 1, causing node 1 to have HIGH memory pressure even when nodes 2 and 3 have plenty of memory capacity available. This has caused consistent pod eviction, usually for influx but even for core at times.

https://stackoverflow.com/questions/68545583/what-is-the-correct-pvc-configuration-in-aks-for-multi-zone-storage
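
For illustration, one possible shape of a multi-zone setup is a StorageClass backed by zone-redundant (ZRS) Azure disks with delayed volume binding. This is only a sketch, not what the charts currently ship: the class name `managed-zrs` is hypothetical, it assumes the Azure Disk CSI driver is installed in the cluster, and ZRS disk SKUs are only available in some regions.

```yaml
# Sketch only: a StorageClass for zone-redundant Azure disks.
# "managed-zrs" is a hypothetical name, not one used by the charts.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-zrs
provisioner: disk.csi.azure.com        # Azure Disk CSI driver (assumed to be installed)
parameters:
  skuName: StandardSSD_ZRS             # zone-redundant SKU; availability varies by region
reclaimPolicy: Delete
# Delay provisioning until a pod is scheduled, so the volume is created
# where the scheduler actually places the pod.
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

A PVC would then opt in by setting `storageClassName: managed-zrs`.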

w-le commented 1 year ago

Let me know what your thoughts are on this @viv-4

viv-4 commented 7 months ago

This is actually a temporary state that occurs when a pod with a volume is scheduled to a different node. It is resolved by the hosting provider, in the case of Azure usually within 1-5 minutes. The issues with memory pressure on nodes running core and/or influx have since been resolved with resource limit definitions.
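
As a rough illustration of what a resource limit definition looks like on a container spec (the numbers below are placeholders, not the values actually used in the charts):

```yaml
# Sketch only: fragment of a container spec; all values are illustrative assumptions.
resources:
  requests:
    memory: "512Mi"   # amount the scheduler reserves when placing the pod
    cpu: "250m"
  limits:
    memory: "1Gi"     # the container is OOM-killed if it exceeds this
```

With memory requests set, the scheduler accounts for each pod's reservation when choosing a node, and with limits set a single pod cannot grow without bound and push its node into memory pressure.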

place-labs / k8s-helm

Persistent volumes get created in 1 availability zone only, causing unwanted pod affinity #29