maaft closed this issue 1 month ago
The reason might have been a combination of "lock-ttl": "30m" in the kured config and a PDB that stops a node from draining.
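For posterity, a minimal sketch of that suspected combination, assuming kured's standard DaemonSet layout; the image tag, PDB name, and labels are placeholders, not taken from this cluster. The plausible failure mode: the drain stalls on the PDB, the 30-minute TTL releases the reboot lock anyway, and the node is left cordoned with nothing holding the lock to finish or revert the operation.

```yaml
# Sketch of the suspected kured setting (container args fragment from the
# kured DaemonSet; image tag is a placeholder):
containers:
  - name: kured
    image: ghcr.io/kubereboot/kured:1.16.0
    command:
      - /usr/bin/kured
      - --lock-ttl=30m   # lock auto-expires even if the drain never finished
---
# A PDB like this blocks eviction whenever the protected app is down to its
# last allowed pod, so the drain can stall indefinitely:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb        # placeholder name
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: example         # placeholder label
```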
Closing for now.
For posterity. I was playing about and encountered this with a single node in a nodepool.
The update pods can't be scheduled on the control plane nodes, so the sole node gets cordoned and the system-upgrade namespace pods get stuck waiting to start, since they want to run on the cordoned node. A hedged workaround sketch follows below.
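If that reading is right, one possible workaround is to give the system-upgrade-controller Deployment somewhere else to run by letting it tolerate the control-plane taints. This is a sketch under assumptions: "system-upgrade-controller" in namespace "system-upgrade" is the usual deployment name, and the taint keys below are the common k3s/kubeadm ones, which may not match this cluster. It is a merge-patch fragment, not a full manifest.

```yaml
# Hedged sketch: strategic-merge patch for the system-upgrade-controller
# Deployment, applied with e.g.
#   kubectl -n system-upgrade patch deployment system-upgrade-controller \
#     --patch-file tolerations-patch.yaml
# Which taints the control-plane nodes actually carry is an assumption.
spec:
  template:
    spec:
      tolerations:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        - key: CriticalAddonsOnly
          operator: Exists
```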
Description
My nodes are randomly cordoned.
Logs from kured show that some nodes are successfully uncordoned, but others are not and stay unschedulable.
Example for a node that stays cordoned:
Example for a node that successfully gets uncordoned:
The main difference between these logs is that one has the line
time="2024-10-05T01:03:01Z" level=info msg="Uncordoning node staging-system-3-eeb"
while the other does not. Otherwise, the logs are identical. Any ideas?
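Not an answer to the root cause, but if a node was simply left behind, kubectl uncordon staging-system-3-eeb clears it manually. Cordoning is nothing more than a field on the Node object, sketched below to illustrate what "stays cordoned" means at the API level:

```yaml
# A cordoned node just has spec.unschedulable set; kubectl uncordon
# resets it. Shown only as illustration, not something to apply as-is.
apiVersion: v1
kind: Node
metadata:
  name: staging-system-3-eeb
spec:
  unschedulable: false
```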
Kube.tf file
Screenshots
No response
Platform
Linux