Closed by guyguy333 3 years ago
Hey @guyguy333 can you please provide logs from the cloud controller (and auto scaler)?
Hey @LKaemmerling, sure here are the logs:
The issue happened this morning, so I limited the logs to this morning only, but we can see route errors in them.
Please let me know if you need more details.
Hey @guyguy333,
I had time to dig into your logs and I think I found the reason. Our cache is basically what causes the issue. I will see if I can find a solution.
Great, thanks @LKaemmerling for taking the time to dig into the logs.
Hey @guyguy333,
we found a solution. The fix will be included in the next release.
Initial report: https://github.com/kubernetes/autoscaler/issues/4049
Which component are you using?: Autoscaler with Hetzner as cloud provider
What version of the component are you using?: Only available on master; no Docker image includes Hetzner yet.
Component version: 1.21
What k8s version are you using (kubectl version)?: 1.21.0 with k3s
What environment is this in?: Hetzner Cloud using CPX31 machines for master and node.
What did you expect to happen?: I expect the pod network to keep working after a scale down of nodes followed by a scale up.
What happened instead?: The pod network is broken. DNS resolution is broken. I'm not sure, but I think the pod routes associated with a node should be removed after scale down, to avoid a bad reconciliation when a scale up creates a new node with the same private IP. I can see that routes (10.4x.xx.xx) are not removed after scale down, and I think they are wrongly reused, resulting in a broken pod network.
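To make the suspicion above concrete, here is a minimal sketch of how leftover routes could be spotted: compare the routes present on the network with the (pod CIDR, private IP) pairs of the nodes that currently exist. The `Route` and `Node` types, the function name, and all addresses are hypothetical and only for illustration; this is not the cloud controller's or the autoscaler's actual code, and real data would have to come from the Hetzner Cloud API and the Kubernetes API.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class Route:
    destination: str  # pod CIDR the route points at (hypothetical field name)
    gateway: str      # private IP of the node that should receive the traffic


@dataclass(frozen=True)
class Node:
    name: str
    private_ip: str
    pod_cidr: str


def find_stale_routes(routes, nodes):
    """Return routes that match no current node's (pod CIDR, private IP) pair."""
    expected = {
        (ipaddress.ip_network(n.pod_cidr), ipaddress.ip_address(n.private_ip))
        for n in nodes
    }
    return [
        r for r in routes
        if (ipaddress.ip_network(r.destination),
            ipaddress.ip_address(r.gateway)) not in expected
    ]


# Made-up example: the old worker was removed, and a new worker got the same
# private IP but a different pod CIDR. The old route is still present.
routes = [
    Route("10.244.1.0/24", "10.0.0.3"),  # left over from the removed node
    Route("10.244.2.0/24", "10.0.0.3"),  # belongs to the new node
]
nodes = [Node("worker-new", "10.0.0.3", "10.244.2.0/24")]

stale = find_stale_routes(routes, nodes)
```

In this made-up scenario two routes share the same gateway, and only the first one is reported as stale. That is the kind of situation I suspect happens after a scale down followed by a scale up.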
How to reproduce it (as minimally and precisely as possible): It can be reproduced using https://registry.terraform.io/modules/cicdteam/k3s/hcloud/latest and then deploying the autoscaler. Scale up to the maximum pool size, then scale down. Scale up again: the HCloud CSI node will reboot and all pods scheduled on the new nodes will have network issues.
Anything else we need to know?: To fix the issue manually, I have to remove the nodes, remove all routes except the master node route, and restart the autoscaler. Things then work again until the next scale down / scale up. This is what makes me think routes should be removed on scale down.
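As a rough sketch of the route cleanup step in that manual workaround, the selection "all routes except the master node route" could look like the following. The `Route` type, the function name, and the addresses are hypothetical, and the sketch only computes which routes to delete; the deletion itself would still be done through the Hetzner Cloud console, CLI, or API.

```python
from dataclasses import dataclass
import ipaddress


@dataclass(frozen=True)
class Route:
    destination: str  # pod CIDR (hypothetical field name)
    gateway: str      # private IP of the node behind the route


def routes_to_remove(routes, master_private_ip):
    """Select every route whose gateway is not the master node's private IP."""
    master = ipaddress.ip_address(master_private_ip)
    return [r for r in routes if ipaddress.ip_address(r.gateway) != master]


# Made-up example: one route for the master, two for worker nodes.
all_routes = [
    Route("10.244.0.0/24", "10.0.0.2"),  # master node route, kept
    Route("10.244.1.0/24", "10.0.0.3"),
    Route("10.244.2.0/24", "10.0.0.4"),
]

to_delete = routes_to_remove(all_routes, "10.0.0.2")
```

This assumes the nodes themselves have already been removed, as in the workaround above, so every non-master route is safe to drop.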