hetznercloud / hcloud-cloud-controller-manager

Kubernetes cloud-controller-manager for Hetzner Cloud
Apache License 2.0

Missing route cleaning with autoscaler on scale down ? #231

Closed guyguy333 closed 3 years ago

guyguy333 commented 3 years ago

Initial report: https://github.com/kubernetes/autoscaler/issues/4049

Which component are you using?: Autoscaler with Hetzner as cloud provider

What version of the component are you using?: Only available on master; no Docker image includes Hetzner support yet.

Component version: 1.21

What k8s version are you using (kubectl version)?: 1.21.0 with k3s

kubectl version Output
$ kubectl version
v1.21.6+k3s1

What environment is this in?: Hetzner Cloud using CPX31 machines for master and node.

What did you expect to happen?: I expect the pod network to keep working after a scale-down of nodes followed by a scale-up.

What happened instead?: The pod network is broken, and DNS resolution fails. I'm not sure, but I think the pod routes associated with a node should be removed after scale-down, to avoid a bad reconciliation after scale-up when a new node gets the same private IP. I can see that the routes (10.4x.xx.xx) are not removed after scale-down, and I think they are wrongly reused, which breaks the pod network.

How to reproduce it (as minimally and precisely as possible): It can be reproduced using https://registry.terraform.io/modules/cicdteam/k3s/hcloud/latest and then deploying the autoscaler. Scale up to the maximum pool size, then scale down. Scale up again: the HCloud CSI node will reboot, and all pods scheduled on the new nodes will have network issues.

Anything else we need to know?: To fix the issue manually, I have to remove the nodes, remove all routes except the master node's route, and restart the autoscaler. Things then work again until the next scale-down / scale-up. This is what leads me to think the routes should be removed on scale-down.
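
To make the suspected failure mode concrete, here is a minimal sketch of how leftover routes could be detected when a new node reuses an old private IP. It is only an illustration of the manual check described above, not the cloud controller's implementation and not the eventual fix. The `Route` type, the `find_stale_routes` helper, and all IP addresses and pod CIDRs are hypothetical and chosen for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical network route: pod CIDR -> node private IP."""
    destination: str  # pod CIDR, e.g. "10.42.1.0/24" (assumed value)
    gateway: str      # private IP of the node serving that pod CIDR


def find_stale_routes(routes, nodes):
    """Return routes that do not match any current node.

    `nodes` maps each live node's private IP to the pod CIDR currently
    assigned to it. A route is considered stale when its
    (destination, gateway) pair is not one of those assignments. This
    also catches a route left behind by a deleted node whose private IP
    was later reused by a new node with a different pod CIDR.
    """
    expected = {
        (ipaddress.ip_network(cidr), ipaddress.ip_address(ip))
        for ip, cidr in nodes.items()
    }
    stale = []
    for route in routes:
        key = (
            ipaddress.ip_network(route.destination),
            ipaddress.ip_address(route.gateway),
        )
        if key not in expected:
            stale.append(route)
    return stale


# Hypothetical state after a scale-down followed by a scale-up:
# the node at 10.0.0.4 is gone, and 10.0.0.3 was reused by a new node
# that received a different pod CIDR.
routes = [
    Route("10.42.0.0/24", "10.0.0.2"),  # master node, still valid
    Route("10.42.1.0/24", "10.0.0.3"),  # left over from the old node
    Route("10.42.2.0/24", "10.0.0.4"),  # node no longer exists
]
nodes = {
    "10.0.0.2": "10.42.0.0/24",
    "10.0.0.3": "10.42.3.0/24",
}

for route in find_stale_routes(routes, nodes):
    print(f"stale route: {route.destination} via {route.gateway}")
```

Comparing the full (destination, gateway) pair rather than the gateway alone matters here: a check that only asks "is this gateway IP still in use?" would keep the leftover route for 10.0.0.3, which is exactly the reused-IP case suspected in this report.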

LKaemmerling commented 3 years ago

Hey @guyguy333 can you please provide logs from the cloud controller (and auto scaler)?

guyguy333 commented 3 years ago

Hey @LKaemmerling, sure here are the logs:

ccm.log

autoscaler.log

The issue happened this morning, so I limited the logs to this morning only, but the route errors are visible in them.

Please let me know if you need more details.

LKaemmerling commented 3 years ago

Hey @guyguy333,

I had time to dig into your logs and I think I found the reason: our cache is what causes the issue. I will see if I can find a solution.

guyguy333 commented 3 years ago

Great, thanks @LKaemmerling for taking the time to dig into the logs.

LKaemmerling commented 3 years ago

Hey @guyguy333,

we found a solution. The fix will be included in the next release.