Closed: simonostendorf closed this issue 3 months ago
@simonostendorf please elaborate a bit more on the use case in which you encountered the error. What kind of cluster did you create, and how many control planes?
Do you have logs/events/conditions where the error is visible?
You propose to extend the CRD HetznerCluster, is that correct?
You propose to extend the CRD HetznerCluster, is that correct?
Yes.
please elaborate a bit more on the use case in which you encountered the error. What kind of cluster did you create, and how many control planes?
I didn't get any errors as far as I can remember, but I heard that other providers (e.g. OpenStack) have problems during cluster updates where the API is unavailable because the LB adds the nodes before the API server is ready. To prevent errors like this, and to use all the features provided by the hcloud LB, I think this should be added to the HetznerCluster, as it is a better check than TCP. I usually use 3 hcloud nodes as the control plane.
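To make the difference concrete, here is a minimal Go sketch (host, port and the /readyz path are just placeholders): a plain TCP dial can already succeed while an HTTPS probe of the kube-apiserver readiness endpoint still fails, and that is exactly the window in which a TCP-only LB health check routes traffic to a node that cannot serve requests yet.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// Example control-plane address; host and port are placeholders.
	const target = "10.0.0.3:6443"

	// TCP check: succeeds as soon as something listens on the port,
	// even if the kube-apiserver cannot serve requests yet.
	if conn, err := net.DialTimeout("tcp", target, 10*time.Second); err == nil {
		conn.Close()
		fmt.Println("tcp check: healthy")
	} else {
		fmt.Println("tcp check: unhealthy:", err)
	}

	// HTTP(S) check: only succeeds once the readiness endpoint answers 200.
	// InsecureSkipVerify is used only to keep this sketch short.
	client := &http.Client{
		Timeout:   10 * time.Second,
		Transport: &http.Transport{TLSClientConfig: &tls.Config{InsecureSkipVerify: true}},
	}
	resp, err := client.Get("https://" + target + "/readyz")
	if err != nil {
		fmt.Println("http check: unhealthy:", err)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		fmt.Println("http check: healthy")
	} else {
		fmt.Println("http check: unhealthy, status", resp.StatusCode)
	}
}
```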
I heard that other providers (e.g. OpenStack) have problems during cluster updates where the API is unavailable because the LB adds the nodes before the API server is ready
@simonostendorf I hope that wouldn't be a problem here ;)
But what do you do if the API server becomes unhealthy later and the port is still reachable via TCP?
I assume the initial problem is solved; I didn't check the code, sorry :)
@simonostendorf I don't have all the details in mind, but overall the code is readable. Could you please have a look at the code?
If you don't understand the code, please tell us! The code should be readable; if comments are missing, we want to add them.
Is that OK for you?
@guettli what would be your answer to this question here?
But what do you do if the API server becomes unhealthy later and the port is still reachable via TCP?
If this is the case, you should usually have a handler in your stack that detects this and marks the node as unhealthy. Then you can configure MachineHealthChecks or a custom remediation strategy to fix the problem.
Usually the port will become unavailable as well, so the LB health check will notice and stop routing traffic to that node. This is therefore a very rare edge case: it only happens if the port is still reachable, and only for the time until your stack detects the problem; during that window it affects 1/3 or 1/5 of the connections for a short period.
When using Syself Autopilot, you would be affected on about 1/5 of the connections for roughly 3 minutes. I have actually never seen a hanging port on the kube-apiserver that was not detected by the load balancer health check, so in the real world I'd expect it to be lower than 30s. We also use a reboot strategy, so the node will probably be back up in less than a minute in most cases.
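As a rough illustration of the remediation side, a MachineHealthCheck built with the Cluster API v1beta1 Go types could look roughly like this; the cluster name, namespace and timeouts are placeholders, not values from this thread.

```go
package main

import (
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

// newControlPlaneMHC builds a MachineHealthCheck that marks control-plane
// machines unhealthy once their Node stays NotReady or Unknown for too long,
// so the configured remediation (e.g. reboot or machine replacement) can act.
// All names and durations are illustrative.
func newControlPlaneMHC() *clusterv1.MachineHealthCheck {
	return &clusterv1.MachineHealthCheck{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "my-cluster-control-plane-unhealthy-5m",
			Namespace: "default",
		},
		Spec: clusterv1.MachineHealthCheckSpec{
			ClusterName: "my-cluster",
			Selector: metav1.LabelSelector{
				// Standard Cluster API label for control-plane machines.
				MatchLabels: map[string]string{"cluster.x-k8s.io/control-plane": ""},
			},
			UnhealthyConditions: []clusterv1.UnhealthyCondition{
				{Type: corev1.NodeReady, Status: corev1.ConditionUnknown, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
				{Type: corev1.NodeReady, Status: corev1.ConditionFalse, Timeout: metav1.Duration{Duration: 5 * time.Minute}},
			},
		},
	}
}

func main() { _ = newControlPlaneMHC() }
```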
@simonostendorf we have not heard from you for some weeks. Were you able to solve your issue? Please tell us!
I will close the issue now. Please re-open it or create a new issue, if something is not working or not easy to understand - thank you!
Sorry for the late reply. I haven't had the time to test the LBaaS failure behavior yet. I will reopen this issue if I run into any problems with it.
@guettli
I tested the feature now: the node is added to the LB after it is marked "ready", or if it is the first node (this is what my test showed; I haven't checked whether this is intended and correct).
This works for adding the node to the LB.
But I could not figure out what happens if the API server of, say, node-3 goes down from time to time. An HTTP health check would catch this and remove the node from load balancing. The caph controller does not seem to do this. So supporting HTTP health checks may still be a good idea, but it is not as important as I first thought.
Your observations are correct, @simonostendorf. We do have a check like this in the code, so you get exactly the behavior you describe: the first node has to be added, and after that we wait until that node is actually reachable before adding further ones. No comment on the health check from my side, though :sweat_smile:
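Not the actual caph code, just a small sketch of the gating rule described above: the first control plane is always registered as an LB target, every further one only once the API server is already reachable through the LB.

```go
package main

import "fmt"

// shouldAddTargetToLB is a hypothetical illustration of the behavior
// described above, not the real caph implementation: the very first
// control plane must be added unconditionally (otherwise the API server
// can never become reachable through the LB); every later one is added
// only once the API server already answers through the load balancer.
func shouldAddTargetToLB(existingTargets int, apiServerReachable bool) bool {
	if existingTargets == 0 {
		return true
	}
	return apiServerReachable
}

func main() {
	fmt.Println(shouldAddTargetToLB(0, false)) // first node: true
	fmt.Println(shouldAddTargetToLB(2, false)) // further node, API not reachable yet: false
	fmt.Println(shouldAddTargetToLB(2, true))  // further node, API reachable: true
}
```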
@simonostendorf
But I could not figure out what happens if the API server of, say, node-3 goes down from time to time. An HTTP health check would catch this and remove the node from load balancing.
I am unsure which component is responsible for that. Maybe this is already handled at a higher level (Cluster API, Kubernetes). Can you please check how other Cluster API infrastructure providers handle this?
I don't think that is an infrastructure provider's responsibility. I just wanted to clarify why supporting HTTP health checks is still a good idea, but not as necessary as initially thought.
/kind feature
Describe the solution you'd like: I would like to use an HTTP health check for the API LB. This prevents the health check from reporting healthy before the API server itself is ready; the port is opened before the API server is ready, so a TCP health check turns healthy too early.
Anything else you would like to add: Maybe this could be added as an optional field, where omitted values fall back to the Hetzner defaults (tcp, 15s interval, 10s timeout, 3 retries).
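As a purely illustrative sketch (field names, kubebuilder markers and placement are assumptions, not the actual caph API), the optional health-check configuration could look roughly like this in the Go API types:

```go
package v1beta1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// LoadBalancerHealthCheck is a hypothetical, optional health-check
// configuration for the API load balancer. If omitted, the Hetzner
// defaults apply (tcp, 15s interval, 10s timeout, 3 retries).
type LoadBalancerHealthCheck struct {
	// Protocol of the health check.
	// +kubebuilder:validation:Enum=tcp;http;https
	// +kubebuilder:default=tcp
	// +optional
	Protocol string `json:"protocol,omitempty"`

	// Interval between two health checks (Hetzner default: 15s).
	// +optional
	Interval *metav1.Duration `json:"interval,omitempty"`

	// Timeout of a single health check (Hetzner default: 10s).
	// +optional
	Timeout *metav1.Duration `json:"timeout,omitempty"`

	// Retries before a target is considered unhealthy (Hetzner default: 3).
	// +optional
	Retries *int32 `json:"retries,omitempty"`

	// HTTPPath is the path probed for http/https checks,
	// e.g. /readyz on the kube-apiserver.
	// +optional
	HTTPPath *string `json:"httpPath,omitempty"`
}
```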