Usefulness for load balancing purpose?

jichen-amplify commented 3 years ago

Hi, we have a load balancer sitting in front of our Keycloak cluster. As a new Keycloak instance is just starting and trying to join the cluster, we would like the load balancer to start forwarding requests to this new Keycloak instance only when the new Keycloak instance is healthy and has successfully joined the cluster. Can these health check endpoints be used by the load balancer to detect whether the new Keycloak instance is ready to receive any live requests?

polarizeme commented 3 years ago

> Hi, we have a load balancer sitting in front of our Keycloak cluster. As a new Keycloak instance is just starting and trying to join the cluster, we would like the load balancer to start forwarding requests to this new Keycloak instance only when the new Keycloak instance is healthy and has successfully joined the cluster. Can these health check endpoints be used by the load balancer to detect whether the new Keycloak instance is ready to receive any live requests?

My understanding is "absolutely" and it's exactly what I'm about to test in DEV for my team. You could set your LB to target the infinispan healthcheck endpoint specifically on your configured port and expect a 200. If an instance fails w/ a 503, it means that instance isn't yet recognized as part of the cluster. You could manually confirm in testing by curling the endpoint while it's failing a healthcheck and checking whether "numberOfNodes" matches what it should be and whether "nodeNames" lists all the necessary hosts.
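The manual check described above could be scripted roughly like this. It's only a sketch: the exact JSON layout of the infinispan health response isn't shown in this thread, so the helper just searches the decoded body for the "numberOfNodes" and "nodeNames" keys wherever they sit. The function names and the sample host names are made up for illustration.

```python
import json
import urllib.error
import urllib.request
from typing import Any, Optional


def find_key(payload: Any, key: str) -> Optional[Any]:
    """Depth-first search for `key` anywhere in a decoded JSON document."""
    if isinstance(payload, dict):
        if key in payload:
            return payload[key]
        children = list(payload.values())
    elif isinstance(payload, list):
        children = payload
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def cluster_looks_complete(payload: Any, expected_nodes: set) -> bool:
    """True if the reported node count and node names match what we expect."""
    count = find_key(payload, "numberOfNodes")
    names = find_key(payload, "nodeNames")
    if count is None or names is None:
        return False
    return count == len(expected_nodes) and set(names) == set(expected_nodes)


def fetch_health(url: str, timeout: float = 5.0):
    """Return (status_code, decoded_body); assumes the body is JSON."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, json.loads(resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # A failing check (e.g. 503) still carries a response body to inspect.
        return err.code, json.loads(err.read().decode("utf-8"))
```

Usage would be something like `status, body = fetch_health("http://<host>:<port>/<health path>")` followed by `cluster_looks_complete(body, {"kc-1", "kc-2"})`, with your own host, port, path, and node names filled in.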

Can't imagine it'll be today or maybe not even tomorrow, but I'll set a reminder to come back and update you once I've tested on our end. Cheers!

jichen-amplify commented 3 years ago

Thanks for responding to me.

We have actually tested this by spinning up a new instance while running a load test against our test Keycloak cluster. What seemed to happen was that the new instance caused the whole cluster to get into an inconsistent state, and all requests started failing. This could very well be due to how we set up our cluster. I posted the question so that we could rule out the possibility that this health check extension doesn't support this use case.

Please let us know your test result. Thanks again!

polarizeme commented 3 years ago

No worries! I've been down a monitoring hellpath for the past 24hr so it just felt nice to see someone in a similar situation haha.

How is your cluster set up? Is your LB's healthcheck trying to hit a shared cluster endpoint, or do you have it running a health check on all instances/containers in the cluster?

Ours is set up so that the LB is running the healthcheck on each instance in the cluster, and we've got some failure intervals in place so that it's not considered an unhealthy instance until 3 checks in a row have failed.
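For anyone unfamiliar with the "3 checks in a row" idea, here's a minimal sketch of the logic. It's a simplification: real load balancers also tend to have a separate healthy threshold before an instance is put back in rotation, which is left out here. The class name and fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureTracker:
    """Marks a target unhealthy only after N failed probes in a row."""

    unhealthy_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, status_code: int) -> bool:
        """Record one probe result; return True while the target counts as healthy."""
        if status_code == 200:
            # Any success resets the streak.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures < self.unhealthy_threshold


tracker = ConsecutiveFailureTracker()
history = [tracker.record(code) for code in (200, 503, 503, 200, 503, 503, 503)]
```

With that sequence, only the last probe flips the target to unhealthy, since the earlier failures were interrupted by a 200.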

Our hope is that we can just reconfigure our healthchecks to point to the infinispan cluster status endpoint for each instance. So a new instance coming into the cluster wouldn't trigger any scaling or replacement events unless it simply had big issues joining the cluster to begin with. Otherwise the expected behaviour would be to reach healthy state, at which point we know the cluster status endpoint for that instance is healthy and we can expect the instance is handling cluster traffic.

I'll be sure to get back to you once we've got our test cluster in place or we test on our DEV cluster. =]

jichen-amplify commented 3 years ago

We have the same approach as yours. We are currently running our cluster in AWS EC2 as an auto scaling group (ASG). We have a load balancer sitting in front of the ASG and it is configured to monitor the health of each instance in the group using a health check endpoint. The load balancer would only forward the live traffic to an instance if its health check endpoint returns a 200 status code (we also have multiple checks in a row for detecting failures).

polarizeme commented 3 years ago

@jichen-amplify apologies for the delay in getting back to you.

We've tested this in DEV for a couple weeks, everything was great, and I've rolled it into PROD.

Basically we did two things:

1. The load balancer healthchecks are set up to hit the infinispan health endpoint included in @thomasdarimont's module, specifically /auth/realms/master/health/check/infinispan-health, as that's the one tied to cluster state. The check is set to expect a 200 as usual. If an instance is unhealthy at this endpoint, we know it's no longer joined to the cluster (or at least unable to communicate with it).
2. We have a CloudWatch alarm set to email us via an SNS topic if the cluster ever reaches >0 UnHealthyHostCount.
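A rough sketch of what that alarm could look like if created with boto3. This assumes an Application Load Balancer (namespace `AWS/ApplicationELB`, where the metric is spelled `UnHealthyHostCount`); a Classic ELB would use a different namespace and dimensions. The alarm name, period, and the dimension/ARN values passed in are placeholders, not anything from our actual setup.

```python
def unhealthy_host_alarm_params(target_group_dim: str,
                                load_balancer_dim: str,
                                sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": "keycloak-unhealthy-hosts",  # placeholder name
        "Namespace": "AWS/ApplicationELB",        # assumption: ALB, not Classic ELB
        "MetricName": "UnHealthyHostCount",
        "Dimensions": [
            {"Name": "TargetGroup", "Value": target_group_dim},
            {"Name": "LoadBalancer", "Value": load_balancer_dim},
        ],
        "Statistic": "Maximum",
        "Period": 60,                             # seconds; placeholder
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # fires when count > 0
        "AlarmActions": [sns_topic_arn],          # SNS topic that sends the email
    }


def create_alarm(params: dict) -> None:
    """Create or update the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameter builder separate from the API call makes it easy to review or unit-test the alarm definition without touching AWS.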

There are a couple other things you could do here, too:

- If you're using infrastructure-as-code practices and some form of orchestration (CFN or Terraform), you can use a Launch Config update policy to make sure healthchecks are ignored while there are scale-in or scale-out events.
- If you're using any cloud-init, you can set your instance(s) to send a signal to your CFN stack when they're ready, and you can set your Launch Config creation and update policies for timeouts and the count of the aforementioned signal, etc. This will keep your stack from reaching a COMPLETE state before the cluster is actually healthy, and it'll keep your cluster from terminating seemingly unhealthy instances before they can finish coming up and configuring themselves, etc.

Hope this helps a bit & good luck!

thomasdarimont / keycloak-health-checks

Usefulness for load balancing purpose? #17