keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

After a restart of Keycloak, the Route 53 health check might have failed #799

Closed ahus1 closed 4 months ago

ahus1 commented 5 months ago

Describe the bug

As part of #776, we added a Keycloak restart to reset Keycloak's memory usage. When doing this, we forgot to re-arm the Route 53 health check afterwards: the AWS Lambda might have disabled the primary site while the restart was in progress.

Version

main

Expected behavior

After the restart, the traffic should go to the primary site.

Actual behavior

After the restart, depending on how fast the restart was, the traffic might go to the secondary site. Due to this, all metrics retrieved from the primary site would be irrelevant.

How to Reproduce?

This might happen only sometimes. First noticed in https://github.com/keycloak/keycloak-benchmark/actions/runs/8969995290/job/24635509387, which reported no xsite messages sent from the first site.

Anything else?

No response

ahus1 commented 5 months ago

@kami619 - could you please have a look? As a workaround, I've disabled the restart in the nightly run.

It might not have happened on every run, but it did on at least the one listed above.

cc: @andyuk1986

ahus1 commented 5 months ago

An alternative to the hard restart would be a rolling update - then the health check shouldn't trigger.

andyuk1986 commented 5 months ago

@ahus1 yeah, I have seen this once as well, but then couldn't reproduce it. I thought that perhaps this happens when we run the benchmark several times and the data is already cached, which is why the metrics don't show xsite requests in the time range, but as I said, I couldn't reproduce this.

kami619 commented 5 months ago

@ahus1 sure thing, let me see if I can add a task to re-arm the Route 53 health check after we do the restart. It would also mean we need to validate which cluster is the primary and route the traffic to it as needed.
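
A minimal sketch of the "validate before running the benchmark" half of such a task, in Python. It does not re-arm anything and is not the repository's implementation. `is_healthy`, `wait_until_healthy` and `fetch_statuses` are hypothetical names; `fetch_statuses` stands in for whatever call returns the per-checker status strings (for example, values taken from Route 53's GetHealthCheckStatus response), and the assumption that a healthy report starts with "Success" should be verified against real output.

```python
import time
from typing import Callable, Iterable


def is_healthy(statuses: Iterable[str], min_ratio: float = 1.0) -> bool:
    """Return True if at least min_ratio of the checker reports look healthy.

    Assumption: a healthy report is a string starting with "Success".
    An empty list of reports is treated as unhealthy.
    """
    reports = list(statuses)
    if not reports:
        return False
    ok = sum(1 for report in reports if report.startswith("Success"))
    return ok / len(reports) >= min_ratio


def wait_until_healthy(
    fetch_statuses: Callable[[], Iterable[str]],
    attempts: int = 30,
    delay_seconds: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll fetch_statuses until the site is reported healthy or attempts run out."""
    for attempt in range(attempts):
        if is_healthy(fetch_statuses()):
            return True
        # Do not sleep after the final attempt.
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

A restart step could call `wait_until_healthy(...)` for the primary site and fail the run early when it returns `False`, instead of collecting metrics from a site that is not receiving traffic.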

kami619 commented 5 months ago

@ahus1

https://github.com/kami619/keycloak-benchmark/actions/runs/8973649052/job/24644308418#step:27:111 this failed again for the same reason, even though both sites are working and the Route 53 health checks are reporting healthy. It succeeded on a prior attempt, so I'm not sure if it's directly tied to the xsite health checks.

Screenshot 2024-05-06 at 14 24 13

ahus1 commented 5 months ago

It failed with the same error message, but at a different step: this time it failed after the test "client credential grants" completed. Giving it another thought, this is expected: during "client credential grants", there are no xsite messages expected. I'm pushing a workaround for this: 9a817d71060203876b0a3338ff7826b4e40d6002
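
The idea behind such a workaround can be sketched in Python as a check that only asserts on xsite traffic for scenarios expected to produce it. This is an illustration only and does not reflect the contents of the commit above; `SCENARIOS_WITHOUT_XSITE` and `check_xsite_messages` are hypothetical names, and the set of exempt scenarios is an assumption based on the comment.

```python
# Hypothetical: scenarios for which no xsite messages are expected.
SCENARIOS_WITHOUT_XSITE = {"client credential grants"}


def check_xsite_messages(scenario: str, messages_sent: int) -> None:
    """Raise AssertionError if a scenario that should produce xsite traffic produced none."""
    if scenario in SCENARIOS_WITHOUT_XSITE:
        # Nothing to verify: zero messages is the expected outcome here.
        return
    if messages_sent <= 0:
        raise AssertionError(
            f"no xsite messages sent from the first site during {scenario!r}"
        )
```

With this shape, a zero count after "client credential grants" passes, while a zero count after any other scenario still fails the run.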