Cluster Failover test in Active/Active multi-site setup:
The images reflect a high-availability setup where a failover event was triggered, causing the secondary cluster to take on additional traffic. The failover caused a temporary performance degradation, as evidenced by the increased response times and the reduced percentage of requests served under 250ms. This behavior is typical during a failover, and the system appears to have recovered after a brief period of instability of approximately 90 seconds.
Cluster which experienced failure:
Cluster which took on the additional load:
Results: kc_failover_cluster_failure_test.zip
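For reference, a site failure like the one above can be simulated by scaling down one site's workloads and watching the surviving cluster absorb the traffic. A minimal sketch, assuming separate kubeconfig contexts per site and StatefulSets named keycloak and infinispan in a keycloak namespace (these names are assumptions, not taken from the benchmark tasks):

#!/usr/bin/env bash
# Hedged sketch: simulate a site failure in the Active/Active setup by scaling
# the "failed" site's Keycloak and Infinispan StatefulSets to zero.
# Context, namespace and StatefulSet names are assumptions.
set -euo pipefail
FAILED_SITE_CONTEXT="site-a"   # kubeconfig context of the site we take down
NAMESPACE="keycloak"           # assumed namespace

kubectl --context "$FAILED_SITE_CONTEXT" -n "$NAMESPACE" scale statefulset/keycloak --replicas=0
kubectl --context "$FAILED_SITE_CONTEXT" -n "$NAMESPACE" scale statefulset/infinispan --replicas=0

# Record when the failover was triggered so it can be correlated with the
# latency spike on the surviving site's SLO dashboards.
date -u +"failover triggered at %Y-%m-%dT%H:%M:%SZ"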
Random Pod failure scenario:
When the pods were interrupted regularly under load, the response times were not impacted, and the typical recovery time for a pod was 18.63 seconds on average.
Console output log: kc_chaos_test.zip
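For context, the regular pod interruptions and the recovery-time measurement above can be approximated with a simple loop. This is only a sketch and not the implementation behind the chaos task; the keycloak namespace, the app=keycloak label and the 60s interval are assumptions:

#!/usr/bin/env bash
# Hedged sketch: kill a random Keycloak pod at a fixed interval and measure
# how long it takes until all pods report Ready again.
set -euo pipefail
NAMESPACE="keycloak"     # assumed namespace
LABEL="app=keycloak"     # assumed pod label

while true; do
  # Pick a random Keycloak pod and delete it.
  pod=$(kubectl -n "$NAMESPACE" get pods -l "$LABEL" -o name | shuf -n 1)
  echo "killing $pod"
  start=$(date +%s)
  kubectl -n "$NAMESPACE" delete "$pod" --wait=false

  # Rough recovery time: wait until every pod matching the label is Ready again.
  kubectl -n "$NAMESPACE" wait --for=condition=Ready pods -l "$LABEL" --timeout=300s
  echo "recovered in $(( $(date +%s) - start ))s"

  sleep 60
done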
The task chaos:kill-keycloak results were as expected; however, task chaos:kill-infinispan was far more disruptive than anticipated. Below are the screenshots of the SLO dashboards during both runs, and attached is the console log of the test execution. I will try to reproduce these tomorrow and see if the same happens again.
Thank you for sharing the graphs. It looks like the disruption and increased latencies happened for a short period of time (about 1 minute?) and then recovered.
It would be good to see the errors in the Keycloak log and the Infinispan logs.
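In case it helps with the reproduction, here is a minimal sketch for pulling those logs around the disruption window; the namespace and the keycloak-0 pod name are assumptions (infinispan-0 matches the probe output further down):

# Hedged sketch: collect error/warning lines from both servers for the last 15 minutes.
NAMESPACE="keycloak"   # assumed namespace

kubectl -n "$NAMESPACE" logs keycloak-0 --since=15m | grep -E "ERROR|WARN" > keycloak-0-errors.log
kubectl -n "$NAMESPACE" logs infinispan-0 --since=15m | grep -E "ERROR|WARN|SUSPECT|new view" > infinispan-0-errors.log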
Thought about causes: maybe it takes a moment for Infinispan to detect the failure (JGroups) and then to form a new view. Not sure what default timeouts are defined there. Maybe @pruivo and @ryanemerson can give an insight into the "expected" time the disturbance would last when a member leaves unexpectedly.
When we see such results from killing Infinispan pods, can you also test a regular rolling restart of the Infinispan pods to check whether we see the same behavior there? A rolling graceful restart might happen more often, and a user would expect it to be handled more gracefully than with high latency spikes.
1 min is not too bad. Looks like we are running with the default values: heartbeat every 8 seconds with 40s timeout. So, 48s in the worst possible timing.
[1000820000@infinispan-0 infinispan]$ ./bin/probe.sh FD_ALL3
#1 (555 bytes):
infinispan-0-41072 [ip=10.130.0.20:7800, 3 mbr(s), cluster=ISPN, version=5.3.10.Final (Zoncolan) (java 21.0.4+7-LTS)]
FD_ALL3={after_creation_hook=null, ergonomics=true, has_suspected_mbrs=false, heartbeat_sender_running=true, id=70, interval=8s, level=INFO, local_addr=infinispan-0-41072, members=(3) infinispan-0-41072,infinispan-1-40746,infinispan-2-41954, num_bits=5, num_heartbeats_received=2,705, num_heartbeats_sent=44, num_suspect_events=0, policies=n/a, running=true, stats=true, suspected_members=[], timeout=40s, timeout_checker_running=true}
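A back-of-the-envelope sketch of where the 48s above comes from, using the defaults visible in the probe output (an approximation, not an exact model of FD_ALL3):

# A member can die right after sending its last heartbeat, so detection can take
# up to the suspect timeout plus roughly one heartbeat interval.
interval=8    # FD_ALL3 interval, seconds
timeout=40    # FD_ALL3 timeout, seconds
echo "worst-case failure detection: $((timeout + interval))s"   # -> 48s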
@kami619 it would be nice to have the logs from one of the Keycloak nodes. The Hot Rod client timeouts should be smaller and I don't expect Gatling to observe 30s response time. I assume this runs with persistent user sessions enabled.
@pruivo sure thing, I am going to reproduce this now and add all the logs and other information that is requested.
We verified these interruptions with an updated Grafana dashboard (https://github.com/keycloak/keycloak-benchmark/commit/e46bab1bace5ffc28cf4b42a4b346d39eae8a5d8), and the outage times looked much narrower than the roughly 90 seconds observed before. The logs from both Keycloak and Infinispan don't indicate any critical failures or bottlenecks, so these failure tests now result in acceptable outcomes.