Cluster Failover test in Active/Active multi-site setup:
The images reflect a high-availability setup where a failover event was triggered, causing the secondary cluster to take on additional traffic. The failover caused a temporary performance degradation, as evidenced by the increased response times and the reduced percentage of requests served under 250ms. This behavior is typical during a failover, and the system appears to have recovered after a brief period of instability of approximately 90 seconds.
Cluster which experienced failure:
Cluster which took on the additional load:
Results: kc_failover_cluster_failure_test.zip
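For reference, a site failure like the one above can be simulated by scaling down one site's workloads and watching the surviving cluster absorb the traffic. A minimal sketch, assuming separate kubeconfig contexts per site and StatefulSets named keycloak and infinispan in a keycloak namespace (these names are assumptions, not taken from the benchmark tasks):

#!/usr/bin/env bash
# Hedged sketch: simulate a site failure in the Active/Active setup by scaling
# the "failed" site's Keycloak and Infinispan StatefulSets to zero.
# Context, namespace and StatefulSet names are assumptions.
set -euo pipefail
FAILED_SITE_CONTEXT="site-a"   # kubeconfig context of the site we take down
NAMESPACE="keycloak"           # assumed namespace

kubectl --context "$FAILED_SITE_CONTEXT" -n "$NAMESPACE" scale statefulset/keycloak --replicas=0
kubectl --context "$FAILED_SITE_CONTEXT" -n "$NAMESPACE" scale statefulset/infinispan --replicas=0

# Record when the failover was triggered so it can be correlated with the
# latency spike on the surviving site's SLO dashboards.
date -u +"failover triggered at %Y-%m-%dT%H:%M:%SZ"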
Random Pod failure scenario:
When the pods were interrupted regularly under load, the response times were not impacted, and the typical recovery time for a pod was 18.63 seconds on average.
Console output log: kc_chaos_test.zip
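For context, the regular pod interruptions and the recovery-time measurement above can be approximated with a simple loop. This is only a sketch and not the implementation behind the chaos task; the keycloak namespace, the app=keycloak label and the 60s interval are assumptions:

#!/usr/bin/env bash
# Hedged sketch: kill a random Keycloak pod at a fixed interval and measure
# how long it takes until all pods report Ready again.
set -euo pipefail
NAMESPACE="keycloak"     # assumed namespace
LABEL="app=keycloak"     # assumed pod label

while true; do
  # Pick a random Keycloak pod and delete it.
  pod=$(kubectl -n "$NAMESPACE" get pods -l "$LABEL" -o name | shuf -n 1)
  echo "killing $pod"
  start=$(date +%s)
  kubectl -n "$NAMESPACE" delete "$pod" --wait=false

  # Rough recovery time: wait until every pod matching the label is Ready again.
  kubectl -n "$NAMESPACE" wait --for=condition=Ready pods -l "$LABEL" --timeout=300s
  echo "recovered in $(( $(date +%s) - start ))s"

  sleep 60
done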
The task chaos:kill-keycloak results were as expected; however, task chaos:kill-infinispan was far more disruptive than anticipated. Below are the screenshots of the SLO dashboards during both runs, and attached is the console log of the test execution. I will try to reproduce these tomorrow and see if the same happens again.
Thank you for sharing the graphs. It looks like the disruption and increased latencies happened for a short period of time (about 1 minute?) and then recovered.
It would be good to see the errors in the Keycloak log and the Infinispan logs.
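In case it helps with the reproduction, here is a minimal sketch for pulling those logs around the disruption window; the namespace and the keycloak-0 pod name are assumptions (infinispan-0 matches the probe output further down):

# Hedged sketch: collect error/warning lines from both servers for the last 15 minutes.
NAMESPACE="keycloak"   # assumed namespace

kubectl -n "$NAMESPACE" logs keycloak-0 --since=15m | grep -E "ERROR|WARN" > keycloak-0-errors.log
kubectl -n "$NAMESPACE" logs infinispan-0 --since=15m | grep -E "ERROR|WARN|SUSPECT|new view" > infinispan-0-errors.log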
Thought about causes: maybe it takes a moment for Infinispan to detect the failure (JGroups) and then to form a new view. Not sure what default timeouts are defined there. Maybe @pruivo and @ryanemerson can give an insight into the "expected" time the disturbance would last when a member leaves unexpectedly.
When we see such results from killing Infinispan pods, can you also test a regular rolling restart of the Infinispan pods to check whether we see the same behavior there? A rolling graceful restart might happen more often, and a user would expect it to be handled more gracefully than with high latency spikes.
1 min is not too bad. Looks like we are running with the default values: heartbeat every 8 seconds with 40s timeout. So, 48s in the worst possible timing.
[1000820000@infinispan-0 infinispan]$ ./bin/probe.sh FD_ALL3
#1 (555 bytes):
infinispan-0-41072 [ip=10.130.0.20:7800, 3 mbr(s), cluster=ISPN, version=5.3.10.Final (Zoncolan) (java 21.0.4+7-LTS)]
FD_ALL3={after_creation_hook=null, ergonomics=true, has_suspected_mbrs=false, heartbeat_sender_running=true, id=70, interval=8s, level=INFO, local_addr=infinispan-0-41072, members=(3) infinispan-0-41072,infinispan-1-40746,infinispan-2-41954, num_bits=5, num_heartbeats_received=2,705, num_heartbeats_sent=44, num_suspect_events=0, policies=n/a, running=true, stats=true, suspected_members=[], timeout=40s, timeout_checker_running=true}
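A back-of-the-envelope sketch of where the 48s above comes from, using the defaults visible in the probe output (an approximation, not an exact model of FD_ALL3):

# A member can die right after sending its last heartbeat, so detection can take
# up to the suspect timeout plus roughly one heartbeat interval.
interval=8    # FD_ALL3 interval, seconds
timeout=40    # FD_ALL3 timeout, seconds
echo "worst-case failure detection: $((timeout + interval))s"   # -> 48s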
@kami619 it would be nice to have the logs from one of the Keycloak nodes. The Hot Rod client timeouts should be smaller and I don't expect Gatling to observe 30s response time. I assume this runs with persistent user sessions enabled.
@pruivo sure thing, I am going to reproduce this now and add all the logs and other information that is requested.
We verified these interruptions with an updated Grafana dashboard (https://github.com/keycloak/keycloak-benchmark/commit/e46bab1bace5ffc28cf4b42a4b346d39eae8a5d8), and the outage times looked much narrower than the roughly 90 seconds observed before. The logs from both Keycloak and Infinispan don't indicate any critical failures or bottlenecks, so these failure tests now result in acceptable outcomes.