keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

A/P failure testing in SYNC mode #495

Closed: PavelVlha closed this issue 1 year ago

PavelVlha commented 1 year ago

Test scenarios:

- Manually switch from the active to the passive site under load
- Make the passive site fail

ryanemerson commented 1 year ago

The kc-failover.sh script, plus docs and associated changes required for the failover tests, was added as part of https://github.com/keycloak/keycloak-benchmark/pull/507

Deployment

AURORA_CLUSTER=ryan-aurora
AURORA_REGION=${REGION}
ROSA_CLUSTER_NAME_1=ryan-active
ROSA_CLUSTER_NAME_2=ryan-passive
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
KC_CLIENT_URL="client.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_1="primary.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_2="backup.${DOMAIN}"
CROSS_DC_IMAGE=quay.io/pruivo/infinispan-server:14.0-patched

KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_INSTANCES=6
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml
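
For reproducibility, the parameters above can be kept in an env file and exported before provisioning; a minimal sketch, where failover-test.env is a hypothetical file holding the assignments above and the rosa-cross-dc Taskfile in this repo is assumed to be the provisioning entry point:

set -a                    # auto-export every variable sourced below
source failover-test.env  # hypothetical file holding the assignments above
set +a
(cd provision/rosa-cross-dc && task)   # provisioning entry point assumed from the repo layout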

Keycloak dataset initialised with:

curl -v -k "https://${KC_CLIENT_URL}/realms/master/dataset/create-realms?realm-name=realm-0&count=1&threads-count=10&users-per-realm=10000"
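
Dataset creation runs asynchronously, so it is worth waiting for it to finish before starting the benchmark; a sketch polling the dataset provider's status endpoint (the matched response text is an assumption, check the JSON your server returns):

until curl -sk "https://${KC_CLIENT_URL}/realms/master/dataset/status" | grep -q 'No task in progress'; do
  sleep 10   # poll until realm/user creation completes
done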

Benchmark Setup

The ./ansible/benchmark.sh and kc-failover.sh scripts are executed concurrently. The latter is run manually on one of the EC2 nodes used by benchmark.sh so that the Route53 failover latency can be observed more accurately.

Environment:

cluster_size: 2
kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip

Benchmark cmd

DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
REGION=eu-west-1
KC_CLIENT_URL="https://client.${DOMAIN}"
./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=300 \
  --realm-name=realm-0 \
  --logout-percentage=0 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Initial Results

Disable Health Check

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="HEALTH_PROBE" FAILOVER_DELAY=120 ./bin/kc-failover.sh

Route53 Failover time: 63 seconds
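
For reference, the failover window can also be measured from the client side by timestamping when the Route53 answer for the client endpoint flips to the passive ingress; a minimal sketch (the one-second interval is arbitrary):

while true; do
  # log a UTC timestamp next to the current DNS answer for the client endpoint
  printf '%s %s\n' "$(date -u +%H:%M:%S)" "$(dig +short "client.${DOMAIN}" | tr '\n' ' ')"
  sleep 1
done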

Screenshot from 2023-08-29 15-23-33

Looking at the Infinispan logs, there appears to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and passive sites:

14:05:45,567 WARN  (jgroups-239,infinispan-0-40258) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=RemoveCommand{key=WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)], value=null, returnEntry=false, metadata=null, internalMetadata=null, flags=[], commandInvocationId=CommandInvocation:7e2b2d82-8f0c-42ce-8eed-bb3a0d057e98:1305642, valueMatcher=MATCH_ALWAYS, topologyId=-1}} org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-2-46837, see cause for remote stack trace
    at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:25)
    at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
    at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
    at org.infinispan.remoting.transport.impl.SingleTargetRequest.addResponse(SingleTargetRequest.java:75)
    at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:45)
    at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1578)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1480)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1685)
    at org.jgroups.JChannel.up(JChannel.java:733)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:921)
    at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:527)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:138)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
    at org.jgroups.protocols.pbcast.GMS.up(GMS.java:845)
    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
    at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1052)
    at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:794)
    at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:776)
    at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:425)
    at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:658)
    at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
    at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
    at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
    at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
    at org.jgroups.protocols.Discovery.up(Discovery.java:294)
    at org.jgroups.stack.Protocol.up(Protocol.java:314)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1178)
    at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:100)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)] and requestor CommandInvocation:infinispan-0-40258:832116. Lock is held by CommandInvocation:infinispan-2-46837:821621
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
    at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
    at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
    at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
    at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
    at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
    at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
    at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
    at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
    at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
    at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
    at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
    at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
    ... 26 more

Delete All Routes

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="ALL_ROUTES" FAILOVER_DELAY=120 ./bin/kc-failover.sh
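
Conceptually, ALL_ROUTES mode makes the active site unreachable at the ingress layer; a sketch of the equivalent manual step, assuming it amounts to removing every Route in the Keycloak namespace (kc-failover.sh remains the source of truth for what the mode actually does):

# both client traffic and the Route53 health probe start failing
oc -n "${PROJECT}" delete routes --all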

Route53 Failover time: 57 seconds

Screenshot from 2023-08-29 15-25-41

Detailed results and exported Grafana dashboards can be found on the Google Drive.

ryanemerson commented 1 year ago

Delete All Routes 2

The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported.

Route53 Failover time: 58 seconds

failover-all-routes-2.tar.gz

Screenshot from 2023-08-29 16-31-41

ryanemerson commented 1 year ago

Disable Health Check 2

Route53 Failover time: 72 seconds

failover-health-probe-2.tar.gz

Screenshot from 2023-08-29 16-44-11

ryanemerson commented 1 year ago

Disable Health Check 3

Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

As hoped, utilising --logout-percentage=100 has removed the build-up of active users towards the end of the benchmark.

In the Infinispan logs there were no lock acquisition exceptions like those seen in the first run, even though I ran the benchmark across two EC2 nodes as before.

Screenshot from 2023-08-30 16-30-45

failover-health-probe-3.tar.gz

ryanemerson commented 1 year ago

Active Site Gossip Router Fail

Utilising the following benchmark with two EC2 clients:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Kill the connection between the active and passive sites to simulate a connection disruption. The idea is to remove the Gossip Router public route (we probably also need to kill/stop the Infinispan operator to make sure it does not recreate the route).

Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive sites. Removing the GossipRouter Route is not sufficient, as the TCP connections established between the two sites remain open until the GR pod is killed.
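
A sketch of that manual step; the label selector below is an assumption, so verify it against the running pods first:

oc -n "${PROJECT}" get pods --show-labels                     # confirm the Gossip Router pod label
oc -n "${PROJECT}" delete pods -l app=infinispan-router-pod   # assumed label; severs the established cross-site TCP connections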

Observe the active site: how long does it take to detect that the passive site is down, and does the active site work correctly without the passive site?

Requests to the active site fail for ~45 seconds, after which the passive site is suspected and the active site continues to function.

Screenshot from 2023-08-31 11-23-14

Screenshot from 2023-08-31 11-23-30

~~Grafana hadn't installed correctly on my active site, so unfortunately the dashboards are missing from the results. I can repeat the experiment as required though.~~

I have dashboards for the passive site, just not the active. Uploaded tar has been updated.

gossip-router-fail-active.tar.gz

ryanemerson commented 1 year ago

Passive site cluster fail

All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down.

As expected, this produces results comparable to the Active Gossip Router failure scenario, as xsite operations between the active and passive sites fail until the passive site is suspected by JGroups.

Failover cmd:

DOMAIN=ryan-active-ryan-passive-mzeymtmk.keycloak-benchmark.com PROJECT=remerson-keycloak FAILOVER_MODE=CLUSTER_FAIL FAILOVER_DELAY=240 ./bin/kc-failover.sh
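
Conceptually, CLUSTER_FAIL takes the whole passive site down at once; a sketch of the equivalent manual step, assuming all pods in the namespace are force-deleted (the script defines the real behaviour):

# kill the Keycloak and Infinispan pods immediately, without graceful shutdown
oc -n "${PROJECT}" delete pods --all --grace-period=0 --force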

Screenshot from 2023-08-31 14-07-38

Screenshot from 2023-08-31 14-07-49

Screenshot from 2023-08-31 14-08-00

passive-site-cluster-fail.tar.gz

ahus1 commented 1 year ago

Discussion:

(Reducing the downtime from the current ~40 seconds to ~15 seconds would require a change in the Infinispan Operator.)

Alternative: go to ASYNC mode, with the tradeoff that not all updates on the active site are available on the passive site. In ASYNC mode, a resync is necessary when the passive site is completely restarted. Keys for the queued updates still to be sent are kept in memory.

Check with @stianst to verify if this is acceptable.

ahus1 commented 1 year ago

Most "sensible scenario":

Why customers want a SYNC setup:

Follow-ups:

ahus1 commented 1 year ago

Follow-up tasks have been captured in new issues, closing this issue.