keycloak / keycloak-benchmark

Keycloak Benchmark
https://www.keycloak.org/keycloak-benchmark/
Apache License 2.0

A/P failure testing in SYNC mode #495

Closed: PavelVlha closed this issue 1 year ago

PavelVlha commented 1 year ago

Test scenarios:

- Manually switch from the active to the passive site under load
- Make the passive site fail

ryanemerson commented 1 year ago

The kc-failover.sh script, plus docs and associated changes required for the failover tests, was added as part of https://github.com/keycloak/keycloak-benchmark/pull/507

Deployment

AURORA_CLUSTER=ryan-aurora
AURORA_REGION=${REGION}
ROSA_CLUSTER_NAME_1=ryan-active
ROSA_CLUSTER_NAME_2=ryan-passive
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
KC_CLIENT_URL="client.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_1="primary.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_2="backup.${DOMAIN}"
CROSS_DC_IMAGE=quay.io/pruivo/infinispan-server:14.0-patched

KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_INSTANCES=6
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml
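
For reproducibility, the parameters above can be kept in an env file and exported before provisioning; a minimal sketch, where failover-test.env is a hypothetical file holding the assignments above and the rosa-cross-dc Taskfile in this repo is assumed to be the provisioning entry point:

set -a                    # auto-export every variable sourced below
source failover-test.env  # hypothetical file holding the assignments above
set +a
(cd provision/rosa-cross-dc && task)   # provisioning entry point assumed from the repo layout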

Keycloak dataset initialised with:

curl -v -k "https://${KC_CLIENT_URL}/realms/master/dataset/create-realms?realm-name=realm-0&count=1&threads-count=10&users-per-realm=10000"
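
Dataset creation runs asynchronously, so it is worth waiting for it to finish before starting the benchmark; a sketch polling the dataset provider's status endpoint (the matched response text is an assumption, check the JSON your server returns):

until curl -sk "https://${KC_CLIENT_URL}/realms/master/dataset/status" | grep -q 'No task in progress'; do
  sleep 10   # poll until realm/user creation completes
done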

Benchmark Setup

The ./ansible/benchmark.sh and kc-failover.sh scripts are executed concurrently. The latter is run manually on one of the EC2 nodes used by benchmark.sh so that the Route53 failover latency can be observed more accurately.

Environment:

cluster_size: 2
kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip

Benchmark cmd

DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
REGION=eu-west-1
KC_CLIENT_URL="https://client.${DOMAIN}"
./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=300 \
  --realm-name=realm-0 \
  --logout-percentage=0 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Initial Results

Disable Health Check

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="HEALTH_PROBE" FAILOVER_DELAY=120 ./bin/kc-failover.sh

Route53 Failover time: 63 seconds
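
For reference, the failover window can also be measured from the client side by timestamping when the Route53 answer for the client endpoint flips to the passive ingress; a minimal sketch (the one-second interval is arbitrary):

while true; do
  # log a UTC timestamp next to the current DNS answer for the client endpoint
  printf '%s %s\n' "$(date -u +%H:%M:%S)" "$(dig +short "client.${DOMAIN}" | tr '\n' ' ')"
  sleep 1
done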

Screenshot from 2023-08-29 15-23-33

Looking at the Infinispan logs, there appears to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and passive sites:

14:05:45,567 WARN  (jgroups-239,infinispan-0-40258) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=RemoveCommand{key=WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)], value=null, returnEntry=false, metadata=null, internalMetadata=null, flags=[], commandInvocationId=CommandInvocation:7e2b2d82-8f0c-42ce-8eed-bb3a0d057e98:1305642, valueMatcher=MATCH_ALWAYS, topologyId=-1}} org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-2-46837, see cause for remote stack trace
    at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:25)
    at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
    at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
    at org.infinispan.remoting.transport.impl.SingleTargetRequest.addResponse(SingleTargetRequest.java:75)
    at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:45)
    at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1578)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1480)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1685)
    at org.jgroups.JChannel.up(JChannel.java:733)
    at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:921)
    at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:527)
    at org.jgroups.protocols.FRAG2.up(FRAG2.java:138)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
    at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
    at org.jgroups.protocols.pbcast.GMS.up(GMS.java:845)
    at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
    at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1052)
    at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:794)
    at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:776)
    at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:425)
    at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:658)
    at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
    at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
    at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
    at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
    at org.jgroups.protocols.Discovery.up(Discovery.java:294)
    at org.jgroups.stack.Protocol.up(Protocol.java:314)
    at org.jgroups.protocols.TP.passMessageUp(TP.java:1178)
    at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:100)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)] and requestor CommandInvocation:infinispan-0-40258:832116. Lock is held by CommandInvocation:infinispan-2-46837:821621
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
    at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
    at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
    at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
    at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
    at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
    at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
    at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
    at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
    at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
    at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
    at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
    at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
    at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
    at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
    ... 26 more

Delete All Routes

Failover cmd:

PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="ALL_ROUTES" FAILOVER_DELAY=120 ./bin/kc-failover.sh
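
Conceptually, ALL_ROUTES mode makes the active site unreachable at the ingress layer; a sketch of the equivalent manual step, assuming it amounts to removing every Route in the Keycloak namespace (kc-failover.sh remains the source of truth for what the mode actually does):

# both client traffic and the Route53 health probe start failing
oc -n "${PROJECT}" delete routes --all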

Route53 Failover time: 57 seconds

Screenshot from 2023-08-29 15-25-41

Detailed results and exported Grafana dashboards can be found on the Google Drive.

ryanemerson commented 1 year ago

Delete All Routes 2

The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported.

Route53 Failover time: 58 seconds

failover-all-routes-2.tar.gz

Screenshot from 2023-08-29 16-31-41

ryanemerson commented 1 year ago

Disable Health Check 2

Route53 Failover time: 72 seconds

failover-health-probe-2.tar.gz

Screenshot from 2023-08-29 16-44-11

ryanemerson commented 1 year ago

Disable Health Check 3

Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

As hoped, utilising --logout-percentage=100 has removed the build-up of active users towards the end of the benchmark.

In the Infinispan logs there were no lock acquisition exceptions like those seen in the first run, even though I ran the benchmark across two EC2 nodes as before.

Screenshot from 2023-08-30 16-30-45

failover-health-probe-3.tar.gz

ryanemerson commented 1 year ago

Active Site Gossip Router Fail

Utilising the following benchmark with two EC2 clients:

./benchmark.sh ${REGION} \
  --scenario=keycloak.scenario.authentication.AuthorizationCode \
  --server-url=${KC_CLIENT_URL} \
  --users-per-sec=300 \
  --measurement=600 \
  --realm-name=realm-0 \
  --logout-percentage=100 \
  --users-per-realm=10000 \
  --ramp-up=100 \
  --log-http-on-failure

Kill the connection between the active and passive sites to simulate a connection disruption. The idea is to remove the Gossip Router public route (we probably also need to kill/stop the Infinispan operator to make sure it does not recreate the route).

Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive sites. Removing the GossipRouter Route is not sufficient, as the TCP connections established between the two sites remain open until the GR pod is killed.
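
A sketch of that manual step; the label selector below is an assumption, so verify it against the running pods first:

oc -n "${PROJECT}" get pods --show-labels                     # confirm the Gossip Router pod label
oc -n "${PROJECT}" delete pods -l app=infinispan-router-pod   # assumed label; severs the established cross-site TCP connections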

Observe the active site: how long does it take to detect that the passive site is down, and does the active site work correctly without the passive site?

Requests to the active site fail for ~45 seconds, after which the passive site is suspected and the active site continues to function.

Screenshot from 2023-08-31 11-23-14

Screenshot from 2023-08-31 11-23-30

~~Grafana hadn't installed correctly on my active site, so unfortunately the dashboards are missing from the results. I can repeat the experiment as required though.~~

I have dashboards for the passive site, just not the active. Uploaded tar has been updated.

gossip-router-fail-active.tar.gz

ryanemerson commented 1 year ago

Passive site cluster fail

All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down.

As expected, this produces results comparable to the Active Gossip Router failure scenario, as xsite operations between the active and passive sites fail until the passive site is suspected by JGroups.

Failover cmd:

DOMAIN=ryan-active-ryan-passive-mzeymtmk.keycloak-benchmark.com PROJECT=remerson-keycloak FAILOVER_MODE=CLUSTER_FAIL FAILOVER_DELAY=240 ./bin/kc-failover.sh
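
Conceptually, CLUSTER_FAIL takes the whole passive site down at once; a sketch of the equivalent manual step, assuming all pods in the namespace are force-deleted (the script defines the real behaviour):

# kill the Keycloak and Infinispan pods immediately, without graceful shutdown
oc -n "${PROJECT}" delete pods --all --grace-period=0 --force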

Screenshot from 2023-08-31 14-07-38

Screenshot from 2023-08-31 14-07-49

Screenshot from 2023-08-31 14-08-00

passive-site-cluster-fail.tar.gz

ahus1 commented 1 year ago

Discussion:

(Reducing the downtime from the current ~40 seconds to ~15 seconds would require a change in the Infinispan Operator.)

Alternative: go to ASYNC mode, with the tradeoff that not all updates on the active site are available on the passive site. In ASYNC mode, a resync is necessary when the passive site is completely restarted. Keys for the queued updates still to be sent are kept in memory.

Check with @stianst to verify if this is acceptable.

ahus1 commented 1 year ago

Most "sensible scenario":

Why customers want a SYNC setup:

Follow-ups:

ahus1 commented 1 year ago

Follow-up tasks have been captured in new issues, closing this issue.