Closed: PavelVlha closed this issue 1 year ago.
The kc-failover.sh script, plus the docs and associated changes required for the failover tests, was added as part of https://github.com/keycloak/keycloak-benchmark/pull/507
The failover tests below were run against the following environment:
AURORA_CLUSTER=ryan-aurora
AURORA_REGION=${REGION}
ROSA_CLUSTER_NAME_1=ryan-active
ROSA_CLUSTER_NAME_2=ryan-passive
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
KC_CLIENT_URL="client.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_1="primary.${DOMAIN}"
KC_HEALTH_URL_CLUSTER_2="backup.${DOMAIN}"
CROSS_DC_IMAGE=quay.io/pruivo/infinispan-server:14.0-patched
KC_DB_POOL_MIN_SIZE=15
KC_DB_POOL_MAX_SIZE=15
KC_DB_POOL_INITIAL_SIZE=15
KC_INSTANCES=6
KC_DISABLE_STICKY_SESSION='true'
KC_MEMORY_REQUESTS_MB=2048
KC_MEMORY_LIMITS_MB=3000
KC_HEAP_MAX_MB=2048
KC_HEAP_INIT_MB=1024
KC_CPU_REQUESTS=8
KC_CPU_LIMITS=8
KC_OTEL='true'
KC_CUSTOM_INFINISPAN_CONFIG_FILE=config/kcb-infinispan-cache-remote-store-config.xml
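These settings configure the two ROSA clusters and the Keycloak deployment used for the tests. As an illustration only (the actual provisioning entry point is not shown here), the overrides can be kept in an environment file and exported before running the provisioning and deployment scripts:
set -a                        # auto-export every variable assigned while sourcing
source ./failover-test.env    # hypothetical file containing the settings above
set +a
env | grep -E '^(KC_|ROSA_|AURORA_|DOMAIN|CROSS_DC)'   # sanity-check what will be picked up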
Keycloak dataset initialised with:
curl -v -k "https://${KC_CLIENT_URL}/realms/master/dataset/create-realms?realm-name=realm-0&count=1&threads-count=10&users-per-realm=10000"
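If needed, the dataset provider can be polled to confirm that realm creation has finished before starting the benchmark; the status path below is an assumption based on the dataset provider's other endpoints:
curl -sk "https://${KC_CLIENT_URL}/realms/master/dataset/status"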
The ./ansible/benchmark.sh and kc-failover.sh scripts are executed concurrently. The latter is run manually on one of the EC2 nodes used by benchmark.sh so that the Route53 failover latency can be observed more accurately.
Environment:
cluster_size: 2
kcb_zip: ../benchmark/target/keycloak-benchmark-0.10-SNAPSHOT.zip
Benchmark cmd:
DOMAIN="ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com"
REGION=eu-west-1
KC_CLIENT_URL="https://client.${DOMAIN}"
./benchmark.sh ${REGION} \
--scenario=keycloak.scenario.authentication.AuthorizationCode \
--server-url=${KC_CLIENT_URL} \
--users-per-sec=300 \
--measurement=300 \
--realm-name=realm-0 \
--logout-percentage=0 \
--users-per-realm=10000 \
--ramp-up=100 \
--log-http-on-failure
Failover cmd:
PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="HEALTH_PROBE" FAILOVER_DELAY=120 ./bin/kc-failover.sh
Route53 Failover time: 63 Seconds
Looking at the Infinispan logs, there appears to be an issue with lock contention in this scenario, as I believe writes are happening on both the active and the passive site:
14:05:45,567 WARN (jgroups-239,infinispan-0-40258) [org.infinispan.CLUSTER] ISPN000071: Caught exception when handling command SingleXSiteRpcCommand{command=RemoveCommand{key=WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)], value=null, returnEntry=false, metadata=null, internalMetadata=null, flags=[], commandInvocationId=CommandInvocation:7e2b2d82-8f0c-42ce-8eed-bb3a0d057e98:1305642, valueMatcher=MATCH_ALWAYS, topologyId=-1}} org.infinispan.remoting.RemoteException: ISPN000217: Received exception from infinispan-2-46837, see cause for remote stack trace
at org.infinispan.remoting.transport.ResponseCollectors.wrapRemoteException(ResponseCollectors.java:25)
at org.infinispan.remoting.transport.ValidSingleResponseCollector.withException(ValidSingleResponseCollector.java:37)
at org.infinispan.remoting.transport.ValidSingleResponseCollector.addResponse(ValidSingleResponseCollector.java:21)
at org.infinispan.remoting.transport.impl.SingleTargetRequest.addResponse(SingleTargetRequest.java:75)
at org.infinispan.remoting.transport.impl.SingleTargetRequest.onResponse(SingleTargetRequest.java:45)
at org.infinispan.remoting.transport.impl.RequestRepository.addResponse(RequestRepository.java:51)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1578)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processMessage(JGroupsTransport.java:1480)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport$ChannelCallbacks.up(JGroupsTransport.java:1685)
at org.jgroups.JChannel.up(JChannel.java:733)
at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:921)
at org.jgroups.protocols.relay.RELAY2.up(RELAY2.java:527)
at org.jgroups.protocols.FRAG2.up(FRAG2.java:138)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
at org.jgroups.protocols.FlowControl.up(FlowControl.java:245)
at org.jgroups.protocols.pbcast.GMS.up(GMS.java:845)
at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:226)
at org.jgroups.protocols.UNICAST3.deliverMessage(UNICAST3.java:1052)
at org.jgroups.protocols.UNICAST3.addMessage(UNICAST3.java:794)
at org.jgroups.protocols.UNICAST3.handleDataReceived(UNICAST3.java:776)
at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:425)
at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:658)
at org.jgroups.protocols.VERIFY_SUSPECT2.up(VERIFY_SUSPECT2.java:105)
at org.jgroups.protocols.FailureDetection.up(FailureDetection.java:180)
at org.jgroups.protocols.FD_SOCK2.up(FD_SOCK2.java:188)
at org.jgroups.protocols.MERGE3.up(MERGE3.java:274)
at org.jgroups.protocols.Discovery.up(Discovery.java:294)
at org.jgroups.stack.Protocol.up(Protocol.java:314)
at org.jgroups.protocols.TP.passMessageUp(TP.java:1178)
at org.jgroups.util.SubmitToThreadPool$SingleMessageHandler.run(SubmitToThreadPool.java:100)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000299: Unable to acquire lock after 10 seconds for key WrappedByteArray[0304090000000E\j\a\v\a\.\u\t\i\l\.\U\U\I\DBC9903F798\m85\/000000020000000C\l\e\a\s\t\S\i\g\B\i\t\s\$000000000B\m\o\s\t\S\i\g\B\i... (85 bytes)] and requestor CommandInvocation:infinispan-0-40258:832116. Lock is held by CommandInvocation:infinispan-2-46837:821621
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at org.infinispan.marshall.exts.ThrowableExternalizer.newThrowableInstance(ThrowableExternalizer.java:287)
at org.infinispan.marshall.exts.ThrowableExternalizer.readGenericThrowable(ThrowableExternalizer.java:265)
at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:240)
at org.infinispan.marshall.exts.ThrowableExternalizer.readObject(ThrowableExternalizer.java:44)
at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
at org.infinispan.marshall.core.BytesObjectInput.readObject(BytesObjectInput.java:32)
at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:49)
at org.infinispan.remoting.responses.ExceptionResponse$Externalizer.readObject(ExceptionResponse.java:41)
at org.infinispan.marshall.core.GlobalMarshaller.readWithExternalizer(GlobalMarshaller.java:727)
at org.infinispan.marshall.core.GlobalMarshaller.readNonNullableObject(GlobalMarshaller.java:708)
at org.infinispan.marshall.core.GlobalMarshaller.readNullableObject(GlobalMarshaller.java:357)
at org.infinispan.marshall.core.GlobalMarshaller.objectFromObjectInput(GlobalMarshaller.java:191)
at org.infinispan.marshall.core.GlobalMarshaller.objectFromByteBuffer(GlobalMarshaller.java:220)
at org.infinispan.remoting.transport.jgroups.JGroupsTransport.processResponse(JGroupsTransport.java:1570)
... 26 more
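If, as suspected above, writes are reaching both sites during the transition, one way to observe when Route53 actually switches the client record from the primary to the backup site is to watch the DNS answers from one of the load-generator nodes. A minimal sketch using the record names from the configuration above (the 5-second polling interval is arbitrary):
while true; do
  echo "$(date -u +%T) client  -> $(dig +short "client.${DOMAIN}" | tr '\n' ' ')"
  echo "$(date -u +%T) primary -> $(dig +short "primary.${DOMAIN}" | tr '\n' ' ')"
  echo "$(date -u +%T) backup  -> $(dig +short "backup.${DOMAIN}" | tr '\n' ' ')"
  sleep 5
done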
Failover cmd:
PROJECT=remerson-keycloak DOMAIN=ryan-active-ryan-passive-mtyxmdck.keycloak-benchmark.com FAILOVER_MODE="ALL_ROUTES" FAILOVER_DELAY=120 ./bin/kc-failover.sh
Route53 Failover time: 57 Seconds
Detailed results and exported Grafana dashboards can be found on the Google Drive.
The Grafana snapshots in failover-all-routes.tar.gz were not correctly exported.
Route53 Failover time: 58 Seconds
Following on from our discussions this afternoon, the failover now occurs after 4 minutes and the benchmark has been updated to use the following parameters:
./benchmark.sh ${REGION} \
--scenario=keycloak.scenario.authentication.AuthorizationCode \
--server-url=${KC_CLIENT_URL} \
--users-per-sec=300 \
--measurement=600 \
--realm-name=realm-0 \
--logout-percentage=100 \
--users-per-realm=10000 \
--ramp-up=100 \
--log-http-on-failure
As hoped, utilising --logout-percentage=100 has removed the build-up of active users towards the end of the benchmark.
In the Infinispan logs there were no lock-acquisition exceptions like those seen in the first run, even though I ran the benchmark across two EC2 nodes as before.
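As a rough spot-check (not part of the benchmark, and assuming the Keycloak admin credentials are available as environment variables), the remaining active sessions per client can be queried through the admin REST API after a run:
# KC_ADMIN_USER / KC_ADMIN_PASSWORD are assumed to hold the Keycloak admin credentials.
TOKEN=$(curl -sk -d "grant_type=password" -d "client_id=admin-cli" \
  -d "username=${KC_ADMIN_USER}" -d "password=${KC_ADMIN_PASSWORD}" \
  "${KC_CLIENT_URL}/realms/master/protocol/openid-connect/token" | jq -r .access_token)
curl -sk -H "Authorization: Bearer ${TOKEN}" \
  "${KC_CLIENT_URL}/admin/realms/realm-0/client-session-stats" | jq .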
Utilising the following benchmark with two EC2 clients:
./benchmark.sh ${REGION} \
--scenario=keycloak.scenario.authentication.AuthorizationCode \
--server-url=${KC_CLIENT_URL} \
--users-per-sec=300 \
--measurement=600 \
--realm-name=realm-0 \
--logout-percentage=100 \
--users-per-realm=10000 \
--ramp-up=100 \
--log-http-on-failure
Kill the connection between the active and passive sites to simulate a connection disruption. The idea is to remove the Gossip Router public route (we probably also need to kill/stop the Infinispan operator so that it does not recreate the route).
The Gossip Router pods on the active cluster are killed to simulate a network split between the active and passive sites. Removing the GossipRouter Route is not sufficient, as the TCP connections already established between the two sites remain open until the GR pod is killed.
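A manual way to inject this failure could look roughly like the following; the operator deployment name and the pod label selector are assumptions that depend on how the operator was installed, so adjust them to the actual cluster:
# Stop the Infinispan operator first so it cannot recreate the Route or the pods.
oc -n openshift-operators scale deployment/infinispan-operator-controller-manager --replicas=0
# Kill the Gossip Router pod(s) on the active site to break the x-site connections.
oc -n "${PROJECT}" delete pods -l app=infinispan-router-pod --force --grace-period=0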
Observe the active site: how long does it take for the active site to detect that the passive site is down, and does the active site work correctly without the passive site?
Requests to the active site fail for ~45 seconds, after which the passive site is suspected and the active site continues to function.
~Grafana hadn't installed correctly on my active site, so unfortunately the dashboards are missing from the results. I can repeat the experiment as required though.~
I have dashboards for the passive site, just not the active. Uploaded tar has been updated.
All Infinispan and Keycloak pods are killed on the passive site to simulate the site going down.
As expected, this produces results comparable to the active-site Gossip Router failure scenario, as x-site operations between the active and passive sites fail until the passive site is suspected by JGroups.
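The CLUSTER_FAIL mode of kc-failover.sh drives this scenario (command below); a manual equivalent could look roughly like the following, where the label selectors are assumptions and need to match the actual Infinispan and Keycloak pods on the passive cluster:
# Crash all Infinispan and Keycloak pods on the passive site at once.
oc -n "${PROJECT}" delete pods -l app=infinispan-pod --force --grace-period=0
oc -n "${PROJECT}" delete pods -l app=keycloak --force --grace-period=0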
Failover cmd:
DOMAIN=ryan-active-ryan-passive-mzeymtmk.keycloak-benchmark.com PROJECT=remerson-keycloak FAILOVER_MODE=CLUSTER_FAIL FAILOVER_DELAY=240 ./bin/kc-failover.sh
Discussion:
(Reducing the downtime from the current 40 seconds to 15 seconds requires a change in the Infinispan Operator.)
Alternative: switch to async mode, with the trade-off that not all updates on the active site are available on the passive site. In async mode, a resync is necessary when the passive site is completely restarted, and the keys queued for cross-site replication are kept in memory until they are sent.
Check with @stianst to verify if this is acceptable.
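For context, SYNC vs ASYNC is a per-backup setting in the Infinispan cache configuration. A minimal sketch of an async backup, with cache and site names that are illustrative rather than taken from kcb-infinispan-cache-remote-store-config.xml, printed here as a shell heredoc:
cat <<'EOF'
<distributed-cache name="sessions">
  <backups>
    <!-- Illustrative only: replicate to the backup site asynchronously and
         take it offline automatically after repeated failures. -->
    <backup site="site-b" strategy="ASYNC" failure-policy="WARN">
      <take-offline after-failures="3" min-wait="60000"/>
    </backup>
  </backups>
</distributed-cache>
EOF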
Most "sensible scenario":
Why customers want a SYNC setup:
Follow-ups:
Follow-up tasks have been captured in new issues, closing this issue.
Test scenarios:
Manually switch from the active to the passive site under load (see the sketch after this list)
Make the passive site fail
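For the manual active-to-passive switch, one option, assuming the failover records are fronted by Route53 health checks whose FQDN matches the primary health record (which may not hold for every setup), is to disable the primary health check so that the failover record answers with the passive site:
# Illustrative only: look up and disable the primary health check; re-enable it to fail back.
HC_ID=$(aws route53 list-health-checks \
  --query "HealthChecks[?HealthCheckConfig.FullyQualifiedDomainName=='primary.${DOMAIN}'].Id" \
  --output text)
aws route53 update-health-check --health-check-id "${HC_ID}" --disabled
# aws route53 update-health-check --health-check-id "${HC_ID}" --no-disabled   # fail back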