redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License
5.41k stars 979 forks source link

Periodic cluster topology refresh blocks commands execution. #2908

Open ashok-shukla opened 4 months ago

ashok-shukla commented 4 months ago

Bug Report

Current Behavior

Periodic cluster topology refresh blocks commands execution.

From JFR, 1) topology refresh took 1.5 sec Start Time Duration End Time Event Thread sources 7/5/24, 5:49:06.714 PM 1.500 s 7/5/24, 5:49:08.215 PM lettuce-nioEventLoop-5-2 xxxx,xxxx

2) Thread is parked to execute get command during the topology refresh period

Start Time  Duration    End Time    Event Thread    Address of Object Parked    Class Parked On Park Timeout    Park Until
7/5/24, 5:49:06.734 PM  1.380 s 7/5/24, 5:49:08.115 PM  io-executor-thread-9    0xE9789188  java.util.concurrent.CompletableFuture$Signaller    60.000 s    12/2/55, 4:47:04.192 PM

get command stack:
  Stack Trace   Count   Percentage
  void jdk.internal.misc.Unsafe.park(boolean, long):-1  1   100 %
  void java.util.concurrent.locks.LockSupport.parkNanos(Object, long):269   1   100 %
  boolean java.util.concurrent.CompletableFuture$Signaller.block():1866 1   100 %
  void java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker):3780   1   100 %
  void java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker):3725 1   100 %
  Object java.util.concurrent.CompletableFuture.timedGet(long):1939 1   100 %
  Object java.util.concurrent.CompletableFuture.get(long, TimeUnit):2095    1   100 %
  boolean io.lettuce.core.protocol.AsyncCommand.await(long, TimeUnit):83    1   100 %
  Object io.lettuce.core.internal.Futures.awaitOrCancel(RedisFuture, long, TimeUnit):244    1   100 %
  Object io.lettuce.core.cluster.ClusterFutureSyncInvocationHandler.handleInvocation(Object, Method, Object[]):131  1   100 %
  Object io.lettuce.core.internal.AbstractInvocationHandler.invoke(Object, Method, Object[]):80 1   100 %
  Object jdk.proxy2.$Proxy119.get(Object):-1    1   100 %
  byte[] oracle.oic.v3.mtms.infra.cache.impl.AbstractCacheImpl.lambda$get$8(byte[], String):280 1   100 %
  .....
  .....

Expected behavior/code

As for reference guide, from Lettuce 6.0, Cluster topology refresh is now non-blocking. But from JFR, it shows as blocking.

Environment

Possible Solution

Additional context

We are executing commands using synchronous api. Our redis cluster consists of 9 nodes: 3 master and each master has 2 replicas.

tishun commented 4 months ago

Hey @ashok-shukla ,

what lead you to conclude that the parked thread is handling topology refresh? Is there some minimum reproducible example of the problem we can take a look at?

tishun commented 4 months ago

(Asynchronous topology refresh is handled in #1107)

ashok-shukla commented 4 months ago

parked thread is not handling topology refresh. but waiting for topology refresh.

io-executor-thread-9: is parked from 7/5/24, 5:49:06.734 PM to 7/5/24, 5:49:08.115 PM. lettuce-nioEventLoop-5-2: doing topology refresh from 7/5/24, 5:49:06.714 PM to 7/5/24, 5:49:08.215 PM.

I do not have a minimum reproducible example now. if possible, I will try to create one.

ashok-shukla commented 4 months ago

reproducer: cache.zip

  1. download cache.zip and unzip
  2. update REDIS_HOST and REDIS_PORT in CheckTopologyRefresh.java as per your redis cluster env. Our redis cluster nodes running in kubernetes and we just point to redis cluster kubernetes service as redis host.
  3. Update ssl options as per you setup.
  4. build the project: mvn clean install
  5. run the jar with following command: java -Xms1024m -Xmx1024m -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:GCLockerRetryAllocationCount=100 -XX:+FlightRecorder -XX:StartFlightRecording=duration=300s,filename=lettuce5min.jfr -jar cache-1.0-SNAPSHOT-jar-with-dependencies.jar 5 | tee test5min.log
  6. compare topology refresh time in lettuce5min.jfr with time in test5min.log. Check response time during that period.
tishun commented 4 months ago

Awesome! Much appreciated!