spring-projects / spring-data-redis

Provides support to increase developer productivity in Java when using Redis, a key-value store. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-redis/
Apache License 2.0
1.77k stars 1.17k forks

Connection hangs forever using lettuce and timeouts are not working #2863

KP-EV closed this issue 8 months ago

KP-EV commented 8 months ago

Hi,

We use spring-boot-starter-data-redis:3.2.2 against a single Redis instance with these timeouts configured:

spring:
  data:
    redis:
      host: "${REDIS_HOST}" #ip address
      port: "6379"
      timeout: 150ms
      connect-timeout: 200ms
      lettuce:
        pool:
          enabled: true
          max-active: 3
          max-wait: 150ms
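For reference, a programmatic sketch of roughly what the Boot starter derives from those properties, using Spring Data Redis's `LettucePoolingClientConfiguration`; the bean and class names here are illustrative, and exact builder methods may vary with your Boot and Lettuce versions:

```java
import java.time.Duration;

import io.lettuce.core.ClientOptions;
import io.lettuce.core.SocketOptions;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.data.redis.connection.lettuce.LettucePoolingClientConfiguration;

@Configuration
class RedisConfig {

    @Bean
    LettuceConnectionFactory redisConnectionFactory() {
        // lettuce.pool.max-active / max-wait
        GenericObjectPoolConfig<Object> pool = new GenericObjectPoolConfig<>();
        pool.setMaxTotal(3);
        pool.setMaxWait(Duration.ofMillis(150));

        LettucePoolingClientConfiguration client = LettucePoolingClientConfiguration.builder()
                // spring.data.redis.timeout (command timeout)
                .commandTimeout(Duration.ofMillis(150))
                // spring.data.redis.connect-timeout (TCP connect)
                .clientOptions(ClientOptions.builder()
                        .socketOptions(SocketOptions.builder()
                                .connectTimeout(Duration.ofMillis(200))
                                .build())
                        .build())
                .poolConfig(pool)
                .build();

        RedisStandaloneConfiguration server =
                new RedisStandaloneConfiguration(System.getenv("REDIS_HOST"), 6379);
        return new LettuceConnectionFactory(server, client);
    }
}
```

Note that the command timeout only bounds how long a caller waits for a reply; it does not protect against a blocked event loop thread.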

But we still notice that sometimes, after the application has run for about a day, the Redis connection dies completely and any call through the @Cacheable annotation hangs forever.

We are looking into making the connection more reliable via this thread: https://github.com/lettuce-io/lettuce-core/issues/1428

But I am also wondering why these three timeouts are not taking effect and the connection still hangs forever. Everything else in the application works fine, and after a restart the Redis access works again.

mp911de commented 8 months ago

Please attach a thread dump for further diagnosis.

KP-EV commented 8 months ago

dump2.txt

Thanks for the quick response! :)

mp911de commented 8 months ago

The offending thread has this trace:

"lettuce-nioEventLoop-4-2" #80 [146] daemon prio=5 os_prio=0 cpu=17071.84ms elapsed=63349.49s tid=0x000078f8b8051170 nid=146 waiting on condition  [0x000078f9b41fd000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@21.0.2/Native Method)
    - parking to wait for  <0x00000000d09834f8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(java.base@21.0.2/LockSupport.java:221)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.2/AbstractQueuedSynchronizer.java:754)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21.0.2/AbstractQueuedSynchronizer.java:990)
    at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@21.0.2/ReentrantLock.java:153)
    at java.util.concurrent.locks.ReentrantLock.lock(java.base@21.0.2/ReentrantLock.java:322)
    at java.lang.ref.ReferenceQueue.poll(java.base@21.0.2/ReferenceQueue.java:182)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:295)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:277)
    at io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.attachContextToTask(ExecutorAdviceHelper.java:51)
    at java.util.concurrent.ForkJoinPool.execute(java.base@21.0.2/ForkJoinPool.java:2847)
    at java.lang.VirtualThread.submitRunContinuation(java.base@21.0.2/VirtualThread.java:240)
    at java.lang.VirtualThread.unpark(java.base@21.0.2/VirtualThread.java:734)
    at java.lang.System$2.unparkVirtualThread(java.base@21.0.2/System.java:2660)
    at jdk.internal.misc.VirtualThreads.unpark(java.base@21.0.2/VirtualThreads.java:93)
    at java.util.concurrent.locks.LockSupport.unpark(java.base@21.0.2/LockSupport.java:179)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.signalNext(java.base@21.0.2/AbstractQueuedSynchronizer.java:645)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.release(java.base@21.0.2/AbstractQueuedSynchronizer.java:1060)
    at java.util.concurrent.locks.ReentrantLock.unlock(java.base@21.0.2/ReentrantLock.java:494)
    at java.lang.ref.ReferenceQueue.poll(java.base@21.0.2/ReferenceQueue.java:186)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.AbstractWeakConcurrentMap.expungeStaleEntries(AbstractWeakConcurrentMap.java:235)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.weaklockfree.WeakConcurrentMap$WithInlinedExpunction.getIfPresent(WeakConcurrentMap.java:193)
    at io.opentelemetry.javaagent.shaded.instrumentation.api.internal.cache.WeakLockFreeCache.get(WeakLockFreeCache.java:26)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.mapGet(VirtualFieldImplementationsGenerator.java:295)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.realGet(VirtualFieldImplementationsGenerator.java)
    at io.opentelemetry.javaagent.bootstrap.field.VirtualFieldImpl$java$lang$Runnable$io$opentelemetry$javaagent$bootstrap$executors$PropagatedContext.get(VirtualFieldImplementationsGenerator.java:277)
    at io.opentelemetry.javaagent.bootstrap.executors.ExecutorAdviceHelper.attachContextToTask(ExecutorAdviceHelper.java:51)
    at java.util.concurrent.ForkJoinPool.execute(java.base@21.0.2/ForkJoinPool.java:2847)
    at java.lang.VirtualThread.submitRunContinuation(java.base@21.0.2/VirtualThread.java:240)
    at java.lang.VirtualThread.unpark(java.base@21.0.2/VirtualThread.java:734)
    at java.lang.System$2.unparkVirtualThread(java.base@21.0.2/System.java:2660)
    at jdk.internal.misc.VirtualThreads.unpark(java.base@21.0.2/VirtualThreads.java:93)
    at java.util.concurrent.locks.LockSupport.unpark(java.base@21.0.2/LockSupport.java:179)
    at java.util.concurrent.CompletableFuture$Signaller.tryFire(java.base@21.0.2/CompletableFuture.java:1848)
    at java.util.concurrent.CompletableFuture.postComplete(java.base@21.0.2/CompletableFuture.java:510)
    at java.util.concurrent.CompletableFuture.complete(java.base@21.0.2/CompletableFuture.java:2179)
    at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:122)
    at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:111)
    at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:63)
    at io.lettuce.core.protocol.CommandWrapper.complete(CommandWrapper.java:63)
    at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:745)
    at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:680)
    at io.lettuce.core.protocol.CommandHandler.channelRead(CommandHandler.java:597)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.runWith(java.base@21.0.2/Thread.java:1596)
    at java.lang.Thread.run(java.base@21.0.2/Thread.java:1583)

Some OpenTelemetry instrumentation blocks the event loop thread, so this is out of scope for both the driver and Spring Data.
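Why a parked event loop defeats the configured timeouts: Lettuce decodes responses and completes commands on that one thread, so once it blocks, nothing queued behind it ever runs, and only waits that carry their own deadline return to the caller. A minimal, self-contained sketch of that failure mode using plain `java.util.concurrent` (no Lettuce or Netty involved):

```java
import java.util.concurrent.*;

public class EventLoopBlockDemo {

    // Returns true when a task queued behind a blocked single-threaded
    // "event loop" fails to complete within the caller's own deadline.
    static boolean callerTimesOutWhenLoopIsBlocked() {
        // One thread stands in for the Netty event loop.
        ExecutorService loop = Executors.newSingleThreadExecutor();
        CountDownLatch never = new CountDownLatch(1);

        // First task parks the loop thread indefinitely, like the
        // contended ReentrantLock inside ReferenceQueue.poll above.
        loop.submit(() -> {
            try { never.await(); } catch (InterruptedException ignored) { }
        });

        // The "response" task never gets to run on the blocked loop.
        Future<String> reply = loop.submit(() -> "PONG");
        try {
            reply.get(200, TimeUnit.MILLISECONDS);
            return false;
        } catch (TimeoutException e) {
            return true;  // only the caller-side deadline fired
        } catch (InterruptedException | ExecutionException e) {
            return false;
        } finally {
            loop.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("caller timed out: " + callerTimesOutWhenLoopIsBlocked());
    }
}
```

Any wait without such a deadline on the calling side hangs forever, which matches the @Cacheable behavior reported here.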

Generally, there's some proper tracing support via Micrometer Tracing that doesn't require instrumentation by a Java agent.
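A sketch of what that could look like, assuming Lettuce's `MicrometerTracing` adapter and Spring Boot's `ClientResourcesBuilderCustomizer` hook; the bean name and constructor arguments here are illustrative, so verify against your Boot and Lettuce versions:

```java
import io.lettuce.core.tracing.MicrometerTracing;
import io.micrometer.observation.ObservationRegistry;
import org.springframework.boot.autoconfigure.data.redis.ClientResourcesBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class RedisTracingConfig {

    @Bean
    ClientResourcesBuilderCustomizer lettuceTracing(ObservationRegistry registry) {
        // Observations are recorded via Micrometer inside the driver,
        // with no byte-code instrumentation running on the event loop.
        return builder -> builder.tracing(new MicrometerTracing(registry, "redis"));
    }
}
```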

KP-EV commented 8 months ago

Thanks for your help, I'll take a look.

KP-EV commented 7 months ago

FYI: disabling the Lettuce instrumentation in the OpenTelemetry agent via OTEL_INSTRUMENTATION_LETTUCE_ENABLED: 'false' worked fine. We haven't had this issue since. Thank you for your help :)