spring-projects / spring-data-redis

Provides support to increase developer productivity in Java when using Redis, a key-value store. Uses familiar Spring concepts such as template classes for core API usage and lightweight repository-style data access.
https://spring.io/projects/spring-data-redis/
Apache License 2.0
1.77k stars 1.17k forks

Reactive redis hangs forever and cause deadlock #2179

Open coney opened 3 years ago

coney commented 3 years ago

Bug Report

LettuceConnectionFactory.SharedConnection#resetConnection hangs forever and causes a deadlock

Current Behavior

I have enabled validateConnection on the Lettuce connection factory, and occasionally my service stops serving incoming requests. The thread dump shows that all the HTTP threads are waiting for the connection:

Stack trace

```
// http threads, take one for example
"reactor-http-epoll-6" #126 daemon prio=5 os_prio=0 cpu=16164.68ms elapsed=26788.53s allocated=1510M defined_classes=693 tid=0x0000560e1cfc1000 nid=0x168b waiting for monitor entry [0x00007fdb977c2000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1295)
    - waiting to lock <0x000000070a63d728> (a java.lang.Object)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
    at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
    at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
    at reactor.core.publisher.MonoSupplier.call(MonoSupplier.java:85)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:224)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:203)
```

All the HTTP threads are waiting for a lock that is held by the following thread:

```
"lettuce-epollEventLoop-5-1" #31 daemon prio=5 os_prio=0 cpu=7049.44ms elapsed=26823.40s allocated=1441M defined_classes=171 tid=0x0000560e1dd67000 nid=0x13de waiting on condition [0x00007fdbb8753000]
   java.lang.Thread.State: WAITING (parking)
    at jdk.internal.misc.Unsafe.park(java.base@11.0.8/Native Method)
    - parking to wait for <0x00000007197dec70> (a java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(java.base@11.0.8/Unknown Source)
    at java.util.concurrent.CompletableFuture$Signaller.block(java.base@11.0.8/Unknown Source)
    at java.util.concurrent.ForkJoinPool.managedBlock(java.base@11.0.8/Unknown Source)
    at java.util.concurrent.CompletableFuture.waitingGet(java.base@11.0.8/Unknown Source)
    at java.util.concurrent.CompletableFuture.join(java.base@11.0.8/Unknown Source)
    at org.springframework.data.redis.connection.lettuce.LettuceFutureUtils.join(LettuceFutureUtils.java:68)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider.release(LettuceConnectionProvider.java:74)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.release(LettuceConnectionFactory.java:1596)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.resetConnection(LettuceConnectionFactory.java:1360)
    - locked <0x000000070a63d728> (a java.lang.Object)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.validateConnection(LettuceConnectionFactory.java:1346)
    - locked <0x000000070a63d728> (a java.lang.Object)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$SharedConnection.getConnection(LettuceConnectionFactory.java:1302)
    - locked <0x000000070a63d728> (a java.lang.Object)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getSharedReactiveConnection(LettuceConnectionFactory.java:1049)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveClusterConnection(LettuceConnectionFactory.java:481)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:457)
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.getReactiveConnection(LettuceConnectionFactory.java:101)
    at org.springframework.data.redis.core.ReactiveRedisTemplate.lambda$doInConnection$0(ReactiveRedisTemplate.java:198)
    at org.springframework.data.redis.core.ReactiveRedisTemplate$$Lambda$773/0x00000008007edc40.get(Unknown Source)
```
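The "waiting to lock" and "locked" lines in a dump like this can also be read programmatically. The following is a minimal JDK-only sketch, not part of Spring Data Redis or Lettuce: it reproduces one thread blocked on a monitor held by another and asks `ThreadMXBean` for the owner. The class name `BlockedThreadReport` and the thread names are made up for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

// Hypothetical helper for illustration only.
final class BlockedThreadReport {

    /** Returns the name of the thread owning the lock that 'blocked' is waiting for, or null. */
    static String lockOwnerOf(Thread blocked) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        ThreadInfo info = threads.getThreadInfo(blocked.getId());
        return info == null ? null : info.getLockOwnerName();
    }

    /** Builds a tiny owner/waiter pair on one monitor and reports who holds it. */
    static String demo() throws InterruptedException {
        Object monitor = new Object();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch done = new CountDownLatch(1);

        Thread owner = new Thread(() -> {
            synchronized (monitor) {
                held.countDown();
                try {
                    done.await(); // keep the monitor until the report is taken
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "demo-owner");
        Thread waiter = new Thread(() -> {
            synchronized (monitor) {
                // nothing to do; entering the monitor is the point
            }
        }, "demo-waiter");

        owner.start();
        held.await();
        waiter.start();
        while (waiter.getState() != Thread.State.BLOCKED) {
            Thread.sleep(10);
        }
        String ownerName = lockOwnerOf(waiter);

        done.countDown();
        owner.join();
        waiter.join();
        return ownerName;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("demo-waiter is blocked by: " + demo());
    }
}
```

In the dump above, the same lookup on any `reactor-http-epoll-*` thread would point at `lettuce-epollEventLoop-5-1`.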

Input Code

Our application uses WebFlux to handle API requests, but I found that the Lettuce connection factory uses `synchronized` to guard `getConnection`:

```java
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#getConnection
@Nullable
StatefulConnection getConnection() {

    synchronized (this.connectionMonitor) {

        if (this.connection == null) {
            this.connection = getNativeConnection();
        }

        if (getValidateConnection()) {
            validateConnection();
        }

        return this.connection;
    }
}
```

Inside `validateConnection`, the call to `resetConnection` hangs:

```java
// org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory.SharedConnection#validateConnection
void validateConnection() {

    synchronized (this.connectionMonitor) {

        boolean valid = false;

        if (connection != null && connection.isOpen()) {
            try {

                if (connection instanceof StatefulRedisConnection) {
                    ((StatefulRedisConnection) connection).sync().ping();
                }

                if (connection instanceof StatefulRedisClusterConnection) {
                    ((StatefulRedisClusterConnection) connection).sync().ping();
                }
                valid = true;
            } catch (Exception e) {
                log.debug("Validation failed", e);
            }
        }

        if (!valid) {

            log.info("Validation of shared connection failed. Creating a new connection.");
            // the line below hangs
            resetConnection();
            this.connection = getNativeConnection();
        }
    }
}
```

Expected behavior/code

resetConnection should complete within a bounded time and should not cause a deadlock.

Environment

redis relevant configuration:

spring.redis.cluster.nodes={{spring_redis_cluster_nodes}} // we have 6 nodes
spring.redis.password={{spring_redis_password}}
spring.redis.cluster.max-redirects=5
spring.redis.cluster.topology-refresh-interval=10
spring.redis.lettuce.pool.min-idle=500
spring.redis.lettuce.pool.max-active=5000
spring.redis.lettuce.pool.max-wait=-1
spring.redis.lettuce.pool.max-idle=1000
spring.redis.timeout=10000
spring.redis.database=0 

Possible Solution

org.springframework.data.redis.connection.lettuce.LettuceConnectionProvider#release appears to wait for the future indefinitely. Maybe a timeout could partially mitigate this? I still don't know why release hangs.

    default void release(StatefulConnection<?, ?> connection) {
        LettuceFutureUtils.join(releaseAsync(connection));
    }
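To make the timeout idea concrete, here is a minimal JDK-only sketch of a bounded join. It is not the Spring Data Redis implementation: `BoundedJoin` is a hypothetical helper, and wrapping failures in `IllegalStateException` is an arbitrary choice for the example.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration only.
final class BoundedJoin {

    /** Waits for the future, but gives up after the timeout instead of parking forever. */
    static <T> T join(CompletableFuture<T> future, Duration timeout) {
        try {
            return future.get(timeout.toNanos(), TimeUnit.NANOSECONDS);
        } catch (TimeoutException e) {
            throw new IllegalStateException("Future did not complete within " + timeout, e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for callers
            throw new IllegalStateException("Interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        System.out.println(join(CompletableFuture.completedFuture("released"), Duration.ofSeconds(1)));
        try {
            join(new CompletableFuture<String>(), Duration.ofMillis(100));
        } catch (IllegalStateException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

A bound like this would only turn an indefinite hang into a delayed failure. The calling thread is still blocked for the whole timeout, so it does not address whatever prevents the release from completing.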

Additional context

stacktrace.zip

Reference

The original issue was posted on https://github.com/lettuce-io/lettuce-core/issues/1861

mp911de commented 3 years ago

Thanks for reporting the issue. Connection validation is synchronous, while reactive connections use a non-blocking API. What happens here is that the I/O thread is blocked and cannot proceed with connection validation or creation. Connection validation isn't necessary for Lettuce, as Lettuce auto-reconnects disconnected connections. If a connection is truly broken, it is either because the Redis server is down or because of a network partition. Neither scenario can be recovered from on the client side.
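The effect can be reproduced without Redis. The sketch below uses only the JDK and is an analogy, not Lettuce code: a single-threaded executor stands in for the event loop, and a task on it blocks on a future that only the same thread could complete. The class and method names are invented for the example.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical demo for illustration only.
final class EventLoopSelfBlockDemo {

    /** Returns true if the task on the single "event loop" thread is still blocked after waitMillis. */
    static boolean eventLoopBlocksItself(long waitMillis) throws Exception {
        ExecutorService eventLoop = Executors.newSingleThreadExecutor();
        CompletableFuture<String> release = new CompletableFuture<>();
        try {
            Future<String> task = eventLoop.submit(() -> {
                // The completion is queued behind the task that is currently running...
                eventLoop.execute(() -> release.complete("released"));
                // ...so blocking here means the only thread that could complete it never does.
                return release.join();
            });
            try {
                task.get(waitMillis, TimeUnit.MILLISECONDS);
                return false;
            } catch (TimeoutException e) {
                return true;
            }
        } finally {
            release.complete("forced by demo cleanup"); // unblock the worker so the JVM can exit
            eventLoop.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("event loop blocked itself: " + eventLoopBlocksItself(500));
    }
}
```

In the reported thread dump, the blocked thread additionally holds the `connectionMonitor` lock, which is why every HTTP thread that asks for the shared connection queues up behind it.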

Therefore, please disable validateConnection and make sure to enable early connection initialization to prevent blocking of the event loop thread.
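A minimal configuration sketch of that recommendation, assuming a Spring Data Redis 2.x version in which `LettuceConnectionFactory` provides `setEagerInitialization` (2.2 or later). The class name `RedisConnectionConfig` and the node addresses are placeholders, not values from this report. In a Spring Boot application, declaring this bean replaces the auto-configured factory, so settings such as the password, timeout, and pool options from `spring.redis.*` would have to be applied here as well.

```java
import java.util.List;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisClusterConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;

@Configuration
class RedisConnectionConfig {

    @Bean
    LettuceConnectionFactory redisConnectionFactory() {
        // Placeholder node addresses; substitute the real cluster nodes.
        RedisClusterConfiguration cluster =
                new RedisClusterConfiguration(List.of("redis-1:6379", "redis-2:6379"));

        LettuceConnectionFactory factory = new LettuceConnectionFactory(cluster);
        // Skip the blocking PING check; Lettuce reconnects on its own.
        factory.setValidateConnection(false);
        // Create the shared connection during startup rather than on the first request.
        factory.setEagerInitialization(true);
        return factory;
    }
}
```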