redisson / redisson

Redisson - Easy Redis Java client and Real-Time Data Platform. Valkey compatible. Sync/Async/RxJava/Reactive API. Over 50 Redis or Valkey based Java objects and services: Set, Multimap, SortedSet, Map, List, Queue, Deque, Semaphore, Lock, AtomicLong, Map Reduce, Bloom filter, Spring, Tomcat, Scheduler, JCache API, Hibernate, RPC, local cache...
https://redisson.pro
Apache License 2.0

Redisson client opens thousands of connections, still disconnected from cluster #3646

Closed. Sovietaced closed this issue 3 years ago.

Sovietaced commented 3 years ago

Expected behavior

Redisson client should be stable.

Actual behavior

Over the weekend we noticed that roughly 50,000 connections were made from Redisson clients to our AWS Elasticache Redis cluster. Typically our Redis cluster has a little over 1000 client connections. This introduced a lot of load on the Redis cluster until requests started timing out. Rebooting the applications that use Redis seemed to remediate the issue, but some Redisson clients are still in a broken state.

[Screenshot from 2021-06-07 12-34-27]

[Screenshot from 2021-06-07 12-35-21]

A day later, one service is still repeatedly logging the following and remains in a broken state.


org.redisson.client.RedisConnectionException: SlaveConnectionPool no available Redis entries. Master entry host: <redacted>:6379 Disconnected hosts: [<redacted>:6379] Hosts disconnected due to errors during `failedSlaveCheckInterval`: [<redacted>:6379]    
at org.redisson.connection.pool.ConnectionPool.get(ConnectionPool.java:218)     
at org.redisson.connection.pool.SlaveConnectionPool.get(SlaveConnectionPool.java:30)    
at org.redisson.connection.balancer.LoadBalancerManager.nextConnection(LoadBalancerManager.java:267)    
at org.redisson.connection.MasterSlaveEntry.connectionReadOp(MasterSlaveEntry.java:500)     
at org.redisson.connection.MasterSlaveConnectionManager.connectionReadOp(MasterSlaveConnectionManager.java:572)     
at org.redisson.command.RedisExecutor.getConnection(RedisExecutor.java:648)     
at org.redisson.command.RedisExecutor.execute(RedisExecutor.java:116)   
at org.redisson.command.RedisExecutor$2.run(RedisExecutor.java:244)     
at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672) 
at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)   
at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)     
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)    
at java.lang.Thread.run(Unknown Source)

I worked with our SRE team, and using netstat we discovered that this application was holding 2400 open connections to the Redis cluster! Typically each application holds around 28 connections.

Rebooting the application resolves the issue, which seems to indicate it is a client side issue.

I looked back at when the issue started to occur and see no logs about new client connections being created. All I see are Redis operations beginning to time out.

Steps to reproduce or test case

Unclear. We have been running this version of Redisson for months without issue. Typically our services are running for a maximum of 7 days.

Redis version

6.0.5

Redisson version

3.14.1

Redisson configuration

  public RedissonClient build() {
    // nodes, password and tracer are fields of the enclosing builder class (not shown here).
    final Config config = new Config();
    final ReplicatedServersConfig replicatedConfig = config.useReplicatedServers();
    nodes.forEach(replicatedConfig::addNodeAddress);

    if (!Strings.isNullOrEmpty(password)) {
      replicatedConfig.setPassword(password);
    }

    final RedissonClient redissonClient = Redisson.create(config);

    // If a tracer is configured, wrap the client.
    if (tracer != null) {
      final TracingConfiguration tracingConfiguration = new TracingConfiguration.Builder(tracer)
          .build();
      return new TracingRedissonClient(redissonClient, tracingConfiguration);
    }

    return redissonClient;
  }
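For reference, the slave-exclusion behavior that comes up later in this thread is controlled by settings on the same replicated-servers config. A minimal sketch of how they could be set on top of the configuration above, assuming a recent Redisson 3.x API; the address is a placeholder and the interval values are arbitrary examples, so the setter names and defaults should be double-checked against the version in use:

import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

public class ReplicatedConfigSketch {
    public static RedissonClient create() {
        Config config = new Config();
        config.useReplicatedServers()
            // Placeholder endpoint; in our setup this is the Elasticache hostname.
            .addNodeAddress("redis://redis.example.internal:6379")
            // How often the nodes are re-scanned for master/replica role changes.
            .setScanInterval(2000)
            // How long command errors on a slave are tolerated before it is excluded
            // (the failedSlaveCheckInterval referenced in the exception above).
            .setFailedSlaveCheckInterval(180000)
            // How often Redisson tries to reconnect an excluded slave
            // (the failedSlaveReconnectionInterval discussed later in the thread).
            .setFailedSlaveReconnectionInterval(3000);
        return Redisson.create(config);
    }
}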
Sovietaced commented 3 years ago

Digging deeper into the logs I am seeing a sequence of events.

  1. There is a DNS change (a rough sketch of this kind of check follows the list).

    Detected DNS change. Slave <redacted> has changed ip from <redacted>.104 to <redacted>.122
  2. New connections are established.

    24 connections initialized for <redacted>.122:6379
  3. New MasterSlave entry

    <redacted>.122:6379 used as slave
    master.<redacted>.122:6379 has changed to replica.<redacted>.122:6379
  4. Steps 2 and 3 repeat for a few hours and rack up thousands of connections; the master keeps being changed to the replica for some reason.
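A rough, standalone illustration of the DNS check from step 1, using plain JDK name resolution (this is not Redisson's actual DNS monitor; the class and method names are made up for the sketch):

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class DnsChangeCheck {

    private Set<InetAddress> lastKnown = new HashSet<>();

    // Returns true when the hostname now resolves to a different set of addresses,
    // which is roughly the condition behind the "Detected DNS change" log line above.
    public boolean hasChanged(String hostname) throws UnknownHostException {
        Set<InetAddress> current = new HashSet<>(Arrays.asList(InetAddress.getAllByName(hostname)));
        boolean changed = !lastKnown.isEmpty() && !current.equals(lastKnown);
        lastKnown = current;
        return changed;
    }
}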

Sovietaced commented 3 years ago

Looking at the recent release, it seems there are a number of fixes in this area.

Fixed - don't add Redis Slave as active if connections can't be established (thanks to @yann9)
Fixed - continuous reconnecting to broken host if it was defined as hostname in Redisson Cluster config
Fixed - Redisson doesn't reconnect slave if it was excluded before due to errors in failedSlaveCheckInterval time range. (thanks to @mikawudi)

We are using a hostname in our config and we seem to be reconnecting endlessly (and not cleaning up old connections). I will give the latest version a shot.

alan-tang-tt commented 3 years ago

I also ran into this error today in my application, which runs 10 pods. 2 pods kept showing this error, and after I shut those 2 pods down the error was resolved. I have no idea why either. My Redisson version is 3.15.0, and I see that Redisson supposedly resolved this issue in 3.13.3, but apparently not entirely.

alan-tang-tt commented 3 years ago

I have found that in the org.redisson.connection.balancer.LoadBalancerManager#unfreeze(org.redisson.connection.ClientConnectionsEntry, org.redisson.connection.ClientConnectionsEntry.FreezeReason) method the exception is swallowed, as shown below: [image]

In my opinion, this place could rethrow the exception, and the org.redisson.connection.pool.ConnectionPool#scheduleCheck method could try/catch it and call scheduleCheck() again. [image]

mrniko commented 3 years ago

I have found that in the org.redisson.connection.balancer.LoadBalancerManager#unfreeze(org.redisson.connection.ClientConnectionsEntry, org.redisson.connection.ClientConnectionsEntry.FreezeReason) method the exception is swallowed, as shown below

That method was improved in https://github.com/redisson/redisson/pull/3455. Can you give 3.15.2 version a try?

alan-tang-tt commented 3 years ago

I have found that in the org.redisson.connection.balancer.LoadBalancerManager#unfreeze(org.redisson.connection.ClientConnectionsEntry, org.redisson.connection.ClientConnectionsEntry.FreezeReason) method the exception is swallowed, as shown below

That method was improved in #3455. Can you give 3.15.2 version a try?

Hello @mrniko, I have looked at the 3.15.2 version. The exception is only handled by resetting the initialized flag to false in initCallBack, but there is no other code to bring the slave back up. Below is my test case with 3 masters and 3 slaves:

  1. put a breakpoint in the ConnectionPool#get() method;
  2. issue a 'get' request to Redis; it will stop at the breakpoint;
  3. look at entriesCopy; it will have two values, one master and one slave;
  4. manually change the slave entry's freezeReason to RECONNECT;
  5. let the request continue; it will throw the 'no available Redis entries...' exception;
  6. retry the get request several times, wait a few minutes, then retry again and check the slave's freezeReason; it is still set;

[image]

From this, I guess there is no strategy to unfreeze the slave entry automatically, and it can never recover unless the application is rebooted.

After studying the code of slaveUp() carefully, I think we can modify the result definition of the LoadBalancerManager#unfreeze(ClientConnectionsEntry, FreezeReason) method:

  1. true, if the slave entry has reconnected successfully, whether it reconnected successfully this time or had already reconnected successfully before.
  2. false, if the slave entry fails to reconnect this time.

I think this method should guarantee idempotence: it should always return true if the slave has reconnected successfully, even if the method is called many times.

Pseudo-code like below, using future.isCompletedExceptionally(): [image]
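A rough, self-contained sketch of the same idea (a simplified model, not Redisson's actual LoadBalancerManager; SlaveEntry and its initConnections() method are hypothetical stand-ins for ClientConnectionsEntry and the real connection setup):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// Simplified stand-in for ClientConnectionsEntry: only tracks whether the slave is frozen.
class SlaveEntry {
    private volatile boolean frozen = true;

    boolean isFrozen() { return frozen; }
    void setFrozen(boolean frozen) { this.frozen = frozen; }

    // Stand-in for re-establishing the slave's connections.
    CompletableFuture<Void> initConnections() {
        return CompletableFuture.completedFuture(null);
    }
}

class LoadBalancer {
    // Proposed semantics: return true if the slave is connected, whether it was brought
    // back by this call or by an earlier one (idempotent); return false only if this
    // attempt to bring it back up fails.
    boolean unfreeze(SlaveEntry entry) {
        synchronized (entry) {
            if (!entry.isFrozen()) {
                return true; // already recovered earlier, still report success
            }
            try {
                entry.initConnections().join(); // wait, so the result below is meaningful
            } catch (CompletionException e) {
                return false; // this attempt failed; the caller may retry
            }
            entry.setFrozen(false);
            return true;
        }
    }
}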

Then, in the ConnectionPool#scheduleCheck() method, we can add some code to call scheduleCheck() again if slaveUp() returns false, like below: [image]
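A similarly hedged sketch of that retry idea, kept self-contained: the BooleanSupplier stands in for the slaveUp()/unfreeze() call and the delay parameter for failedSlaveReconnectionInterval; this is not the actual ConnectionPool code.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

class ReconnectScheduler {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    // unfreezeAttempt returns true once the slave is back, false if this attempt failed.
    // On failure, schedule another check instead of leaving the entry frozen forever.
    void scheduleCheck(BooleanSupplier unfreezeAttempt, long reconnectionIntervalMillis) {
        timer.schedule(() -> {
            if (!unfreezeAttempt.getAsBoolean()) {
                scheduleCheck(unfreezeAttempt, reconnectionIntervalMillis); // retry later
            }
        }, reconnectionIntervalMillis, TimeUnit.MILLISECONDS);
    }
}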

But I am not sure whether modifying it like this will introduce new errors, because the unfreeze() method is called from many places. I need your help.

mrniko commented 3 years ago

I guess there is no strategy to unfreeze the slave entry automatically, and it can never recover unless the application is rebooted.

It will be unfrozen by the org.redisson.connection.pool.ConnectionPool#scheduleCheck() method. That is controlled by the failedSlaveReconnectionInterval setting.

I agree with the changes in the ConnectionPool#scheduleCheck() method.

As for the future.isCompletedExceptionally() check, it will return false since the promise won't be completed immediately.

alan-tang-tt commented 3 years ago

As for the future.isCompletedExceptionally() check, it will return false since the promise won't be completed immediately.

Yes, the future.isCompletedExceptionally() check is debatable. We could call promise.sync() or a similar method before checking it, but we should pay attention to whether promise.sync() and similar methods throw an exception.
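To illustrate with a plain java.util.concurrent.CompletableFuture (not Redisson's promise type; the failure below is simulated), the check is only meaningful once the future has completed, and waiting on it rethrows the failure:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class FutureCheckDemo {
    public static void main(String[] args) {
        CompletableFuture<Void> future = new CompletableFuture<>();

        // Checked before completion: always false, even if the work will fail later.
        System.out.println(future.isCompletedExceptionally()); // false

        // Simulate the reconnect attempt failing.
        future.completeExceptionally(new IllegalStateException("slave connection failed"));

        // Checked after completion: now it reflects the real outcome.
        System.out.println(future.isCompletedExceptionally()); // true

        // Waiting on the future (the role promise.sync() would play) rethrows the failure,
        // so the caller has to be prepared to handle the exception.
        try {
            future.join();
        } catch (CompletionException e) {
            System.out.println("join() rethrew: " + e.getCause().getMessage());
        }
    }
}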

Sovietaced commented 3 years ago

We are using a hostname in our config and we seem to be reconnecting endlessly (and not cleaning up old connections). I will give the latest version a shot.

We have now reproduced this issue again on 3.15.5. Unfortunately, this happened in our production environment. It appears that AWS Elasticache applies some service updates automatically and without warning, and this results in a cluster update. Ultimately, I believe this causes a failover in which one of the replicated cluster nodes becomes the new master node and the DNS for the primary endpoint changes.

I will try reproducing the issue by performing a manual failover in our development environment. If I can reproduce it, I will work on a fix, since this client behavior is extremely disruptive.

Sovietaced commented 3 years ago

To provide an update here, this appears to happen when I use the primary and replica endpoints provided by AWS Elasticache. If I configure the Redis client with the explicit node endpoints, there is no rampant connection creation.
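To make the difference concrete, the two configuration shapes look roughly like this (the hostnames are placeholders rather than our real endpoints, and this is only a sketch of the config shape):

import org.redisson.config.Config;

public class EndpointConfigSketch {

    // Problematic in our case: the AWS-managed primary/replica endpoints, whose DNS
    // flips between nodes during failover and maintenance events.
    static Config endpointBased() {
        Config config = new Config();
        config.useReplicatedServers()
            .addNodeAddress("redis://primary.example-elasticache.amazonaws.com:6379",
                            "redis://replica.example-elasticache.amazonaws.com:6379");
        return config;
    }

    // Workaround: list every node endpoint explicitly, so the client tracks the
    // individual nodes instead of the floating primary/replica DNS names.
    static Config explicitNodes() {
        Config config = new Config();
        config.useReplicatedServers()
            .addNodeAddress("redis://node-001.example-elasticache.amazonaws.com:6379",
                            "redis://node-002.example-elasticache.amazonaws.com:6379",
                            "redis://node-003.example-elasticache.amazonaws.com:6379");
        return config;
    }
}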

rjtokenring commented 2 years ago

Hi! We hit this issue again in our production environment, with the 'SlaveConnectionPool no available Redis entries' exception. I see this issue has been closed, but I think there is still something to investigate here. Regards

mrniko commented 2 years ago

@rjtokenring

Which version?

mrniko commented 2 years ago

@Sovietaced

For replicated AWS Elasticache you need to define all nodes.