Closed Sovietaced closed 3 years ago
Digging deeper into the logs, I am seeing the following sequence of events:

1. There is a DNS change:
   `Detected DNS change. Slave <redacted> has changed ip from <redacted>.104 to <redacted>.122`
2. New connections are established:
   `24 connections initialized for <redacted>.122:6379`
3. A new MasterSlave entry appears:
   `<redacted>.122:6379 used as slave`
   `master.<redacted>.122:6379 has changed to replica.<redacted>.122:6379`

Steps 2 and 3 repeat for a few hours and rack up thousands of connections; the master keeps being changed to a replica for some reason.
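Under stated assumptions, the pattern those logs suggest can be modeled with a toy sketch (this is illustrative only, not Redisson's actual code): every time the resolved IP flips, a fresh pool is initialized while the pool bound to the previous IP is never shut down, so connections accumulate.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the suspected leak: each DNS flip initializes a new pool
// ("24 connections initialized for ...") without closing the old one.
public class DnsChangeLeakSketch {
    static final List<String> livePools = new ArrayList<>();

    // Stand-in for per-node pool initialization (hypothetical name).
    static void initPool(String ip) {
        livePools.add(ip); // in reality, 24 connections would open here
    }

    public static void main(String[] args) {
        String currentIp = "10.0.0.104";
        initPool(currentIp);
        // Simulate the DNS record flapping between two addresses for hours.
        String[] resolved = {"10.0.0.122", "10.0.0.104", "10.0.0.122"};
        for (String newIp : resolved) {
            if (!newIp.equals(currentIp)) {
                initPool(newIp);   // new pool for the new address
                currentIp = newIp; // BUG: previous pool is never closed
            }
        }
        System.out.println("live pools: " + livePools.size());
    }
}
```

With only two addresses in play, a handful of flips already leaves four live pools behind; over hours this matches the "thousands of connections" observed.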
Looking at the recent release it seems there are a number of fixes in this area.
- Fixed - don't add Redis Slave as active if connections can't be established (thanks to @yann9)
- Fixed - continuous reconnecting to broken host if it was defined as hostname in Redisson Cluster config
- Fixed - Redisson doesn't reconnect slave if it was excluded before due to errors in failedSlaveCheckInterval time range (thanks to @mikawudi)
We are using a hostname in our config and we seem to be reconnecting endlessly (and not cleaning up old connections). I will give the latest version a shot.
I have also run into this error today in my application of 10 pods. 2 pods always showed this error, and after I shut those 2 pods down the error was resolved. I have no idea why either. My Redisson version is 3.15.0, and I see that Redisson resolved this issue in 3.13.3, but maybe not actually.
I have located that in the org.redisson.connection.balancer.LoadBalancerManager#unfreeze(org.redisson.connection.ClientConnectionsEntry, org.redisson.connection.ClientConnectionsEntry.FreezeReason) method the exception is swallowed, like below:
In my opinion, this place could throw the exception out, and the org.redisson.connection.pool.ConnectionPool#scheduleCheck method could try/catch it and re-call scheduleCheck().
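That idea could be sketched like this (all names and structure are illustrative, not Redisson's actual methods): unfreeze() is allowed to propagate its exception, and scheduleCheck() catches it and reschedules itself.

```java
import java.util.Timer;
import java.util.TimerTask;

// Hypothetical sketch of the proposal: let unfreeze() throw instead of
// swallowing, and have scheduleCheck() catch and retry on a timer.
public class RescheduleSketch {
    static int attempts = 0;
    static final Timer timer = new Timer(true);

    // Stand-in for LoadBalancerManager#unfreeze: fails twice, then succeeds.
    static void unfreeze() throws Exception {
        attempts++;
        if (attempts < 3) {
            throw new Exception("slave not reachable yet");
        }
    }

    static void scheduleCheck(long delayMs, Runnable onSuccess) {
        timer.schedule(new TimerTask() {
            @Override public void run() {
                try {
                    unfreeze();            // may now throw instead of swallowing
                    onSuccess.run();
                } catch (Exception e) {
                    scheduleCheck(delayMs, onSuccess); // retry later
                }
            }
        }, delayMs);
    }

    public static void main(String[] args) throws InterruptedException {
        Object done = new Object();
        synchronized (done) {
            scheduleCheck(10, () -> { synchronized (done) { done.notify(); } });
            done.wait(5000);
        }
        System.out.println("attempts=" + attempts);
    }
}
```

The retry keeps running until unfreeze() finally succeeds, rather than silently giving up after the first swallowed exception.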
> I have located that in the org.redisson.connection.balancer.LoadBalancerManager#unfreeze(org.redisson.connection.ClientConnectionsEntry, org.redisson.connection.ClientConnectionsEntry.FreezeReason) method the exception is swallowed, like below
That method was improved in https://github.com/redisson/redisson/pull/3455. Can you give 3.15.2 version a try?
Hello @mrniko, I have looked at the 3.15.2 version. The exception is only dealt with by resetting the initialized flag to false in initCallBack, but there is no other code to bring the slave back up. Below is my test case with 3 masters and 3 slaves:
manually
`no available Redis entries....`
This way, I guess there is no strategy to unfreeze the slave entry automatically, and it cannot recover unless the application is rebooted.
After studying the code of slaveUp() carefully, I think we can change the result contract of the LoadBalancerManager#unfreeze(ClientConnectionsEntry, FreezeReason) method, because this method should guarantee idempotence: it should always return true if the slave has reconnected successfully, even if it is called many times.
Pseudo-code like below (using future.isCompletedExceptionally()):
Then in the ConnectionPool#scheduleCheck() method we can add some code to re-call scheduleCheck() if slaveUp() returns false, like below:
But I am not sure whether modifying it like this will introduce new errors, because the unfreeze() method is called from many places. I need your help.
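The two proposed changes could be sketched together like this (names, structure, and the connect() simulation are assumptions for illustration, not Redisson's actual code): slaveUp() keeps returning true once the entry is connected, and scheduleCheck() reschedules itself while slaveUp() still returns false.

```java
import java.util.Timer;
import java.util.TimerTask;
import java.util.concurrent.CompletableFuture;

// Illustrative sketch of an idempotent slaveUp() plus a retrying
// scheduleCheck(); the first two connection attempts fail, the third works.
public class ProposedRecoverySketch {
    static volatile boolean frozen = true;
    static int checks = 0;
    static final Timer timer = new Timer(true);

    // Simulated connection attempt (completed synchronously for the demo).
    static CompletableFuture<Void> connect() {
        CompletableFuture<Void> f = new CompletableFuture<>();
        if (checks < 3) f.completeExceptionally(new RuntimeException("down"));
        else f.complete(null);
        return f;
    }

    static boolean slaveUp() {
        if (!frozen) return true;             // idempotent: already up
        CompletableFuture<Void> f = connect();
        if (f.isCompletedExceptionally()) return false; // retry later
        frozen = false;
        return true;
    }

    static void scheduleCheck(Runnable onUp) {
        timer.schedule(new TimerTask() {
            @Override public void run() {
                checks++;
                if (slaveUp()) onUp.run();
                else scheduleCheck(onUp);     // slave still down: check again
            }
        }, 10);
    }

    public static void main(String[] args) throws InterruptedException {
        Object done = new Object();
        synchronized (done) {
            scheduleCheck(() -> { synchronized (done) { done.notify(); } });
            done.wait(5000);
        }
        System.out.println("checks=" + checks + " up=" + slaveUp());
    }
}
```

Note the demo completes its futures synchronously; as the maintainer points out below, a real connection promise is still pending when isCompletedExceptionally() is called, which is the weak point of this approach.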
> This way, I guess there is no strategy to unfreeze the slave entry automatically, and it cannot recover unless the application is rebooted.
It will be unfrozen by the org.redisson.connection.pool.ConnectionPool#scheduleCheck() method. That is controlled by the failedSlaveReconnectionInterval setting.
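For context, that setting is exposed on the master/slave configuration objects. A minimal sketch (the endpoint address is a placeholder; 3000 ms is Redisson's documented default):

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// failedSlaveReconnectionInterval is the delay, in milliseconds, between
// attempts to re-check (and unfreeze) a slave that was marked as failed.
public class ReconnectIntervalConfig {
    public static void main(String[] args) {
        Config config = new Config();
        config.useReplicatedServers()
              .addNodeAddress("redis://node1.example.com:6379") // placeholder
              .setFailedSlaveReconnectionInterval(3000);        // default 3000 ms
        RedissonClient client = Redisson.create(config);
        client.shutdown();
    }
}
```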
I agree with the changes in the ConnectionPool#scheduleCheck() method.
As for the future.isCompletedExceptionally() check: it will return false, since the promise won't be completed immediately.
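The maintainer's point can be checked directly with java.util.concurrent.CompletableFuture (used here as a stand-in for Redisson's promise type):

```java
import java.util.concurrent.CompletableFuture;

// isCompletedExceptionally() only reports failures that have ALREADY happened;
// on a still-pending promise it returns false, even if the connection attempt
// is going to fail later.
public class PendingPromiseSketch {
    public static void main(String[] args) {
        CompletableFuture<Void> promise = new CompletableFuture<>();
        System.out.println(promise.isCompletedExceptionally()); // false: pending
        promise.completeExceptionally(new RuntimeException("connect failed"));
        System.out.println(promise.isCompletedExceptionally()); // true: failed
    }
}
```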
> As for the future.isCompletedExceptionally() check: it will return false, since the promise won't be completed immediately.
Yes, future.isCompletedExceptionally() is disputable. We could add promise.sync() or some similar method before calling it, but we should pay attention to whether promise.sync() and similar methods will throw an exception.
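That concern can be illustrated with java.util.concurrent.CompletableFuture as a stand-in for Redisson's promise type: a blocking wait such as join() (the analogue of promise.sync()) rethrows the failure, so the caller would need to wrap it before the isCompletedExceptionally() check becomes meaningful.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

// A blocking wait on a failed promise rethrows the failure; only after
// catching it is isCompletedExceptionally() guaranteed to reflect the result.
public class SyncBeforeCheckSketch {
    public static void main(String[] args) {
        CompletableFuture<Void> promise = new CompletableFuture<>();
        promise.completeExceptionally(new RuntimeException("connect failed"));
        boolean failed;
        try {
            promise.join(); // would block until completion; here it rethrows
            failed = false;
        } catch (CompletionException e) {
            failed = true;  // the failure surfaced as an exception
        }
        System.out.println(failed && promise.isCompletedExceptionally());
    }
}
```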
> We are using a hostname in our config and we seem to be reconnecting endlessly (and not cleaning up old connections). I will give the latest version a shot.
We have now reproduced this issue again on 3.15.5. Unfortunately this happened in our production environment. It appears that AWS Elasticache applies service updates automatically, without warning, which results in a cluster update. Ultimately, I believe this causes a failover where one of the replicated cluster nodes becomes the new master node and the DNS for the primary endpoint changes.
I will try reproducing the issue by performing a manual failover on our development environment. If I can reproduce the issue I will work on a fix, since this client behavior is extremely disruptive.
To provide an update here, this appears to happen when I use the primary and replica endpoints provided by AWS Elasticache. If I configure the Redis client with the explicit node endpoints, there is no rampant connection creation.
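For reference, the working configuration described above looks roughly like this (the hostnames are placeholders for your actual cluster node endpoints; this assumes Redisson's replicated-servers mode, which is what a replicated Elasticache cluster maps to):

```java
import org.redisson.Redisson;
import org.redisson.api.RedissonClient;
import org.redisson.config.Config;

// List every node endpoint explicitly instead of the Elasticache
// primary/replica endpoints, which change on failover.
public class ExplicitNodesConfig {
    public static void main(String[] args) {
        Config config = new Config();
        config.useReplicatedServers().addNodeAddress(
                "redis://node-001.example.cache.amazonaws.com:6379",
                "redis://node-002.example.cache.amazonaws.com:6379",
                "redis://node-003.example.cache.amazonaws.com:6379");
        RedissonClient client = Redisson.create(config);
        client.shutdown();
    }
}
```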
Hi! We hit this issue again in our production environment. It happened with a 'SlaveConnectionPool no available Redis entries' exception. I see this issue has been closed, but I think there is still something to investigate here. Regards
@rjtokenring Which version?
@Sovietaced For replicated AWS Elasticache you need to define all nodes.
Expected behavior
Redisson client should be stable.
Actual behavior
Over the weekend we noticed that roughly 50,000 connections were made from Redisson clients to our AWS Elasticache Redis cluster. Typically our Redis cluster has a little over 1,000 client connections. This introduced a lot of load on the Redis cluster until requests started timing out. Rebooting applications that use Redis seemed to remediate the issue, but there are still some Redisson clients in a broken state.
One service is repeatedly logging the following a day later and is in a broken state.
I worked with our SRE team and, with the use of netstat, we discovered that this application was holding 2,400 open connections to the Redis cluster! Typically each application holds around 28 connections. Rebooting the application resolves the issue, which seems to indicate it is a client-side issue.
I looked back to when the issue started to occur and see no logs about creating new client connections. All I see are Redis operations beginning to timeout.
Steps to reproduce or test case
Unclear. We have been running this version of Redisson for months without issue. Typically our services are running for a maximum of 7 days.
Redis version
6.0.5
Redisson version
3.14.1
Redisson configuration