Closed — kurtharriger closed this issue 5 years ago
Thanks for reaching out. There are two issues from the report:
Master is currently unknown
You're using a static Master/Slave topology for your setup. We consider it static because there's no Redis Sentinel or Redis Cluster manager that can supply runtime topology details. On creating a StatefulRedisMasterSlaveConnection, Lettuce connects to each host specified via RedisURI and obtains its role.
Lettuce reuses role definitions as long as the connection is not closed. This also means that Lettuce isn't refreshing topology while a connection is connected.
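For reference, a static Master/Slave setup of this kind is typically created as below (a sketch against the Lettuce 5.x API; the hostnames are placeholders, not values from this report):

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

import java.util.Arrays;

RedisClient client = RedisClient.create();

// Roles are resolved once, at connect time: Lettuce asks each node for its
// role and caches the result for the lifetime of this connection.
StatefulRedisMasterSlaveConnection<String, String> connection = MasterSlave.connect(
        client, StringCodec.UTF8,
        Arrays.asList(
                RedisURI.create("redis://node-001.example.com:6379"),  // placeholder hosts
                RedisURI.create("redis://node-002.example.com:6379")));

connection.setReadFrom(ReadFrom.SLAVE_PREFERRED);
```

If the node behind one of these URIs stops reporting itself as master and no node takes over that role, the cached topology ends up with slaves only, which is the state the exception below describes.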
From the description above, the master node was not connectable, so new connection requests weren't able to contact the master, and therefore their topology view did not contain a master.
See:
"2019-03-26T04:24:00.327-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI […], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI […], role=SLAVE]]
Pool exhausted
This is an assumption: because the master went dysfunctional, commands on the master didn't complete and connections weren't released to the pool. This report does not show how you acquire/release a connection, but in most scenarios you would release a connection back to the pool once a RedisFuture completes.
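The release-on-completion pattern described above can be sketched in plain Java. The "pool" here is a stand-in (a BlockingQueue of placeholder connections), not Lettuce's pooling support, and runCommand is a hypothetical helper; the point is only that the connection goes back to the pool when the future completes, whether it succeeds or fails — and that a command which never completes holds its connection forever:

```java
import java.util.concurrent.*;

public class PoolReleaseSketch {

    // Stand-in for a connection pool with capacity 2.
    static final BlockingQueue<String> pool = new ArrayBlockingQueue<>(2);

    // Acquire a connection, release it when the async result completes.
    static CompletableFuture<String> runCommand(CompletableFuture<String> redisFuture)
            throws InterruptedException {
        String connection = pool.take();            // acquire (blocks if pool exhausted)
        return redisFuture.whenComplete((value, error) ->
                pool.offer(connection));            // release on completion, even on error
    }

    public static void main(String[] args) throws Exception {
        pool.offer("conn-1");
        pool.offer("conn-2");

        CompletableFuture<String> ok = new CompletableFuture<>();
        runCommand(ok);
        ok.complete("PONG");                        // success path releases the connection

        CompletableFuture<String> failed = new CompletableFuture<>();
        runCommand(failed);
        failed.completeExceptionally(new RuntimeException("master unreachable"));

        System.out.println("pool size after completion: " + pool.size());
    }
}
```

In the outage scenario, commands against the dead master neither succeed nor fail promptly, so the whenComplete callback never fires and the pool drains — consistent with the "Pool exhausted" messages.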
To your question:
Is there something we need to configure differently?
I suggest using Redis Cluster or at least Redis Sentinel as these modes of operation recover from failures because of active topology propagation. Redis Cluster isn't always possible because of limitations in transactions and cross-slot keys in multi-key commands.
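With Sentinel, the topology source is dynamic: Lettuce subscribes to Sentinel and learns about a newly promoted master without a restart. A sketch of the Sentinel variant (the Sentinel host and the master id "mymaster" are placeholders):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

RedisClient client = RedisClient.create();

// Sentinel tracks the current master, so failover is propagated to the client.
RedisURI sentinelUri = RedisURI.Builder
        .sentinel("sentinel-host.example.com", 26379, "mymaster")  // placeholders
        .build();

StatefulRedisMasterSlaveConnection<String, String> connection =
        MasterSlave.connect(client, StringCodec.UTF8, sentinelUri);
```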
Thanks.
I initially assumed that using a multi-node cluster with auto-failover would mean we would not need to do anything if we lose the master node, but another related issue, https://github.com/lettuce-io/lettuce-core/issues/338, seems to indicate this won't happen automatically, and it explains the failure during the manually initiated failover.
The odd thing about the initial incident is that AWS confirmed the master node did not change. So while the failure after the manual failover appears to require a restart, I'm unclear why, if the configuration is static and the master did not change, the initial incident still required a service restart to recover.
Could you elaborate on what you mean by "using Redis Cluster"? Are we not? Aside from using async and a connection pool, I don't see that our configuration is meaningfully different from Example 3 (AWS ElastiCache Cluster) described here: https://github.com/lettuce-io/lettuce-core/wiki/Master-Slave
It appears from the AWS response that there was packet loss, so high connection latencies would be expected, which is probably what exhausted the pool. Although some packet loss apparently persisted, it didn't prevent the service from operating normally after we restarted the nodes via redeploy, so the initial packet-loss problem was probably very short.
Our code to acquire and release connections is here: https://bitbucket.org/snippets/atlassian/5enz7j. I think you'll agree that the connection is released even on exceptional behaviour.
What isn't clear to me is how this code should be changed. Closing the connections would defeat the purpose of using the pool. Shouldn't the refreshing of connections, when needed, be handled by the pool?
So there are a few setups where Redis and Cluster are used together. There are four "official" (meaning no external orchestration) modes of Redis operation:

- Redis Standalone
- Redis Master/Slave (replication, no automated failover)
- Redis Sentinel (replication with Sentinel-managed failover)
- Redis Cluster (sharding with automated failover)

Typically, we see people calling all except Redis Standalone a cluster. AWS provides two models:

- cluster mode disabled (one primary with replicas)
- cluster mode enabled (Redis Cluster)
Lettuce supports only the built-in Redis operation modes, as each cloud provider tends to attach orchestration bits to their service and we do not want to buy into complexity that goes beyond OSS Redis. Typically, when running Redis with a number of replicas, there is no automatic failover. The only failover that is possible requires intervention with the Redis setup (reconfiguration of the Master/Slave roles), which is why we call it static in Lettuce. Because there is also no standard mechanism that would notify Lettuce about a reconfiguration, there is no way to reconfigure these connections while your application is active. Your application would need to close all connections and reopen them.
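The close-and-reopen cycle could be sketched as a hypothetical helper around the Lettuce 5.x API (the class, the refresh trigger, and the node URIs are all assumptions for illustration; this is not a Lettuce feature):

```java
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

import java.util.Arrays;
import java.util.List;

// Hypothetical helper: when you learn out of band that a failover happened,
// drop the old connection and create a new one so node roles are re-discovered.
class ReconnectingMasterSlave {

    private final RedisClient client = RedisClient.create();
    private final List<RedisURI> uris = Arrays.asList(          // assumption: your node URIs
            RedisURI.create("redis://node-001.example.com:6379"),
            RedisURI.create("redis://node-002.example.com:6379"));

    private volatile StatefulRedisMasterSlaveConnection<String, String> connection = open();

    private StatefulRedisMasterSlaveConnection<String, String> open() {
        // Roles are read at connect time, so a fresh connect sees the new master.
        return MasterSlave.connect(client, StringCodec.UTF8, uris);
    }

    // Call after a reconfiguration; existing connections never refresh themselves.
    synchronized void refresh() {
        connection.close();
        connection = open();
    }
}
```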
Closing as there's nothing left to do.
Bug Report
We had an outage related to a Redis issue. AWS reported that the master node did experience some packet loss but was otherwise healthy.
The logs basically look like this, starting out with a bunch of Pool exhausted messages, followed by Master is currently unknown. The issue persisted until we redeployed the application.
(full stack trace here https://bitbucket.org/snippets/atlassian/Ke4GzR)
I opened a ticket with Amazon to determine if the node actually failed.
AWS reply:
At this time we had already recovered; however, per AWS's comment that the master node was still experiencing packet loss, I decided to fail over the master node.
As soon as I failed over the master node, we experienced the same issue (thankfully off-peak, when DynamoDB was able to handle the traffic without throttling requests). Although xxx-002 was promoted to master, the logs still reported this node as a slave. It required another redeployment for the client to pick up the new master node.
I'm concerned that the client is unable to handle node failures without manual intervention. Is there something we need to configure differently?
Input Code
Expected behavior/code
The client should gracefully handle master node failover without restarting the client.
Environment
Lettuce 5.1.0.RELEASE, AWS ElastiCache 3-node multi-AZ cluster, Redis 3.2.4