redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

Client does not handle master failover without intervention #1008

Closed kurtharriger closed 5 years ago

kurtharriger commented 5 years ago

Bug Report

We had an outage related to a Redis issue. AWS reported that the master node did experience some packet loss but was otherwise healthy.

The logs basically look like this, starting out with a bunch of Pool exhausted messages followed by Master is currently unknown errors. The issue persisted until we redeployed the application.

"_time",stack
"2019-03-26T04:23:51.038-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"
"2019-03-26T04:23:51.063-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"

... many more...

"2019-03-26T04:24:00.272-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"
"2019-03-26T04:24:00.276-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"
"2019-03-26T04:24:00.287-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"
"2019-03-26T04:24:00.295-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"
"2019-03-26T04:24:00.296-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"
"2019-03-26T04:24:00.311-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"
"2019-03-26T04:24:00.321-0600","java.util.concurrent.CompletionException: java.util.NoSuchElementException: Pool exhausted
   at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
   at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
   at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
   at java.util.concurrent.CompletableFuture.uniApplySt"
"2019-03-26T04:24:00.327-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"
"2019-03-26T04:24:00.329-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"
"2019-03-26T04:24:00.331-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-002.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='XXX-001.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]]
   at java.util.concurrent.CompletableFuture.encodeThrowable"

... many more ...

(full stack trace here https://bitbucket.org/snippets/atlassian/Ke4GzR)

I opened a ticket with Amazon to determine if the node actually failed.

At 10:23am GMT we experienced some connectivity issues with the Redis cluster from i-aaa, i-bbb and i-ccc.

Logs were stating we were unable to connect to redis: java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI [host='xxx.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI [host='xxx.qgee1s.0001.euw1.cache.amazonaws.com', port=6379], role=SLAVE]

It looks like most of the Redis metrics fall off a cliff at about the same time; in some cases, such as CPU and network out, they fall off at 10:22am, just before we started experiencing issues.

I'm having difficulty determining whether this is the cause or a symptom of our application issues. Can you help me determine if there were any other related issues with this cluster around this time, and if so, when they were resolved?

AWS reply:

Hello, Thank you for contacting AWS Premium Support. I took a look at the XXX cluster and here is what I see:

  1. Looking at the ElastiCache backend there were no recent events that affected master node XXX-003 or the other nodes (this can be confirmed in the ElastiCache events[1].)
  2. Master node XXX-003 was the master the entire time when the problem occurred, so this confirms that there were no failovers to the replicas or intermittent problems with replicas becoming the master temporarily.
  3. The cluster had plenty of free memory[3]. Evictions[4] were also not taking place, further confirming that the cluster was not running out of memory, either due to the application overloading it or the cluster having an internal problem that would cause high memory usage and/or Redis to crash.
  4. The application started to make new connections[5] to replicas XXX-001 and XXX-002.
  5. An internal metric that I have access to shows that there was some packet loss on the underlying host that master node XXX-003 is running on.
  6. There is a sharp decrease in packets coming into the cluster[6] from the application. It looks like the application stopped trying to contact the cluster at this point. This would cause the CPU, NetworkOut and other metrics for the cluster to be reduced.

So my thinking is that during the time frame of the problem perhaps the application needed to get a list of/determine the roles of the nodes for some reason but did not get a response from the master. This caused the "Master is currently unknown" error to be thrown. Looking at the code[7] for the lettuce client, it looks like it only tries once to reach the master. If possible, you may want to see if there is a way in your application to work around this so that minor problems with packet loss (which will happen with TCP networks) can be dealt with. Or you may want to ask for more details on the error on the lettuce issue tracker[8]. I see that the packet loss is still occurring on the master's underlying host, so I recommend replacing the node[9].

Let me know if you have any additional questions.

At this time we had already recovered; however, per AWS's comment that the master node was still experiencing packet loss, I decided to fail over the master node.

As soon as I failed over the master node we experienced the same issue (thankfully off-peak, when DynamoDB was able to handle the traffic without throttling requests). Although xxx-002 was promoted to master, the logs still reported this node as a slave. It required another redeployment for the client to pick up the new master node.

I'm concerned that the client is unable to handle node failures without manual intervention. Is there something we need to configure differently?

Input Code

public AsyncPool<StatefulRedisMasterSlaveConnection<byte[], byte[]>> newPool() {
    Duration timeout = Duration.ofSeconds(connectionTimeout);
    // Write master first, then the read shards (zip is a local helper that
    // pairs each read host with its port).
    List<RedisURI> uris = Stream.concat(
            Stream.of(new RedisURI(host, port, timeout)),
            zip(readHosts.stream(), readPorts.stream(),
                    (host, port) -> new RedisURI(host, port, timeout)))
            .collect(toList());

    RedisClient client = RedisClient.create();

    return AsyncConnectionPoolSupport.createBoundedObjectPool(
            () -> MasterSlave.connectAsync(client, ByteArrayCodec.INSTANCE, uris),
            BoundedPoolConfig.builder()
                    .maxIdle(getPoolMaxIdle())
                    .minIdle(getPoolMinIdle())
                    .maxTotal(getPoolMaxTotal())
                    .build());
}

Expected behavior/code

The client should gracefully handle master node failover without restarting the client.

Environment

Lettuce 5.1.0.RELEASE, AWS ElastiCache 3-node multi-AZ cluster, Redis 3.2.4

mp911de commented 5 years ago

Thanks for reaching out. There are two issues from the report:

Master is currently unknown

You're using a static Master/Slave topology for your setup. We consider it static because there's no Redis Sentinel or Redis Cluster manager that can supply runtime topology details. On creating a StatefulRedisMasterSlaveConnection, Lettuce connects to each host specified via RedisURI and obtains its role.

Lettuce reuses role definitions as long as the connection is not closed. This also means that Lettuce isn't refreshing topology while a connection is connected.

From the description above, the master node was not connectable, so new connection requests weren't able to contact the master, and therefore their topology view did not contain a master.

See:

"2019-03-26T04:24:00.327-0600","java.util.concurrent.CompletionException: io.lettuce.core.RedisException: Master is currently unknown: [RedisMasterSlaveNode [redisURI=RedisURI […], role=SLAVE], RedisMasterSlaveNode [redisURI=RedisURI […], role=SLAVE]]

Pool exhausted

This is an assumption: Because the master went dysfunctional, commands on the master didn't complete and connections weren't released to the pool. This report does not contain how you acquire/release a connection, but in most scenarios, you would release a connection back to the pool, once a RedisFuture completes.
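That release-on-completion pattern can be sketched with a small, library-agnostic helper (this is not a Lettuce API; with the pool from the report, `acquire` would be `pool.acquire()`, `operation` something like `c -> c.async().get(key)`, and `release` would be `pool::release`):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.function.Consumer;
import java.util.function.Function;

// Generic sketch: run an async operation against an acquired resource and
// release it once the operation's future completes, whether it succeeded
// or failed, so the pool cannot leak connections on errors.
public final class PoolHelper {

    public static <R, T> CompletableFuture<T> withPooled(
            CompletableFuture<R> acquire,
            Function<R, CompletionStage<T>> operation,
            Consumer<R> release) {
        return acquire.thenCompose(resource ->
                operation.apply(resource)
                        // release in all cases: success, failure, cancellation
                        .whenComplete((value, error) -> release.accept(resource))
                        .toCompletableFuture());
    }
}
```

Releasing inside `whenComplete` guarantees the connection goes back to the pool even when the command's future completes exceptionally.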

To your question:

Is there something we need to configure differently?

I suggest using Redis Cluster or at least Redis Sentinel, as these modes of operation recover from failures because of active topology propagation. Redis Cluster isn't always possible because of limitations in transactions and cross-slot keys in multi-key commands.
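As an illustration, a Sentinel-based Master/Slave connection might look like the following sketch (the Sentinel host name and the master id `mymaster` are placeholders; the method name is illustrative). With Sentinel in the picture, Lettuce is notified of topology changes at runtime and can follow a failover without a restart:

```java
import io.lettuce.core.ReadFrom;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.ByteArrayCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

public class SentinelExample {

    public StatefulRedisMasterSlaveConnection<byte[], byte[]> connectViaSentinel() {
        // Sentinel supplies the current master and announces failovers,
        // so the topology is no longer static from the client's view.
        RedisURI sentinelUri = RedisURI.Builder
                .sentinel("sentinel-host", 26379, "mymaster")
                .build();

        RedisClient client = RedisClient.create();
        StatefulRedisMasterSlaveConnection<byte[], byte[]> connection =
                MasterSlave.connect(client, ByteArrayCodec.INSTANCE, sentinelUri);

        // Route reads to replicas when available; writes follow the current master.
        connection.setReadFrom(ReadFrom.SLAVE_PREFERRED);
        return connection;
    }
}
```

The same `RedisURI` can also be expressed as a `redis-sentinel://` URI string listing several Sentinels.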

kurtharriger commented 5 years ago

Thanks.

I initially assumed that using a multi-node cluster with auto-failover would mean we would not need to do anything if we lose the master node, but I see another related issue (https://github.com/lettuce-io/lettuce-core/issues/338) indicating this won't happen automatically, which also explains the failure during the manually initiated failover.

The odd thing about the initial incident is that AWS confirmed that the master node did not change. While the failure after the manual failover apparently requires a restart, I'm unclear why, if the configuration is static and the master did not change, the initial incident still required a service restart to recover.

Could you elaborate on what you mean by "using Redis Cluster"? Are we not? Aside from using async and a connection pool, I don't see that our configuration is meaningfully different from Example 3 (AWS ElastiCache Cluster) described here: https://github.com/lettuce-io/lettuce-core/wiki/Master-Slave

It appears from the AWS response that there was packet loss, so high connection latencies would be expected, which is probably what exhausted the pool. Although some packet loss apparently persisted, it didn't prevent the service from operating normally after we restarted the nodes via redeploy, so the initial packet-loss problem was probably very short.

Our code to acquire and release connections is here: https://bitbucket.org/snippets/atlassian/5enz7j I think you'll agree that the connection is released even in exceptional cases.

What isn't clear to me is how this code should be changed. Wouldn't closing the connections defeat the purpose of using the pool? Shouldn't the refreshing of connections, when needed, be handled by the pool?

mp911de commented 5 years ago

So there are a few setups where Redis and Cluster are used together. There are four "official" (meaning no external orchestration) modes of Redis operation:

- Redis Standalone
- Redis Standalone with replicas (Master/Slave)
- Redis Sentinel
- Redis Cluster

Typically, we see people calling all except Redis Standalone a cluster. AWS provides two models:

- ElastiCache for Redis with cluster mode disabled (a single shard with optional replicas)
- ElastiCache for Redis with cluster mode enabled (Redis Cluster)

Lettuce supports only the built-in Redis operation modes, as each cloud provider tends to attach orchestration bits to their service and we do not want to buy into complexity that goes beyond OSS Redis. Typically, when running Redis with a number of replicas, there is no automatic failover. The only failover that is possible requires intervention with Redis (reconfiguration of the Master/Slave setup), therefore we call it static in Lettuce. Because there is also no standard mechanism that would notify Lettuce of a reconfiguration, there is no way to reconfigure these connections while your application is active. Your application would need to close all connections and reopen these.
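A minimal, library-agnostic sketch of that close-and-reopen approach (the class and its methods are illustrative, not part of Lettuce): cache the connection, and after a failure such as a `RedisException` drop it so the next use reconnects and rediscovers the roles.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Illustrative sketch: cache a resource (e.g. a Master/Slave connection) and
// rebuild it after a failure, mirroring "close all connections and reopen".
public final class Reconnecting<T> {

    private final Supplier<T> factory;
    private final AtomicReference<T> current = new AtomicReference<>();

    public Reconnecting(Supplier<T> factory) {
        this.factory = factory;
    }

    // Return the cached resource, creating one on first use or after invalidate().
    public T get() {
        return current.updateAndGet(existing -> existing != null ? existing : factory.get());
    }

    // Call after a failure: drop the cached resource so the next get()
    // opens a fresh connection with a freshly discovered topology.
    public void invalidate() {
        current.set(null);
    }
}
```

In a real setup the factory would close the old connection before creating a new one; the point is only that reconnection, and with it topology rediscovery, has to be triggered by the application in a static setup.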

mp911de commented 5 years ago

Closing as there's nothing left to do.