redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License

Re-establish connections in the case of Master/Slave failover #338

Open tomfitzhenry opened 8 years ago

tomfitzhenry commented 8 years ago

The master-slave docs say:

The connection needs to be re-established outside of lettuce in a case of a Master/Slave failover or topology changes.

So that users don't have to implement this themselves, it'd be great if lettuce could do this transparently.

mp911de commented 8 years ago

Hi @tomfitzhenry, thanks for the ticket. Could you tell me more about your use case?

Static-Master-Slave connections are designed intentionally that way - AWS ElastiCache reports internal IP addresses and the connection point details need to be user-provided.

Regular Master-Slave connections could be self-discovering. That would lead to a crawler-like behavior in which lettuce could discover all members of the Master-Slave setup, but it comes with some challenges:

  1. Any discovered node would either live forever or require a time-to-live. Which one is the correct approach?
  2. How to discover topology changes? Polling? Use connection events (i.e. connection disconnected)?

markrichardsextbbc commented 8 years ago

I work with Tom and put together a bit of code external to Lettuce (it calls shutdown and creates a new client) to recreate the client when it seemed to need topology or node updates.

From tests using the workaround, it seems the following would be a great help if it could be included in Lettuce.

  1. Let's assume node discovery is independent of this problem (maybe the app will update the node list by querying the AWS API). In that case, being able to update the nodes Lettuce is using would be a great help, e.g. statefulRedisMasterSlaveConnection.updateNodes(redisUris).
  2. Topology refresh:
    1. Polling: poll regularly for disconnected nodes (e.g. default 1/second), which the client would not use until they are back, and poll less frequently for topology changes (e.g. default 1/30 seconds), because there's another way of identifying a topology change…
    2. Read/write role exceptions: we saw exceptions about reading from the master and writing to a slave, which might be good candidates to trigger a topology refresh. Perhaps not on every occurrence (if the master is down, there is nowhere to write to until it is back or another node is promoted), so rate-limiting topology refreshes in this case (e.g. default 1/second) might be safer; but if you think Redis can handle the extra check per command in this situation, refreshing on each failure may be okay.
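
For reference, a minimal sketch of the external workaround described above: shut down the old client/connection and create new ones so role discovery runs again. This uses current lettuce 5.x API names; the class name, wrapper structure and host names are illustrative assumptions, not Lettuce API.

import java.util.List;

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.codec.StringCodec;
import io.lettuce.core.masterslave.MasterSlave;
import io.lettuce.core.masterslave.StatefulRedisMasterSlaveConnection;

class ReconnectingMasterSlave {

    private final List<RedisURI> nodes;
    private RedisClient client;
    private StatefulRedisMasterSlaveConnection<String, String> connection;

    ReconnectingMasterSlave(List<RedisURI> nodes) {
        this.nodes = nodes;
        connect();
    }

    private void connect() {
        client = RedisClient.create();
        connection = MasterSlave.connect(client, StringCodec.UTF8, nodes);
    }

    // Call this when a failover is suspected: drop the old client/connection and
    // reconnect so that lettuce re-runs role discovery against the configured nodes.
    synchronized void recreate() {
        connection.close();
        client.shutdown();
        connect();
    }

    StatefulRedisMasterSlaveConnection<String, String> connection() {
        return connection;
    }
}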

mp911de commented 8 years ago

Thanks for the detailed description. lettuce provides a similar facility for Redis Cluster (listening to events during operations; adaptive topology refresh).

I think it would make sense to expose the refresh trigger API and accept a custom TopologyProvider. The MasterSlave API is built on top of these components. Node details come from inside of the TopologyProvider and should not be set externally.

Users are able to build their own TopologyProvider and can provide RedisNodeDescriptions. The only point which requires a bit more thought is signaling demand for a topology refresh.
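
As a rough illustration of this idea only: a user-built TopologyProvider could look like the sketch below. This assumes the TopologyProvider/RedisNodeDescription shapes from the lettuce 5.x masterslave package and assumes such a provider could actually be plugged in, which is exactly what this ticket asks for; the class name and the idea of sourcing endpoints externally are made up for the example.

import java.util.Arrays;
import java.util.List;

import io.lettuce.core.RedisURI;
import io.lettuce.core.masterslave.TopologyProvider;
import io.lettuce.core.models.role.RedisInstance;
import io.lettuce.core.models.role.RedisNodeDescription;

class ExternallyManagedTopologyProvider implements TopologyProvider {

    private final RedisURI master;
    private final RedisURI replica;

    ExternallyManagedTopologyProvider(RedisURI master, RedisURI replica) {
        this.master = master;
        this.replica = replica;
    }

    @Override
    public List<RedisNodeDescription> getNodes() {
        // Node details come from externally known endpoints (e.g. user-provided CNAMEs)
        // instead of being discovered via INFO replication.
        return Arrays.asList(node(master, RedisInstance.Role.MASTER),
                node(replica, RedisInstance.Role.SLAVE));
    }

    private static RedisNodeDescription node(RedisURI uri, RedisInstance.Role role) {
        return new RedisNodeDescription() {

            @Override
            public RedisURI getUri() {
                return uri;
            }

            @Override
            public RedisInstance.Role getRole() {
                return role;
            }
        };
    }
}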

jaimebrolesi commented 6 years ago

@mp911de Hi Mark, today I'm trying to configure Lettuce as the Redis client with a Master/Slave topology, but I did not find any good article about the configuration. I read the documentation and it is not clear how I can re-establish the connection with AWS in a failover case. For example, do I need to switch the roles (master/slave) between the hosts when using the static topology? How can I re-create the connection? Is there some easy way to do this using Spring Boot? Do you have some kind of implementation? Thanks!

Follow my use case:

@Bean
public RedisClient redisClient() {
    return RedisClient.create(DefaultClientResources
            .builder()
            .dnsResolver(new DirContextDnsResolver())
            .reconnectDelay(Delay.constant(Duration.ofSeconds(reconnectionDelay)))
            .build());
}

@Bean
public StatefulRedisMasterSlaveConnection<String, String> redisConn(RedisClient redisClient) {
    RedisURI master = RedisURI.create("redis://****-001.****.****.****.amazonaws.com:6379");
    RedisURI slave = RedisURI.create("redis://****-002.****.****.****.amazonaws.com:6379");
    StatefulRedisMasterSlaveConnection<String, String> connect = MasterSlave.connect(
            redisClient,
            Utf8StringCodec.UTF8,
            Arrays.asList(master, slave));
    connect.setTimeout(Duration.ofSeconds(readTimeout));
    return connect;
}

mp911de commented 6 years ago

Hey @jaimebrolesi the short answer is: There is nothing available.

Longer version: AWS ElastiCache Master/Slave (and Master/Slave as known from Redis, without Sentinel) does not provide any details about topology changes. There's no way to discover that a failover (or reconfiguration) has happened. I'm not terribly familiar with AWS; maybe AWS provides events that can be captured in such a case.

Because Master/Slave changes are typically an operational task that is performed outside of Redis, we made the assumption that these things don't happen while an application is running. Changing a Master/Slave setup is basically not constrained in any way, so we can't assume that the currently connected nodes will persist across a change. Failovers/changes require an application restart to pick up the new configuration.

jaimebrolesi commented 6 years ago

@mp911de I think it's possible because AWS gives us a topic with the ElastiCache events. We could create a re-connection policy based on the failover event, using the INFO replication command. This is the harder way, I guess, because Lettuce would need to carry the AWS SDK for SQS/SNS consumption.

OR

We could define a new topology strategy for AWS: if a command times out 3 times (configurable), we start a re-connection policy using the same INFO replication strategy that MasterSlaveTopologyProvider uses in its getNodes() method. This is possible because AWS gives us two kinds of endpoints: a load-balanced one (with DNS issues) and an individual one (a hostname for each node).

What do you think?! I can help with AWS explanation or coding :) hehehe.
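
As a small illustration of the INFO replication part of this idea (application-side code, not something Lettuce provides), one could probe a configured endpoint and check which role it currently reports; the class name and endpoint handling are assumptions for the example.

import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisURI;
import io.lettuce.core.api.StatefulRedisConnection;

class ReplicationRoleProbe {

    // Returns true if the node behind the given URI currently reports role:master
    // in its INFO replication output.
    static boolean reportsMasterRole(RedisURI uri) {
        RedisClient probe = RedisClient.create();
        try (StatefulRedisConnection<String, String> connection = probe.connect(uri)) {
            return connection.sync().info("replication").contains("role:master");
        } finally {
            probe.shutdown();
        }
    }
}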

mp911de commented 6 years ago

I think we can increase visibility/provide an SPI to either trigger a refresh from outside or supply endpoint details so AWS-specific tooling can contribute to the topology/topology update. We will not introduce Cloud-specific (in this case AWS-specific) functionality to Lettuce that uses non-Redis infrastructure.

stuartharper commented 6 years ago

I'm investigating a similar issue to AWS elasticache but I'm also experiencing problems relating to DNS caching. In our use case it seems like hostname resolutions are cached forever regardless of the DNS resolver used because of the behaviour of SocketAddressResolver.

We're using lettuce 5.0.4 in Spring Boot 2.0.1

We set up our RedisConnectionFactory in the following way:

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .enableAllAdaptiveRefreshTriggers()
        .build();
ClientOptions clientOptions = ClusterClientOptions.builder()
        .topologyRefreshOptions(topologyRefreshOptions)
        .build();
ClientResources clientResources = DefaultClientResources.builder().dnsResolver(DnsResolvers.JVM_DEFAULT).build();

RedisClusterConfiguration redisClusterConfiguration = new RedisClusterConfiguration(clusterNodes);
LettuceClientConfiguration lettuceClientConfiguration = LettuceClientConfiguration.builder()
        .clientResources(clientResources)
        .clientOptions(clientOptions)
        .build();

return new LettuceConnectionFactory(redisClusterConfiguration, lettuceClientConfiguration);

(I've also tried not changing the DNS at all and using the default)

We're not using the static topology because we have a VPC connection into AWS and the member node IPs are accessible to us. Also if I understand DATAREDIS-580 and DATAREDIS-762 correctly it's currently not possible to use the static topology with spring data.

The above works fine except in the case where the IP address mapped to the hostname changes. This can be triggered manually by deleting and recreating the cluster with the same name but also the AWS docs explicitly warn that DNS mappings should not be cached and are prone to change. My experience with AWS services matches that.

The problem seems to be here in SocketAddressResolver https://github.com/lettuce-io/lettuce-core/blob/cf42638bcb1c4ba06ae68414bff0e484907475b1/src/main/java/io/lettuce/core/resource/SocketAddressResolver.java#L83

The DNS resolution is skipped if it's already been resolved.

It's very possible I'm missing something, but I don't see any way to configure the connection so that DNS is resolved again upon reconnection.

Is there something we can do within the bounds of spring data to resolve this? Will StatefulRedisMasterSlaveConnection improve the situation?

mp911de commented 6 years ago

The mentioned line is used when configuring Lettuce to use Unix Domain Sockets. Then, we use local file resolution to resolve the file descriptor.

Using DnsResolvers.JVM_DEFAULT applies the JVM's caching rules. Using DnsResolvers.UNRESOLVED falls back to netty's DNS resolution. You can also configure your own DNS resolver through DirContextDnsResolver: it uses the system-default DNS configuration, and you can also point it at an external DNS server.
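
For illustration, a minimal sketch of wiring DirContextDnsResolver into ClientResources so host names go through the system DNS configuration rather than the JVM cache (DnsResolvers.UNRESOLVED would be the alternative if netty should resolve); the wrapper class exists only for the example.

import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DirContextDnsResolver;

class DnsAwareClientResources {

    // ClientResources whose DNS lookups go through the system-default DNS servers
    // instead of relying on the JVM's resolution and caching behaviour.
    static ClientResources create() {
        return DefaultClientResources.builder()
                .dnsResolver(new DirContextDnsResolver())
                .build();
    }
}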

If this does not help, please file a new bug report along some details so we can have a closer look.

ldebello commented 5 years ago

Guys, I am dealing with this same issue. After an AWS failover I get an exception because writes go to read instances, since the master/read roles were only discovered at the beginning. Is there any config to allow reconnection in this exception case? Do you think we could do something similar to issue 822 and try to reconnect on exception based on some configuration? Would it be useful to create a jira for this, or is this something that is not going to be added?

jaimebrolesi commented 5 years ago

Luis, whenever a failover occurs your program will need about 60 seconds (the time AWS needs to change the IP behind the ELB) to identify the change between the write and read machines. For some weird reason the Java DNS resolver has problems picking up the IP change in the AWS environment; for this reason Mark developed the DirContextDnsResolver. All you can do is change the resolver used for reconnection and live with those 60 seconds of write exceptions because, like I said, that is the time AWS needs to change the IP.


stuartharper commented 5 years ago

@ldebello I haven't had a chance to do a failover test on the newer versions but in 5.0.4 setting dynamicRefreshSources to false and enablePeriodicRefresh to true had the effect of rediscovering the entire cluster on the configured interval.
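
For anyone looking for the exact options, a minimal sketch of that combination (periodic refresh against the originally configured seed nodes only); the interval and wrapper class are illustrative.

import java.time.Duration;

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

class SeedOnlyRefreshOptions {

    // Periodically re-read the topology from the configured seed nodes only
    // (dynamicRefreshSources = false), so the whole cluster is rediscovered
    // on the configured interval.
    static ClusterClientOptions create() {
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                .dynamicRefreshSources(false)
                .enablePeriodicRefresh(Duration.ofSeconds(30))
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .build();
    }
}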

Usiel commented 5 years ago

Thanks for the detailed description. lettuce provides a similar facility for Redis Cluster (listening to events during operations; adaptive topology refresh).

I think it would make sense to expose the refresh trigger API and accept a custom TopologyProvider. The MasterSlave API is built on top of these components. Node details come from inside of the TopologyProvider and should not be set externally.

Users are able to build their own TopologyProvider and can provide RedisNodeDescriptions. The only point which requires a bit more thought is signaling demand for a topology refresh.

@mp911de Do you have any pointers on this, do you think this is still a good way to solve this issue? Wouldn't mind working on this.

mp911de commented 5 years ago

The issue requires some design and this is the hard part. Writing down the code is the easy part here.

anhldbk commented 5 years ago

@mp911de Is this issue resolved by https://github.com/lettuce-io/lettuce-core/issues/672?

mp911de commented 5 years ago

No. #672 is a Redis Cluster issue. This one is Master/Slave without Redis Cluster.

sguillope commented 5 years ago

Hi @mp911de, I've been trying to read quite a bit on this (in particular #1008 and this issue) as we're having a very similar issue and have been looking for a workaround. We're using:

We use the default Spring Boot Data Redis' autoconfiguration (with pooling) by providing host and port (so basically a RedisStandaloneConfiguration)

When a primary/replica failover occurs, where the primary doesn't die but just changes roles, the application is not able to recover from it. Existing connections remain connected to the former primary node, and write commands fail with RedisCommandExecutionException: READONLY You can't write against a read only replica until we restart the application.

We've been looking at ways to catch that particular exception and force the LettuceConnectionFactory to re-establish the connections but there doesn't seem to be a good way to do that.

This is the best we could find at the moment

It looks something like this

try {
    // write to redis
} catch (RedisSystemException e) {
    if (e.getCause() instanceof RedisCommandExecutionException) {
        if (e.getCause().getMessage().startsWith("READONLY")) {
          final RedisConnection connection = connectionFactory.getConnection();
          connectionFactory.resetConnection();
          ((RedisAsyncCommands) connection.getNativeConnection()).getStatefulConnection().close();
        }
    }
}

After a few errors, eventually the connections are recreated and the application recovers.

Of course this is ugly as hell and most likely not intended at all. I'm also worried about unintended side-effects.

We're looking for guidance on how to best handle this scenario gracefully (manually restarting our dozens of instances is not an option).

charlesardsilva commented 4 years ago

Hi guys, is there any solution for this case? @mp911de, can you share your code for connecting to AWS Redis using master and slave? I need to configure the same stack and didn't find good configurations using Spring + Lettuce.

ldebello commented 4 years ago

@charlesardsilva AWS has added a master endpoint and a reader endpoint, so you could use those DNS names to solve the issue. We are using the following code:

@Bean
public RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration(
                @Value("${spring.redis.master-host:localhost}") String masterHost,
                @Value("${spring.redis.slave-host:localhost}") String slaveHost,
                @Value("${spring.redis.port:9991}") Integer port) {
    RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration =
        new RedisStaticMasterReplicaConfiguration(masterHost, port);
    redisStaticMasterReplicaConfiguration.addNode(slaveHost, port);
    return redisStaticMasterReplicaConfiguration;
}

@Bean
public LettuceConnectionFactory connectionFactory(
                RedisStaticMasterReplicaConfiguration redisStaticMasterReplicaConfiguration) {
    final SocketOptions socketOptions = SocketOptions.builder().connectTimeout(this.redisConnectTimeout).build();
    final ClientOptions clientOptions = ClientOptions.builder()
        .socketOptions(socketOptions)
        .autoReconnect(true)
        .build();

    LettuceClientConfiguration clientConfiguration = LettuceClientConfiguration.builder()
        .readFrom(ReadFrom.SLAVE_PREFERRED)
        .clientOptions(clientOptions)
        .commandTimeout(this.redisCommandTimeout)
        .build();
    return new LettuceConnectionFactory(redisStaticMasterReplicaConfiguration, clientConfiguration);
}

woowahankingbbode commented 4 years ago

@ldebello The reader endpoint round-robin routing works by changing the host that the DNS entry points to.

lettuce does not load balance slave nodes because it reuses connections.

Can you solve this problem?

ldebello commented 4 years ago

Currently we accepted the provided balancing; it is not perfect, but at least we can use master/replicas.

mp911de commented 4 years ago

@charlesardsilva

Is there any solution for this case

Nothing that we could do from a Lettuce-only perspective. Lettuce requires additional information about the new (changed) topology and that is something that needs to be provided externally.

We have faced the requirement to refresh the topology on demand a few times. We could provide a reloadNodes(Iterable<RedisURI> hint) method that accepts a topology hint. Master/Replica uses various strategies for topologies:

We need to figure out what a reloadNodes should do for each of these cases.
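
Purely as a sketch of how such a hint could be used from AWS-aware application code (reloadNodes does not exist in Lettuce today; the endpoints are placeholders):

// Hypothetical usage of the proposed API, not part of Lettuce:
connection.reloadNodes(Arrays.asList(
        RedisURI.create("redis://primary.example.com:6379"),
        RedisURI.create("redis://replica.example.com:6379")));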

jaimebrolesi commented 4 years ago

@mp911de I don't know if I'm right, but Lettuce uses the INFO command by default to identify which node is the master and which is the slave, right? So I believe that is the problem. When you use AWS ElastiCache, the IP returned by the INFO command is an AWS-internal IP; the real IP or VPC IP is not available from the INFO command. For AWS, the correct way to reload the nodes is to use the DNS names provided in the topology configuration.

mp911de commented 4 years ago

That is exactly why we have the "Static Master/Replica using provided endpoints" mode, which takes user-specified endpoint addresses. Lettuce figures out the roles from the given array of endpoints.

INFO replication is used for "Static Master/Replica using auto-discovery", which is intended mostly for on-premise setups.

We do not want to integrate with any sort of Cloud provider SDK as Lettuce is a Redis client, not a swiss army cloud knife.

For AWS, the correct way to reload the nodes is to use the DNS names provided in the topology configuration.

Single-node failover already works this way as the hostname is resolved upon (re)connect.

brianwebb11 commented 4 years ago

Let me see if my understanding is correct...

AWS ElastiCache behavior

AWS provides DNS CNAMEs for

AWS ElastiCache will automatically reconfigure the topology based on various events (individual node failure, manual master promotion, manually adding or removing a node in the AWS web console, etc). When any of these events occur, ElastiCache will make the changes to the topology and then update the DNS CNAME records when complete.

Example use case

Let's assume we have a long-running process using Lettuce to interact with the ElastiCache topology. This process has write requirements, consistent-read requirements (read from master), and high-volume read requirements where stale data is acceptable (i.e. it is ok to read from a replica). For a short-lived program we could probably get away with simply creating a "Static Master/Replica using provided endpoints" configuration based on the DNS names for the master and the replica load balancer. For a long-lived program we need (1) the ability to detect a topology change event and (2) the ability to reconfigure our Lettuce client when (1) fires. Let's also assume we are not running Redis in a cluster topology, just a simple master/replica configuration. Let's also assume the client is running in AWS as well, so we can take advantage of DNS resolving to the non-public IP addresses of the Redis nodes.

Approach

For ElastiCache it seems like there are two possible approaches we can take.

  1. Static Master/Replica using provided endpoints
  2. Static Master/Replica using auto-discovery

Initialization

For initialization of (1) I think we need to pass in a static list of the DNS CNAMEs for each individual Redis node (this is (C) from up above). For initialization of (2) we can simply pass in the DNS CNAME for the master and let Lettuce discover the topology (this is (A) from up above).
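
A minimal sketch of initialising option (1) with per-node CNAMEs, in the style of the earlier snippets in this thread; host names are placeholders, and SLAVE_PREFERRED reflects the high-volume read case described above.

RedisClient client = RedisClient.create();
StatefulRedisMasterSlaveConnection<String, String> connection = MasterSlave.connect(
        client,
        StringCodec.UTF8,
        Arrays.asList(
                RedisURI.create("redis://node-001.example.com:6379"),
                RedisURI.create("redis://node-002.example.com:6379")));

// Route reads to replicas where stale data is acceptable; writes always go to the master.
connection.setReadFrom(ReadFrom.SLAVE_PREFERRED);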

Topology change discovery

It seems like for both (1) and (2) you need to either have some logic that periodically inspects the topology to see whether it has changed since initialization, or catch exceptions (e.g. reading from a node that is offline or writing to a node that was changed from master to replica). My understanding is that it is straightforward to catch an exception, but the ability to inspect the topology and compare it to the topology at initialization time does not exist in Lettuce. Is that right?

Lettuce client update

Once a change is detected, the Lettuce client needs to be updated. In a high-volume, concurrent processing program where multiple threads may share Lettuce resources, it can be tricky to update the client to the new topology while the underlying topology change itself (AWS performing the change) is not instantaneous. For a static Redis topology it seems like you need to reinitialize a new client connection. There is no way to keep the existing Lettuce objects and tell them to re-initialize themselves for a non-cluster topology. Is that right?

My understanding is that there is more advanced support in Lettuce for detecting topology changes and updating the Lettuce client accordingly for both Sentinel and Redis Cluster. For the simple Master/Replica configuration, some of the Lettuce APIs do not apply.

It seems reasonable for Lettuce not to build cloud-specific logic into Lettuce itself. At the same time, the ElastiCache use case in non-cluster mode seems to be a common one. Users want the ElastiCache update mechanism to play nicely with Lettuce in a concurrent processing system. At this point, it is not clear how to do that. There seems to be an impedance mismatch, hence this open GitHub issue.

An ideal outcome would be documentation and example code on the Lettuce wiki that demonstrates how to use Lettuce with ElastiCache for high availability failover in a non-cluster, master/replica configuration. This may or may not prompt enhancements in Lettuce itself.

woowahankingbbode commented 4 years ago

Let's go back to the first problem and talk about the response to failover

node 1 - primary
node 2 - replica
node 3 - replica

after failover

node 1 - replica
node 2 - primary
node 3 - replica

However, in the StaticMasterSlave configuration, the nodes known to lettuce still list node 1 as primary.


This is because the StaticMasterSlave configuration does not support refreshing RedisNodeDescription after the initial connection.

Static Master / Slave runs a refresh only at the very beginning since there's no trigger that indicates a topology change (in contrast to Redis Sentinel).

This is understood.

However, I think this needs to be opened up so that the user can raise the refresh event or trigger it directly.

StaticMasterReplicaClientOptions

StaticMasterReplicaClientOptions.builder()
    .topologyRefreshOptions(StaticMasterReplicaTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.ofSeconds(30))
        .build())
    .build()

or

StaticMasterReplicaRedisClient

void reloadNodes()

I also think that picking up a dynamically growing set of nodes is not something the StaticMasterSlave strategy should do.

However, the above strategy can achieve a redistribution of the static nodes, and the reconnect logic for a single node can then work well.

FcoJavierSainz commented 3 years ago

It would be great if we could have an approach like ioredis' reconnect-on-error (https://github.com/luin/ioredis#reconnect-on-error), even with the Spring wrappers.

vikramg25 commented 3 years ago

Hi everyone, I am trying to write some code in my application to re-initialise the Redis client after a failover. Instead of catching the exception on read/write, is there a way to intercept all read/write operations and initialise a new client connection on a failure resulting from a failover? Can someone suggest a way to do it?

mp911de commented 3 years ago

Right now, there's no method to apply a new topology to a StatefulRedisMasterReplicaConnection. The only thing possible is to re-obtain StatefulRedisMasterReplicaConnection.

vikramg25 commented 3 years ago

Hi Mark,

Thanks for the quick response. Indeed, I am trying to re-obtain the Redis connection. But I am not in favor of obtaining the connection by surrounding every read/write operation with a try-catch block, as that needs code changes at every place where a read/write operation is invoked. Instead I want to intercept all read/write methods using Spring aspects. Read/write operations are defined inside the DefaultValueOperations class, and I can't create a bean of it to implement the aspect because its access modifier is default (package-private). Is there a way to intercept the read/write operations?

Thanks in advance.

Vikram


mp911de commented 3 years ago

Multi-node connections are always subject to mixed availability. Depending on what hosts are up/down, a connection may work for parts of the commands. There's no way to tell from the outside.

vikramg25 commented 3 years ago

OK. Can we do something in a unified place, instead of littering the whole code base with try-catch blocks, so that re-initialisation is triggered on read/write failure? Any small clue or idea would be a great help. Thanks.

mp911de commented 3 years ago

When using Spring, enable LettuceConnectionFactory.validateConnection so that Spring hands out a connection for which a PING succeeds.
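
For example (a minimal sketch, assuming the LettuceConnectionFactory is built from the configuration and client-configuration beans shown earlier in this thread):

LettuceConnectionFactory factory =
        new LettuceConnectionFactory(redisStaticMasterReplicaConfiguration, clientConfiguration);

// Validate connections before handing them out, so a broken connection is replaced
// instead of surfacing as a command failure.
factory.setValidateConnection(true);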

ctl321 commented 3 years ago

Can you explain more how LettuceConnectionFactory.validateConnection can eliminate the need for a try/catch block on a read/write failure with a StaticMasterReplicaRedisClient?

I'm getting the following error on failovers: RedisCommandExecutionException - READONLY You can't write against a read only replica. Will enabling validateConnection fix this?

lotyrin commented 2 years ago

Ran into this -- somehow one of our app's replicas didn't catch the topology update coming from Sentinel, or didn't respond to it correctly. Ideally there would be a nice, easy way to trigger a topology update and recover if we encounter READONLY exceptions.

razorree commented 1 year ago

Is this issue being solved? I've just encountered such a problem with ElastiCache: one replica was promoted to master (3 shards, master + 2 replicas), and the app couldn't write (PUT/DEL operations were failing) for 15 minutes: Caused by: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 2 second(s)

The problem was solved by redeploying the app (so the Redis client was reinitialised with the new topology).

How can that problem be mitigated?

HalfWeight commented 1 year ago

Hi all, any news on this?

I have a similar problem. I configured my Spring Boot application to use the Lettuce client, with 6 Redis nodes on ElastiCache (3 masters + 3 slaves). When a master goes down (to test a failover), the application stops working because it keeps trying to connect to the old master and gets connection timeouts.

The configuration follows:

spring:
  data:
    redis:
      database: 1
      timeout: 10000
      cluster:
        nodes: master01:6379,slave01:6379,master02:6379,slave02:6379,master03:6379,slave03:6379
        max-redirects: 6
      ssl: true
      lettuce:
        pool:
          max-idle: 8
          min-idle: 0
          max-active: 8
          enabled: true
        cluster:
          refresh:
            dynamic-refresh-sources: true
            adaptive: true
            period: PT1S
      client-type: lettuce