redis / lettuce

Advanced Java Redis client for thread-safe sync, async, and reactive usage. Supports Cluster, Sentinel, Pipelining, and codecs.
https://lettuce.io
MIT License
5.38k stars 965 forks source link

Issue with initial connection from Redis Cluster when command timeout is small #1612

Closed bahui80 closed 3 years ago

bahui80 commented 3 years ago

Description

I am implementing a rate limiter library using a Redis cluster. The main idea is to make a distributed rate limit across instances. Since we will be handling lots of requests and our system nowadays is response time sensitive, we want to fail fast if Redis is down somehow (connection lost, any error, etc).

So what we want basically is have a pretty small command timeout and we dont care about the connection timeout since the initial connection when the app starts can take long.

So our code basically does the rate limiting inside a big

try{
    rate limit logic
} catch (Exception e) {
    log and continue
} 

The main goal of this is that if redis fails or any of the rate limit logic fails, we let the request pass. We prefer to let more requests pass than failing or slowing requests since this service is pretty response time sensitive and crutial so we cant fail requests (but we still need to protect ourselves).

Issue

Now the issue is that when we set a really small command timeout lettuce fails to connect to Redis. This is our current configuration for the Redis connection (all Beans in Spring Boot):

Client Resources

@Bean(destroyMethod = "shutdown")
public ClientResources clientResources() {
    return DefaultClientResources.create();
}

Cluster Client Options

@Bean
public ClusterClientOptions createRedisClusterClientOptions() {
    final SocketOptions socketOptions = SocketOptions.builder()
            .connectTimeout(redisConfigurationProperties.getConnectionTimeout())
            .build();

    ClusterClientOptions.Builder clientOptions = ClusterClientOptions.builder()
            .autoReconnect(true)
            .disconnectedBehavior(ClientOptions.DisconnectedBehavior.REJECT_COMMANDS)
            .cancelCommandsOnReconnectFailure(true)
            .socketOptions(socketOptions)
            .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                    .dynamicRefreshSources(false)
                    .enablePeriodicRefresh()
                    .enableAllAdaptiveRefreshTriggers()
                    .closeStaleConnections(true)
                    .build());

    if (redisConfigurationProperties.getUseSsl()) {
        clientOptions.sslOptions(SslOptions.builder()
                .jdkSslProvider()
                .build());
    }

    return clientOptions.build();
}

Redis Cluster Client

@Bean(destroyMethod = "shutdown")
public RedisClusterClient redisClient(ClientResources clientResources, ClusterClientOptions clusterClientOptions) throws UnknownHostException {
    RedisURI redisUri = createRedisUri();

    RedisClusterClient redisClusterClient = RedisClusterClient.create(clientResources, redisUri);
    redisClusterClient.setOptions(clusterClientOptions);

    return redisClusterClient;
}

private RedisURI createRedisUri() throws UnknownHostException {
    String host = redisConfigurationProperties.getHost();
    int port = redisConfigurationProperties.getPort();
    boolean useSsl = redisConfigurationProperties.getUseSsl();

    LOGGER.info("Creating a new Redis connection to {}:{} with ssl set to {}", host, port, useSsl);
    return RedisURI.builder()
            .withHost(host)
            .withPort(port)
            .withSsl(useSsl)
            .withTimeout(redisConfigurationProperties.getCommandTimeout())
            .withPassword(redisConfigurationProperties.getPassword().toCharArray())
            .withClientName(InetAddress.getLocalHost().getHostName())
            .build();
}

There are some configs here. Those configs are spring boot configs. But from here two are quite important:

At first we had 1.5 seconds and 3 seconds and everything was working (it was for development only). But now that we want to go to prod and move the command timeout to 100 milliseconds the app is not coming up.

And we clearly see that the initial connection fails:

Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [io.lettuce.core.cluster.api.StatefulRedisClusterConnection]: Factory method 'createRedisConnection' threw exception; nested exception is io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:189)
    at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:588)
    ... 66 common frames omitted
Caused by: io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at io.lettuce.core.cluster.RedisClusterClient.lambda$assertInitialPartitions$32(RedisClusterClient.java:867)
    at io.lettuce.core.cluster.RedisClusterClient.get(RedisClusterClient.java:886)
    at io.lettuce.core.cluster.RedisClusterClient.assertInitialPartitions(RedisClusterClient.java:866)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:389)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:366)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration.createRedisConnection(RedisAutoConfiguration.java:118)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$2eb9064.CGLIB$createRedisConnection$3(<generated>)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$2eb9064$$FastClassBySpringCGLIB$$aad88165.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
    at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:358)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$2eb9064.createRedisConnection(<generated>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:162)
    ... 67 common frames omitted
Caused by: io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh$CannotRetrieveClusterPartitions: Cannot retrieve cluster partitions from [rediss://****************@someAmazonUrl.use1.cache.amazonaws.com:6379?clientName=someClientName&timeout=100000000ns]

Details:
    [rediss://****************@someAmazonUrl.use1.cache.amazonaws.com:6379?clientName=someClientName&timeout=100000000ns]: java.nio.channels.ClosedChannelException

    Suppressed: io.lettuce.core.RedisConnectionException: Unable to connect to [someAmazonUrl.use1.cache.amazonaws.com:6379]: java.nio.channels.ClosedChannelException
        at io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh.lambda$openConnections$11(DefaultClusterTopologyRefresh.java:339)

I obfuscated urls and client names.

Is it possible to set such a smaller commandTimeout and succeed in initial connection? Also, why command timeout is used for the initial connection? I assume it should only be using the connect timeout from the socket options

mp911de commented 3 years ago

You can get more control by configuring a TimeoutSource (via TimeoutOptions and ClientOptions). That interface is used to determine the per-command timeout. You can use a trigger or specific commands to give these a higher timeout (like the CLUSTER and CLIENT commands) and default to 100ms for all other ones.

bahui80 commented 3 years ago

Hi thanks for the answer. Yes I read about the TimeoutOptions but what i don't understand with my current configuration is what the timeout in the RedisURI is used for then:.withTimeout(redisConfigurationProperties.getCommandTimeout()) now.

And also the .connectTimeout(redisConfigurationProperties.getConnectionTimeout()) in the SocketOptions.

Is my problem now that for the initial connection a command is issued to retrieve the cluster partitions (so the connection itself triggers commands for connecting)? And since my command timeout is 100ms for every command, they are two small to retrieve cluster partitions?

If I specify the TimeoutSource, then I should remove those two (socket timeout and command timeout on redis uri) and handle all timeouts logic via there?

mp911de commented 3 years ago

RedisURI.timeout was first and that timeout was used for all sorts of cases (connect timeout, command timeout, SSL handshake timeout, …). Without configuring TimeoutOptions, RedisURI.timeout applies to commands.

bahui80 commented 3 years ago

Ok great thanks for the explanation. So this statement is true right: Is my problem now that for the initial connection a command is issued to retrieve the cluster partitions (so the connection itself triggers commands for connecting)? And since my command timeout is 100ms for every command, they are two small to retrieve cluster partitions?

Initial connection triggers some commands to be able to connect.

mp911de commented 3 years ago

The stack trace is a bit short and I didn't get yet a chance to reproduce the issue. Maybe I'll find some time next week or so. If I had to guess, I would say it's the PING or HELLO command that verifies the connection handshake.

bahui80 commented 3 years ago

Ok great, just in case, I did remove the pingBeforeActivateConnection config, so currently it is set to false.

I removed that one in fact because by using that one (prior understanding what you just explained), having a small command timeout make the initial ping fail almost every time. That was why I removed the ping option

bahui80 commented 3 years ago

Ok I did a little bit more digging into your proposal. I added to my ClusterClientOptions object the following:

.timeoutOptions(TimeoutOptions.builder()
                        .timeoutSource(new Saraza())
                        .build());

That class is defined as following:

public class Saraza extends TimeoutOptions.TimeoutSource {
    @Override
    public long getTimeout(RedisCommand<?, ?, ?> command) {
        if (command.getType().equals(CommandType.CLIENT) || command.getType().equals(CommandType.CLUSTER) || command.getType().equals(CommandType.INFO)) {
            return 60000;
        }
        return redisConfigurationProperties.getCommandTimeout().toMillis();
    }
}

So also with some integration tests I have (that use TestContainer to run Redis cluster), i debugged them and noticed that the three commands called during the connection were: CLIENT, CLUSTER and INFO.

I set my redisConfigurationProperties.getCommandTimeout() to 10 milliseconds on purpouse to make my tests fail.

Even though I added the timeoutOptions and specified that the initial connection commands have a longer command timeout, it is still failing to connect. I can share the full stack trace:

java.lang.IllegalStateException: Failed to load ApplicationContext

    at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContext(DefaultCacheAwareContextLoaderDelegate.java:124)
    at org.springframework.test.context.support.DefaultTestContext.getApplicationContext(DefaultTestContext.java:83)
    at org.springframework.test.context.support.DependencyInjectionTestExecutionListener.injectDependencies(DependencyInjectionTestExecutionListener.java:117)
    at org.springframework.test.context.support.DependencyInjectionTestExecutionListener.prepareTestInstance(DependencyInjectionTestExecutionListener.java:83)
    at org.springframework.boot.test.autoconfigure.SpringBootDependencyInjectionTestExecutionListener.prepareTestInstance(SpringBootDependencyInjectionTestExecutionListener.java:43)
    at org.springframework.test.context.TestContextManager.prepareTestInstance(TestContextManager.java:230)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.createTest(SpringJUnit4ClassRunner.java:228)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner$1.runReflectiveCall(SpringJUnit4ClassRunner.java:287)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.methodBlock(SpringJUnit4ClassRunner.java:289)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:247)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.runChild(SpringJUnit4ClassRunner.java:94)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.springframework.test.context.junit4.statements.RunBeforeTestClassCallbacks.evaluate(RunBeforeTestClassCallbacks.java:61)
    at org.springframework.test.context.junit4.statements.RunAfterTestClassCallbacks.evaluate(RunAfterTestClassCallbacks.java:70)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.springframework.test.context.junit4.SpringJUnit4ClassRunner.run(SpringJUnit4ClassRunner.java:191)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
    at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
Caused by: org.springframework.beans.factory.BeanCreationException: Error creating bean with name 'createRedisConnection' defined in class path resource [com/testing/runtime/services/redis/autoconfigure/RedisAutoConfiguration.class]: Bean instantiation via factory method failed; nested exception is org.springframework.beans.BeanInstantiationException: Failed to instantiate [io.lettuce.core.cluster.api.StatefulRedisClusterConnection]: Factory method 'createRedisConnection' threw exception; nested exception is io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:599)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.instantiateUsingFactoryMethod(AbstractAutowireCapableBeanFactory.java:1177)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBeanInstance(AbstractAutowireCapableBeanFactory.java:1072)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:511)
    at org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:481)
    at org.springframework.beans.factory.support.AbstractBeanFactory$1.getObject(AbstractBeanFactory.java:312)
    at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:230)
    at org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:308)
    at org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:197)
    at org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:756)
    at org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:867)
    at org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:542)
    at org.springframework.boot.context.embedded.EmbeddedWebApplicationContext.refresh(EmbeddedWebApplicationContext.java:123)
    at org.springframework.boot.SpringApplication.refresh(SpringApplication.java:666)
    at org.springframework.boot.SpringApplication.refreshContext(SpringApplication.java:353)
    at org.springframework.boot.SpringApplication.run(SpringApplication.java:300)
    at org.springframework.boot.test.context.SpringBootContextLoader.loadContext(SpringBootContextLoader.java:118)
    at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContextInternal(DefaultCacheAwareContextLoaderDelegate.java:98)
    at org.springframework.test.context.cache.DefaultCacheAwareContextLoaderDelegate.loadContext(DefaultCacheAwareContextLoaderDelegate.java:116)
    ... 25 more
Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [io.lettuce.core.cluster.api.StatefulRedisClusterConnection]: Factory method 'createRedisConnection' threw exception; nested exception is io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:189)
    at org.springframework.beans.factory.support.ConstructorResolver.instantiateUsingFactoryMethod(ConstructorResolver.java:588)
    ... 43 more
Caused by: io.lettuce.core.RedisConnectionException: Unable to establish a connection to Redis Cluster
    at io.lettuce.core.cluster.RedisClusterClient.lambda$assertInitialPartitions$32(RedisClusterClient.java:867)
    at io.lettuce.core.cluster.RedisClusterClient.get(RedisClusterClient.java:886)
    at io.lettuce.core.cluster.RedisClusterClient.assertInitialPartitions(RedisClusterClient.java:866)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:389)
    at io.lettuce.core.cluster.RedisClusterClient.connect(RedisClusterClient.java:366)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration.createRedisConnection(RedisAutoConfiguration.java:124)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$c4ae6dbc.CGLIB$createRedisConnection$3(<generated>)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$c4ae6dbc$$FastClassBySpringCGLIB$$72fa88e3.invoke(<generated>)
    at org.springframework.cglib.proxy.MethodProxy.invokeSuper(MethodProxy.java:228)
    at org.springframework.context.annotation.ConfigurationClassEnhancer$BeanMethodInterceptor.intercept(ConfigurationClassEnhancer.java:358)
    at com.testing.runtime.services.redis.autoconfigure.RedisAutoConfiguration$$EnhancerBySpringCGLIB$$c4ae6dbc.createRedisConnection(<generated>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.beans.factory.support.SimpleInstantiationStrategy.instantiate(SimpleInstantiationStrategy.java:162)
    ... 44 more
Caused by: io.lettuce.core.cluster.topology.DefaultClusterTopologyRefresh$CannotRetrieveClusterPartitions: Cannot retrieve cluster partitions from [redis://************************************@localhost:57322?clientName=myClientName&timeout=10000000ns]

Details:
    [redis://************************************@localhost:57322?clientName=myClientName&timeout=10000000ns]: Command timed out after 10 millisecond(s)

    Suppressed: io.lettuce.core.RedisCommandTimeoutException: Command timed out after 10 millisecond(s)
        at io.lettuce.core.internal.ExceptionFactory.createTimeoutException(ExceptionFactory.java:53)
        at io.lettuce.core.cluster.topology.Connections.lambda$doRequest$2(Connections.java:146)
        at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
        at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
        at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

For local tests I am not using SSL so its a simple connection.

mp911de commented 3 years ago

Since there isn't anything we can do here, closing the ticket. Setting timeouts that lead to handshake timeouts isn't something that we can address.