Constantly Reestablishing Connections to AWS ElastiCache Redis in Cluster Mode (Continued)

ianjhoffman commented 2 years ago

(See previous issue for additional context and discussion)

Expected Behavior

A ClusterClient with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.

Example ClusterClient config:

redis.NewClusterClient(&redis.ClusterOptions{
    Addrs: []string{redisAddr},
    TLSConfig: &tls.Config{},
})

Current Behavior

(The Current Behavior section in the original issue is still accurate to this issue)

Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:

Amount of load on the system (# Redis commands)
Events happening on the ECRedis cluster (resharding, cycling out nodes, failovers, etc.)
Any other issues with the ECRedis cluster not normally visible in the AWS Console (we consulted with AWS support for this one)
Service restarts for our Go service that communicates with ECRedis

When this issue happens, running CLIENT LIST on the affected Redis node shows age=0 or age=1 for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.

After discussion on the previous issue we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.

Possible Solution

Redis ClusterClient should react more gracefully to that quickly devolves into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.

After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:

Increasing all timeouts in context.Contexts passed to Redis commands to 1 second or greater
Reducing # of goroutines exercising our Redis claim logic
Lowering connection pool size
Moving all operations out of MULTI/EXEC pipelines
Completely disabling idle connection reaping
Fuzzing the interval at which each individual cluster node's Client cleans idle connections
Changing the instance type of the ElastiCache Redis nodes
Rebalancing/increasing shard count for the ElastiCache Redis cluster

We are seeing the issue a lot less often after

Reducing # of goroutines exercising our Redis claim logic
Lowering pool size
Fuzzing the interval at which each individual cluster node's Client cleans idle connections

but are not confident that these changes constitute a permanent fix and are instead a consequence of there being fewer potential resources to overload an individual ECRedis cluster node.

Details About Fuzzing Idle Connection Cleanup

We opted to set IdleTimeout and IdleCheckFrequency on the redis.ClusterOptions passed to our client to -1 and added a custom NewClient implementation:

// A new redis.Client is created for each Redis cluster node to which a redis.ClusterClient
// sends commands. Make each redis.Client clean idle connections at a different cadence.
redisOpts.NewClient = func(clientOptions *redis.Options) *redis.Client {
    rand.Seed(time.Now().UnixNano())
    idleTimeoutFuzz := time.Duration(rand.Int63n(int64(idleTimeoutFuzzRange)))

    // It's safe to modify this *redis.Options in this way - a new data
    // structure is created and passed to this function for each client created
    clientOptions.IdleTimeout = baseIdleTimeout + idleTimeoutFuzz

    return redis.NewClient(clientOptions)
}

Resetting the System to a Good State

When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes SRANDMEMBER pending 10 -> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement from our solution in the previous issue: completely stopping all services and then turning them back on after some period of time. However, it's not a tenable solution to this issue if we are to continue onboarding more work to the system in production.

Steps to Reproduce

The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provide a fairly complete summary of how we're using both go-redis and ECRedis. We've not been able to consistently trigger this issue since it often happens when we're not load testing. I'd be interested to know if the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing the issue. It sounds like it's been difficult to reproduce for them as well.

Context (Environment)

We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows:

SRANDMEMBER pending 10 - grab up to 10 random items from the pool of available work
ZADD in_progress <current_timestamp> <grabbed_item> for each of our items we got in the previous step
Any work items we weren't able to ZADD have been claimed by some other instance of the service, skip them
Once we're done with a work item, SREM pending <grabbed_item>
Periodically ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago> so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window

Work item producers simply SADD pending <work_item> but the load they produce is miniscule compared to the consumers running the algorithm outlined above.

Currently we run this algorithm on 22 EC2 instances, each running one service. We've configured our ClusterClient to use a PoolSize of 10 (reduced from the default of 20 on these instances) for each child Client it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.

At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.

Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using github.com/go-redis/redis/v8 v8.0.0, github.com/go-redis/redis/v8 v8.11.2, and github.com/go-redis/redis/v8 v8.11.4 - all 3 have run into this issue. We are now using a fork of v8.11.4 (see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):

Add logs in spots where bad connections are removed
Add hooks in ClusterClient.process method for logging/metrics around various anomalies during processing commands (MOVED/ASK responses, errors getting nodes, node failures). We do not see these while the issue is happening.

We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of cache.r6g.large instances.

Detailed Description

N/A

Possible Implementation

N/A

vmihailenco commented 2 years ago

@enjmusic thanks for creating the isssue.

Add logs in spots where bad connections are removed

Do you see any logs? I guess not which may indicate that the idle conns reaper is indeed the problem. You could try to disable it completely to confirm that. Or add some logging here and here if you really need IdleTimeout.

ianjhoffman commented 2 years ago

@vmihailenco if I remember correctly, we were seeing a mixture of I/O timeout and context deadline exceeded logs when this issue was triggered.

We did completely disable idle connection reaping for a while (both the reaper goroutine and the IdleTimeout) and while this made the issue happen less often, we did still see it triggered once! So while it may be a contributing factor I don't know if it's the root cause. We haven't seen this issue lately after:

Reducing # of goroutines exercising our Redis claim logic
Lowering pool size
Fuzzing the interval at which each individual cluster node's Client cleans idle connections

but again, it's difficult to believe that we've completely addressed the issue or just made it exceedingly unlikely at our current scale. I'd be interested to know if we can get any more insights from @klakekent and @pedrocunha, assuming they're still seeing the issue periodically.

vmihailenco commented 2 years ago

I/O timeout and context deadline

Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.

ianjhoffman commented 2 years ago

Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.

Our web servers running go-redis code, as well as our ElastiCache Redis cluster, are both overprovisioned for the amount of work they're doing. They're always pretty bored when this issue is triggered. Given that:

we have a ton of CPU/memory headroom
multiple others have reported the same issue
we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts
@klakekent, like us, was told by AWS support that there was nothing suspicious going on from AWS's view of the ElastiCache Redis clusters

With these points in mind, I think it's very reasonable to consider them enough evidence to continue taking a closer look at go-redis. I would love to see more engagement around getting to the bottom of this issue rather than dismissing it a second time.

monkey92t commented 2 years ago

According to your log report closing bad conn: context deadline exceeded, I checked the working process of go-redis in detail, it only appears in initConn and withConn, I believe this error is caused by context timeout .

But according to your reply:

we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts

You did not encounter this problem in non-redis cluster mode, so what is your context timeout in non-redis cluster mode? Is an ACL operation required(initConn)?

skip initConn:

if c.opt.Password == "" && c.opt.DB == 0 && !c.opt.readOnly && c.opt.OnConnect == nil {
        // skip initConn
}

monkey92t commented 2 years ago

go-redis ClusterClient additionally caches the results of cluster slots, but it is executed asynchronously (except when each node is initialized), but if your redis-cluster nodes change frequently, it may greatly increase the command execution time (ask moved).

In addition, you may need to consider network issues (aws vpc), or the time spent processing commands by redis-cluster(> context.Timeout).

ianjhoffman commented 2 years ago

So, we have a theory about what might be going on. AWS ElastiCache Redis best practices dictate that one should use the "Cluster Configuration Endpoint" for cluster discovery, which we took to mean that it was ok to use that as our one seed address for the ClusterClient, assuming that every time it was reached out to it would route to a different node in the cluster. The auto discovery section of the ElastiCache FAQ states that

the Amazon ElastiCache cluster is also given a unique Configuration Endpoint which is a DNS Record that is valid for the lifetime of the cluster. This DNS Record contains the DNS Names of the nodes that belong to the cluster. Amazon ElastiCache will ensure that the Configuration Endpoint always points to at least one such “target” node. A query to the target node then returns endpoints for all the nodes of the cluster in question

which we took to mean that it would balance between all nodes. However, there's likely some DNS caching going on. When I nslookup our Configuration Endpoint I see the full list of IPs for all primaries and replicas, but when I openssl s_client -nocommands -connect <configuration_endpoint> I always end up on the same node in the cluster. I think this means that when the ClusterClient is seeded with just the Configuration Endpoint, in any situation where several goroutines executing code in cluster.go all need to fall back on the initial Addrs list provided when the ClusterClient was created, they'll all hammer the same shard.

We may be able to get around this issue by performing a LookupIP or similar query to get our Addrs list so that the initial list of nodes lends itself well to a fair division of lookups across nodes when needed.

We're probably going to implement this safety measure no matter what, but my main question after looking into the Configuration Endpoint (and its use as the only seed address for a ClusterClient) a little more is: what are the different spots in cluster.go where we fall back on the initially provided Addrs list from the ClusterOptions? I started looking deeply into the code for this today but figured I'd miss some stuff given that there's a lot of code for clusterNodes and clusterStateHolder.

monkey92t commented 2 years ago

There seems to be no evidence of a DNS problem, in the logs you see "context deadline exceeded", we should determine where more time is being consumed causing the context to time out.

We were unable to reproduce this issue in a test environment, you can add more logs to go-redis to locate where time is consuming:

https://github.com/go-redis/redis/blob/40e84a79bfb4fafc2d7b6770f7434311de08ce48/cluster.go#L771

dzqv commented 2 years ago

We have a theory about what might be going on... The auto discovery section...

@enjmusic Have you seen this post https://medium.com/tinder-engineering/taming-elasticache-with-auto-discovery-at-scale-dc5e7c4c9ad0 ?

Tinder found a bug in their Java client failing to dynamically reconfigure which IP it was connecting to following AWS Elasticache cluster maintenance events.

ianjhoffman commented 2 years ago

We finally saw this issue again during the night yesterday, but it affected eight different nodes, which we've never seen before. Previously it was always one. This may or may not disprove the cluster configuration endpoint issue. Again, temporarily stopping new ingress of jobs into Redis resolved the issue.

ianjhoffman commented 2 years ago

@dzqv just read the blog post! It's an interesting study but doesn't look like our issue where connection floods occur to some nodes independent of any cluster maintenance events

ianjhoffman commented 2 years ago

This time we did see the issue happen within a couple minutes of a

Certificate renewed for replication group <REDACTED> on Sun Mar 20 08:14:05 UTC 2022

... curious about whether or not that could put clients into a bad state

pedrocunha commented 2 years ago

We still don't know what caused this but a couple of things:

This only happened for a particular service that uses context timeouts
We have not seen this issue re-occur since we upgraded to Redis 6.2
On the online upgrade to 6.2 there are multiple failover events and we didn't experience any issues in regards to new connections
Our limiter callback seems to trigger frequently but we aren't sure if this is "saving" us

allanbenW commented 2 years ago

Just wanna add that we are also seeing this issue. We are using clustered Elasticache Redis 6.0.5 on r6g instances with TLS enabled, our clients are running on EC2 as well.

We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.

A few other params we also tuned to mitigate this issue:

lowering poolSize, and also make number of connections stable by setting PoolSize = MinIdleConns
disabling idle conn reap
add random jitter in dialer
stick to higher timeouts, set dialTimeout = 5s; readTimeout = writeTimeout = PoolTimeout = 1s
disabling retry by setting MaxRetries to -1

The changes we made are helpful to reduce the symptom, but it's still happening from time to time. Almost always it started with read timeouts on a single node, and then a huge TLS connection spike to that node. It issue happens and self-resolves within a minutes

@pedrocunha Are there any re-occurrences on Elasticache Redis 6.2 since you posted? https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/supported-engine-versions.html#redis-version-6.2 Their 6.2 version added multi-cpu optimization for TLS connections, so maybe that's helping.

ianjhoffman commented 2 years ago

Today we saw this issue recur despite our Redis cluster being on

Engine Version Compatibility: 6.2.5

Not sure if that means the nodes are actually running 6.2 or just something to do with which commands work.

That said, we now have a circuit breaker implemented in our writers that stops writes when errors are detected for given keys, which has helped with autoremediation.

ianjhoffman commented 2 years ago

We had this new logging in place on our go-redis fork for cluster state issues and the like, but didn't see any of it trigger during the issue.

ianjhoffman commented 2 years ago

Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!

vmihailenco commented 2 years ago

We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.

This article should explain why that helps.

ianjhoffman commented 2 years ago

Are there other good reasons to allow a context to be passed to Redis commands at all in go-redis since passing a deadline/timeout context is considered an antipattern according to this guidance?

vmihailenco commented 2 years ago

Yes, tracing requires context. Besides, it is not an antipattern - it is just hard to use context timeouts safely.

vmihailenco commented 2 years ago

But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.

klakekent commented 2 years ago

I tried to lower the pool size from 1300 to 800, and go-redis seems to work stable for about 2 weeks.

pedrocunha commented 2 years ago

We had one occurrence of this issue with r6g.xlarge and 6.2. Seems that tweaking the limiter to be more aggressive and reducing the pool size has helped us. We haven't had issues for nearly 3 weeks now. We haven't got rid of context deadline yet but it looks like a sensible thing to do, however in our case we would want these to be extremely small (< 200ms) as we rather still return something over waiting and clogging all the backend requests.

@enjmusic

Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!

I understand wanting more than 1 vCPU but large has 2 vCPUs. Is there anything in particular of 2 vs many more vCPU in this case?

But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.

👍 on the latter

pete-woods commented 2 years ago

After experiencing issues like this we have a theory that when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right. Possibly half creating a new connection and immediately giving up.

klakekent commented 2 years ago

I got this issue again yesterday, but this time i found the error message was "redis: connection pool timeout",and tried to check the metrics in ElastiCache console and find at that the new connection suddenly went to 1.6k, so seems only to decrease the pool size not fix this issue.

yoni-veksler commented 2 years ago

Anything new ? we started to experience this too on non managed redis cluster, random times. Seems like related to context timeouts. We have set a pretty big one to try to mitigate it (>5s), but it still happens occasionally. @vmihailenco maybe connections should not be closed on timeouts ? and only on unrecoverable errors ? or at least allowing this to be configured.

MEZk commented 2 years ago

Hi, guys!

I was able to reproduce the issue.

Env

Redis 6.2.5 go-redis v8.11.4 go 1.17 macOs BigSur 11.6.3

Steps to reproduce:

docker run -p 6379:6379 redis
execute the following code:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    rc := redis.NewRing(&redis.RingOptions{
        Addrs:              map[string]string{"0": "localhost:6379"},
        DB:                 0,
        MaxRetries:         -1,
        DialTimeout:        1 * time.Second,
        ReadTimeout:        50 * time.Millisecond, // ReadTimeout should be lower then context timeout
        PoolSize:           30,
        MinIdleConns:       30,
        IdleTimeout:        -1,
        IdleCheckFrequency: -1,
    })

    wg := sync.WaitGroup{}

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func(no int) {
            defer wg.Done()

            ctx := context.Background()
            // comment out next 2 lines of code to resolve timeout problem
            ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
            defer cancel()

            err := rc.Get(ctx, "foo").Err()

            fmt.Printf("%d: %s\n", no, err)
        }(i)
    }

    wg.Wait()
}

I also reproduced "use of closed network connection" problem with Redis pipeline:

package main

import (
    "context"
    "fmt"
    "sync"
    "time"

    "github.com/go-redis/redis/v8"
)

func main() {
    rc := redis.NewRing(&redis.RingOptions{
        Addrs:              map[string]string{"0": "localhost:6379"},
        DB:                 0,
        MaxRetries:         -1,
        DialTimeout:        1 * time.Second,
        ReadTimeout:        35 * time.Millisecond, // ReadTimeout should be lower then context timeout
        PoolSize:           50,
        MinIdleConns:       50,
        IdleTimeout:        -1,
        IdleCheckFrequency: -1,
    })

    wg := sync.WaitGroup{}

    for i := 0; i < 1000; i++ {
        wg.Add(1)
        go func(no int) {
            defer wg.Done()

            ctx := context.Background()
            // comment out next 2 lines of code to resolve timeout problem
            ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
            defer cancel()

            pipe := rc.Pipeline()

            pipe.Get(ctx, "foo1")
            pipe.Get(ctx, "foo2")
            pipe.Get(ctx, "foo3")

            _, err := pipe.Exec(ctx)

            fmt.Printf("%d: %s\n", no, err)
        }(i)
    }

    wg.Wait()
}

In the second example error with message "use of closed network connection" happens only with pipeline.

It seams that passing the context without timeout (as proposed here) works

As you can see ReadTimeout is lower than context timeout. When I do not set context timeout and use only ReadTimeout it works fine, but as far as I set context timeout I face timeouts and reuse of closed connection (in case of pipeline).

@yoni-veksler-px

maybe connections should not be closed on timeouts ?

I also think that connections should not be closed on timeouts.

@pete-woods

when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right.

Yes, it may be true.

@vmihailenco can you recheck my examples?

ameliagapin commented 1 year ago

Are there any updates here? We're experiencing exactly the same issues others have discussed here. We get spiking new connections that cause the Redis CPU to spike on at least one node. From there, everything goes to hell for a few minutes until the cluster recovers on its own.

We have tried tuning a lot of our config options and are currently running with a pool size of 64, but when we start up our service, we see a spike in connections that makes no sense given our settings. Immediately on start up, we see new connections spike to 10x what we'd expect when doing the math of pods X pool size.

This, in turn, causes issues as kubernetes rolls out more pods throughout the day as our traffic increases.

I updated our calls to go-redis to use a context.Background() instead of the context that we typically pass through, but we still see connection pool timeout errors.

konradreiche commented 1 year ago

For those who face the issue and perform primarily pipelined operations I would recommend giving github.com/joomcode/redispipe a shot which gives you a lot of freedom with no connection pool to worry about:

All known Golang redis connectors use a connection-per-request model with a connection pool, and provide only explicit pipelining.

This connector was created as implicitly pipelined from the ground up to achieve maximum performance in a highly concurrent environment. It writes all requests to single connection to redis, and continuously reads answers from another goroutine.

juancresc commented 1 year ago

we are also experiencing a similar issue. We use Cluster client with no context timeouts. Apparently the cause is the increasing amount of new connections which takes elasticache CPU super high

kjocevicius commented 1 year ago

We were having this issue with non-managed Redis Cluster. We had no timeouts/deadlines on Context, but we had lower read/write timeouts set on go-redis itself (150ms instead of default 3s) - removing these settings solved the issue.

It mostly all made sense after reading this article: https://uptrace.dev/blog/golang-context-timeout.html

tonyhb commented 1 year ago

How did that blog post clear up anything? A large portion of us — me included — aren't using context timeouts, yet we still see context deadline exceeded errors because the redis client is faulty.

tonyhb commented 1 year ago

I'm 100% sure this is the redis client. We're using memorydb, and whenever we see this exact issue we see new connections spin on our cluster:

We're on V8. We'll upgrade to V9, but people's hypothesis about spinning and recreating connections seems correct. There's an insidious bug that has runaway problems, and is exacerbated by Redis going to 100% CPU from this client churn.

ameliagapin commented 1 year ago

We solved it by switching to https://github.com/rueian/rueidis

tonyhb commented 1 year ago

We solved it by switching to https://github.com/rueian/rueidis

We've also done the same thing. Thank you.

arielzach commented 1 year ago

We are having the same issue with Redis 6.2. We are using the default configuration and experiencing spikes in connections.

arielzach commented 1 year ago

We solved it by switching to https://github.com/rueian/rueidis

We've also done the same thing. Thank you.

We did the same thing, migrating to rueidis didn't have any more problems.

Harleen13Singh commented 1 year ago

@vmihailenco I can see this boolean added in the V9 in the redis Options struct : // ContextTimeoutEnabled controls whether the client respects context timeouts and deadlines. // See https://redis.uptrace.dev/guide/go-redis-debugging.html#timeouts ContextTimeoutEnabled bool

Is this added to resolve the above issue?

rohansingh commented 4 months ago

We ran into similar issues with Memorystore (Google's version of Redis). This was happening even with ContextTimeoutEnabled: false, and a decent amount of work to make sure we were not passing in contexts with deadlines/timeouts.

I think that perhaps retrieving a connection from the connection pool ignores the ContextTimeoutEnabled setting.

However, we also just switched to redis/rueidis and it seems to have completely alleviated the issue, as well as reducing CPU usage on both our clients and Memorystore instances.

redis / go-redis