Open ianjhoffman opened 2 years ago
@enjmusic thanks for creating the issue.
Add logs in spots where bad connections are removed
Do you see any logs? I guess not, which may indicate that the idle conns reaper is indeed the problem. You could try to disable it completely to confirm that. Or add some logging here and here if you really need IdleTimeout.
@vmihailenco if I remember correctly, we were seeing a mixture of I/O timeout and context deadline exceeded logs when this issue was triggered.
We did completely disable idle connection reaping for a while (both the reaper goroutine and the IdleTimeout) and while this made the issue happen less often, we did still see it triggered once! So while it may be a contributing factor, I don't know if it's the root cause. We haven't seen this issue lately after:
but again, it's difficult to tell whether we've completely addressed the issue or just made it exceedingly unlikely at our current scale. I'd be interested to know if we can get any more insights from @klakekent and @pedrocunha, assuming they're still seeing the issue periodically.
I/O timeout and context deadline
Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.
Well, it may be an indication that you need a larger server / better network. There is no evidence it is a bug in go-redis.
Our web servers running go-redis code, as well as our ElastiCache Redis cluster, are both overprovisioned for the amount of work they're doing. They're always pretty bored when this issue is triggered. Given that:
- we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts
With these points in mind, I think it's very reasonable to consider them enough evidence to continue taking a closer look at go-redis. I would love to see more engagement around getting to the bottom of this issue rather than dismissing it a second time.
According to your log report closing bad conn: context deadline exceeded, I checked the working process of go-redis in detail; the error only appears in initConn and withConn, so I believe it is caused by a context timeout.
But according to your reply:
- we use go-redis in non cluster mode in other applications with similar or greater load per node and never observed this issue over 2+ years in those contexts
You did not encounter this problem in non-cluster mode, so what is your context timeout in non-cluster mode? Is an ACL operation required (initConn)?
skip initConn:
if c.opt.Password == "" && c.opt.DB == 0 && !c.opt.readOnly && c.opt.OnConnect == nil {
	// skip initConn
}
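For illustration (placeholder address, and assuming github.com/go-redis/redis/v8 is imported as redis), a client configured like this takes that fast path and never runs initConn:

rdb := redis.NewClient(&redis.Options{
	Addr: "localhost:6379", // placeholder
	// Password left empty, DB left at 0, no OnConnect hook:
	// the connection-initialization step (initConn) is skipped entirely.
})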
go-redis ClusterClient additionally caches the results of cluster slots, and the refresh is executed asynchronously (except when each node is initialized); but if your redis-cluster nodes change frequently, it may greatly increase command execution time (ASK/MOVED redirects).
In addition, you may need to consider network issues (AWS VPC), or the time spent processing commands by redis-cluster (> context timeout).
So, we have a theory about what might be going on. AWS ElastiCache Redis best practices dictate that one should use the "Cluster Configuration Endpoint" for cluster discovery, which we took to mean that it was ok to use that as our one seed address for the ClusterClient, assuming that every time it was reached out to it would route to a different node in the cluster. The auto discovery section of the ElastiCache FAQ states that
the Amazon ElastiCache cluster is also given a unique Configuration Endpoint which is a DNS Record that is valid for the lifetime of the cluster. This DNS Record contains the DNS Names of the nodes that belong to the cluster. Amazon ElastiCache will ensure that the Configuration Endpoint always points to at least one such “target” node. A query to the target node then returns endpoints for all the nodes of the cluster in question
which we took to mean that it would balance between all nodes. However, there's likely some DNS caching going on. When I nslookup our Configuration Endpoint I see the full list of IPs for all primaries and replicas, but when I openssl s_client -nocommands -connect <configuration_endpoint> I always end up on the same node in the cluster. I think this means that when the ClusterClient is seeded with just the Configuration Endpoint, in any situation where several goroutines executing code in cluster.go all need to fall back on the initial Addrs list provided when the ClusterClient was created, they'll all hammer the same shard.
We may be able to get around this issue by performing a LookupIP or similar query to get our Addrs list so that the initial list of nodes lends itself well to a fair division of lookups across nodes when needed.
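A sketch of that idea (the endpoint name is a placeholder, and a real implementation would probably also re-resolve periodically rather than only at client creation):

// assumes: import ("fmt"; "net"; "github.com/go-redis/redis/v8")
func newSeededClusterClient(configEndpoint string, port int) (*redis.ClusterClient, error) {
	// Resolve the Configuration Endpoint to the full set of node IPs so the
	// initial Addrs list is spread across nodes instead of one cached DNS name.
	ips, err := net.LookupIP(configEndpoint)
	if err != nil {
		return nil, err
	}
	addrs := make([]string, 0, len(ips))
	for _, ip := range ips {
		addrs = append(addrs, fmt.Sprintf("%s:%d", ip.String(), port))
	}
	return redis.NewClusterClient(&redis.ClusterOptions{Addrs: addrs}), nil
}

// e.g. newSeededClusterClient("clustercfg.<redacted>.cache.amazonaws.com", 6379)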
We're probably going to implement this safety measure no matter what, but my main question after looking into the Configuration Endpoint (and its use as the only seed address for a ClusterClient) a little more is: what are the different spots in cluster.go where we fall back on the initially provided Addrs list from the ClusterOptions? I started looking deeply into the code for this today but figured I'd miss some stuff given that there's a lot of code for clusterNodes and clusterStateHolder.
There seems to be no evidence of a DNS problem; since the logs show "context deadline exceeded", we should determine where the extra time is being consumed that causes the context to time out.
We were unable to reproduce this issue in a test environment; you can add more logs to go-redis to locate where the time is being consumed:
https://github.com/go-redis/redis/blob/40e84a79bfb4fafc2d7b6770f7434311de08ce48/cluster.go#L771
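If adding logs inside go-redis itself is inconvenient, a v8 Hook can give rough per-command timings from the outside. A minimal sketch (the 50ms threshold and the log format are arbitrary illustrative choices):

// assumes: import ("context"; "log"; "time"; "github.com/go-redis/redis/v8")
type timingHook struct{}

type startTimeKey struct{}

func (timingHook) BeforeProcess(ctx context.Context, cmd redis.Cmder) (context.Context, error) {
	return context.WithValue(ctx, startTimeKey{}, time.Now()), nil
}

func (timingHook) AfterProcess(ctx context.Context, cmd redis.Cmder) error {
	if start, ok := ctx.Value(startTimeKey{}).(time.Time); ok {
		if elapsed := time.Since(start); elapsed > 50*time.Millisecond {
			log.Printf("slow redis command %s took %s", cmd.Name(), elapsed)
		}
	}
	return nil
}

func (timingHook) BeforeProcessPipeline(ctx context.Context, cmds []redis.Cmder) (context.Context, error) {
	return context.WithValue(ctx, startTimeKey{}, time.Now()), nil
}

func (timingHook) AfterProcessPipeline(ctx context.Context, cmds []redis.Cmder) error {
	if start, ok := ctx.Value(startTimeKey{}).(time.Time); ok {
		log.Printf("redis pipeline of %d commands took %s", len(cmds), time.Since(start))
	}
	return nil
}

// registered once after creating the client: clusterClient.AddHook(timingHook{})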
We have a theory about what might be going on... The auto discovery section...
@enjmusic Have you seen this post https://medium.com/tinder-engineering/taming-elasticache-with-auto-discovery-at-scale-dc5e7c4c9ad0 ?
Tinder found a bug in their Java client failing to dynamically reconfigure which IP it was connecting to following AWS Elasticache cluster maintenance events.
We finally saw this issue again during the night yesterday, but it affected eight different nodes, which we've never seen before. Previously it was always one. This may or may not disprove the cluster configuration endpoint issue. Again, temporarily stopping new ingress of jobs into Redis resolved the issue.
@dzqv just read the blog post! It's an interesting study, but it doesn't look like our issue, where connection floods occur to some nodes independent of any cluster maintenance events.
This time we did see the issue happen within a couple minutes of a
Certificate renewed for replication group <REDACTED> on Sun Mar 20 08:14:05 UTC 2022
... curious about whether or not that could put clients into a bad state
We still don't know what caused this but a couple of things:
Just wanna add that we are also seeing this issue. We are using clustered Elasticache Redis 6.0.5 on r6g instances with TLS enabled, our clients are running on EC2 as well.
We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.
A few other params we also tuned to mitigate this issue:
- PoolSize = MinIdleConns
- MaxRetries set to -1
The changes we made are helpful to reduce the symptom, but it's still happening from time to time. Almost always it starts with read timeouts on a single node, and then a huge TLS connection spike to that node. The issue happens and self-resolves within a few minutes.
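For reference, those settings look roughly like this (illustrative placeholder values, not our production numbers):

// assumes: import ("context"; "github.com/go-redis/redis/v8")
rdb := redis.NewClusterClient(&redis.ClusterOptions{
	Addrs:        []string{"<configuration-endpoint>:6379"},
	PoolSize:     50, // keep PoolSize equal to MinIdleConns
	MinIdleConns: 50,
	MaxRetries:   -1, // disable retries inside go-redis
})
// Pass an empty context rather than a request context carrying a deadline/timeout.
if err := rdb.Set(context.Background(), "key", "value", 0).Err(); err != nil {
	// handle the error
}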
@pedrocunha Are there any re-occurrences on Elasticache Redis 6.2 since you posted? https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/supported-engine-versions.html#redis-version-6.2 Their 6.2 version added multi-cpu optimization for TLS connections, so maybe that's helping.
Today we saw this issue recur despite our Redis cluster being on Engine Version Compatibility: 6.2.5.
Not sure if that means the nodes are actually running 6.2 or just something to do with which commands work.
That said, we now have a circuit breaker implemented in our writers that stops writes when errors are detected for given keys, which has helped with autoremediation.
We had this new logging in place on our go-redis fork for cluster state issues and the like, but didn't see any of it trigger during the issue.
Actually, our nodes are cache.r6g.large right now, which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!
We were able to greatly mitigate the issue by passing empty context to go-redis, instead of the context with deadline/timeout.
This article should explain why that helps.
Are there other good reasons to allow a context to be passed to Redis commands at all in go-redis, since passing a deadline/timeout context is considered an antipattern according to this guidance?
But, perhaps, we should add an option to ignore the context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.
I tried lowering the pool size from 1300 to 800, and go-redis seems to have worked stably for about 2 weeks.
We had one occurrence of this issue with r6g.xlarge and 6.2. It seems that tweaking the limiter to be more aggressive and reducing the pool size has helped us. We haven't had issues for nearly 3 weeks now. We haven't got rid of context deadlines yet, but it looks like a sensible thing to do; however, in our case we would want these to be extremely small (< 200ms), as we would rather still return something than wait and clog up all the backend requests.
@enjmusic
Actually, our nodes are cache.r6g.large right now which means they wouldn't reap the benefits of multi-cpu optimization for TLS connections!
I understand wanting more than 1 vCPU, but large has 2 vCPUs. Is there anything in particular about 2 vs. many more vCPUs in this case?
But, perhaps, we should add an option to ignore context.Context timeout. Or just ignore context.Context when ReadTimeout/WriteTimeout are set.
👍 on the latter
After experiencing issues like this we have a theory that when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right. Possibly half creating a new connection and immediately giving up.
I got this issue again yesterday, but this time I found the error message was "redis: connection pool timeout". I checked the metrics in the ElastiCache console and found that new connections suddenly went to 1.6k, so it seems that just decreasing the pool size does not fix this issue.
Anything new? We started to experience this too, on a non-managed Redis cluster, at random times. It seems related to context timeouts. We have set a pretty big one to try to mitigate it (>5s), but it still happens occasionally. @vmihailenco maybe connections should not be closed on timeouts, and only on unrecoverable errors? Or at least allow this to be configured.
Hi, guys!
I was able to reproduce the issue.
Redis 6.2.5, go-redis v8.11.4, Go 1.17, macOS Big Sur 11.6.3
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rc := redis.NewRing(&redis.RingOptions{
		Addrs:              map[string]string{"0": "localhost:6379"},
		DB:                 0,
		MaxRetries:         -1,
		DialTimeout:        1 * time.Second,
		ReadTimeout:        50 * time.Millisecond, // ReadTimeout should be lower than the context timeout
		PoolSize:           30,
		MinIdleConns:       30,
		IdleTimeout:        -1,
		IdleCheckFrequency: -1,
	})
	wg := sync.WaitGroup{}
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(no int) {
			defer wg.Done()
			ctx := context.Background()
			// comment out the next 2 lines of code to resolve the timeout problem
			ctx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
			defer cancel()
			err := rc.Get(ctx, "foo").Err()
			fmt.Printf("%d: %s\n", no, err)
		}(i)
	}
	wg.Wait()
}
I also reproduced the "use of closed network connection" problem with a Redis pipeline:
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/go-redis/redis/v8"
)

func main() {
	rc := redis.NewRing(&redis.RingOptions{
		Addrs:              map[string]string{"0": "localhost:6379"},
		DB:                 0,
		MaxRetries:         -1,
		DialTimeout:        1 * time.Second,
		ReadTimeout:        35 * time.Millisecond, // ReadTimeout should be lower than the context timeout
		PoolSize:           50,
		MinIdleConns:       50,
		IdleTimeout:        -1,
		IdleCheckFrequency: -1,
	})
	wg := sync.WaitGroup{}
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func(no int) {
			defer wg.Done()
			ctx := context.Background()
			// comment out the next 2 lines of code to resolve the timeout problem
			ctx, cancel := context.WithTimeout(ctx, 200*time.Millisecond)
			defer cancel()
			pipe := rc.Pipeline()
			pipe.Get(ctx, "foo1")
			pipe.Get(ctx, "foo2")
			pipe.Get(ctx, "foo3")
			_, err := pipe.Exec(ctx)
			fmt.Printf("%d: %s\n", no, err)
		}(i)
	}
	wg.Wait()
}
In the second example, the error with message "use of closed network connection" happens only with the pipeline.
It seems that passing a context without a timeout (as proposed here) works.
As you can see, ReadTimeout is lower than the context timeout. When I do not set a context timeout and use only ReadTimeout it works fine, but as soon as I set a context timeout I face timeouts and reuse of closed connections (in the case of the pipeline).
@yoni-veksler-px
maybe connections should not be closed on timeouts ?
I also think that connections should not be closed on timeouts.
@pete-woods
when Go Redis is trying to get a connection from its pool, and the context expires before it gets one, it doesn't handle this situation quite right.
Yes, it may be true.
@vmihailenco can you recheck my examples?
Are there any updates here? We're experiencing exactly the same issues others have discussed here. We get spiking new connections that cause the Redis CPU to spike on at least one node. From there, everything goes to hell for a few minutes until the cluster recovers on its own.
We have tried tuning a lot of our config options and are currently running with a pool size of 64, but when we start up our service, we see a spike in connections that makes no sense given our settings. Immediately on startup, we see new connections spike to 10x what we'd expect when doing the math of pods × pool size.
This, in turn, causes issues as Kubernetes rolls out more pods throughout the day as our traffic increases.
I updated our calls to go-redis to use a context.Background() instead of the context that we typically pass through, but we still see connection pool timeout errors.
For those who face the issue and perform primarily pipelined operations I would recommend giving github.com/joomcode/redispipe a shot which gives you a lot of freedom with no connection pool to worry about:
All known Golang redis connectors use a connection-per-request model with a connection pool, and provide only explicit pipelining.
This connector was created as implicitly pipelined from the ground up to achieve maximum performance in a highly concurrent environment. It writes all requests to single connection to redis, and continuously reads answers from another goroutine.
We are also experiencing a similar issue. We use the Cluster client with no context timeouts. Apparently the cause is the increasing amount of new connections, which drives ElastiCache CPU super high.
We were having this issue with a non-managed Redis Cluster. We had no timeouts/deadlines on Context, but we had lower read/write timeouts set on go-redis itself (150ms instead of the default 3s) - removing these settings solved the issue.
It mostly all made sense after reading this article: https://uptrace.dev/blog/golang-context-timeout.html
How did that blog post clear up anything? A large portion of us — me included — aren't using context timeouts, yet we still see context deadline exceeded errors because the redis client is faulty.
I'm 100% sure this is the redis client. We're using memorydb, and whenever we see this exact issue we see new connections spin on our cluster:
We're on V8. We'll upgrade to V9, but people's hypothesis about spinning and recreating connections seems correct. There's an insidious bug that has runaway problems, and is exacerbated by Redis going to 100% CPU from this client churn.
We solved it by switching to https://github.com/rueian/rueidis
We solved it by switching to https://github.com/rueian/rueidis
We've also done the same thing. Thank you.
We are having the same issue with Redis 6.2. We are using the default configuration and experiencing spikes in connections.
We solved it by switching to https://github.com/rueian/rueidis
We've also done the same thing. Thank you.
We did the same thing; after migrating to rueidis we didn't have any more problems.
@vmihailenco I can see this boolean added in the v9 redis Options struct:
// ContextTimeoutEnabled controls whether the client respects context timeouts and deadlines.
// See https://redis.uptrace.dev/guide/go-redis-debugging.html#timeouts
ContextTimeoutEnabled bool
Is this added to resolve the above issue?
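If so, then presumably a v9 client that should keep honoring only its own ReadTimeout/WriteTimeout would be configured along these lines (sketch based on that comment; the address and timeout values are placeholders):

// assumes go-redis v9
rdb := redis.NewClient(&redis.Options{
	Addr:                  "localhost:6379",
	ReadTimeout:           3 * time.Second,
	WriteTimeout:          3 * time.Second,
	ContextTimeoutEnabled: false, // don't turn context deadlines into per-command timeouts
})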
We ran into similar issues with Memorystore (Google's version of Redis). This was happening even with ContextTimeoutEnabled: false, and despite a decent amount of work to make sure we were not passing in contexts with deadlines/timeouts.
I think that perhaps retrieving a connection from the connection pool ignores the ContextTimeoutEnabled setting.
However, we also just switched to redis/rueidis and it seems to have completely alleviated the issue, as well as reducing CPU usage on both our clients and Memorystore instances.
(See previous issue for additional context and discussion)
Expected Behavior
A ClusterClient with bare-minimum settings for connecting to an AWS ElastiCache Redis cluster (from here on referred to as ECRedis) will, at no point, contribute to a scenario in which it is constantly reestablishing connections to an ECRedis cluster node.
Example ClusterClient config:
Current Behavior
(The Current Behavior section in the original issue is still accurate to this issue)
Occasionally, we see connections being constantly re-established to one of our ECRedis cluster nodes at the limit of how many new connections are possible (~15k/minute is the reported rate). Redis nodes are essentially single-threaded and negotiating TLS for new connections takes up 100% of this node's CPU, preventing it from doing any other work. The time at which this issue occurs seems random, and we cannot correlate it to:
When this issue happens, running CLIENT LIST on the affected Redis node shows age=0 or age=1 for all connections every time, which reinforces that connections are being dropped constantly for some reason. New connections plummet on other shards in the Redis cluster, and are all concentrated on one.
After discussion on the previous issue we are no longer the only ones experiencing this unknown problem, and it is blocking us from further relying on this service we've been building (and delaying full use of) for well over 6 months now.
Possible Solution
ClusterClient should react more gracefully to whatever conditions lead to Redis quickly devolving into constantly reestablishing connections to some node in the AWS ElastiCache Redis cluster.
After the previous issue, we have tried a variety of approaches to mitigate this problem, none of which have solved it entirely:
- Limiting the deadlines of context.Contexts passed to Redis commands to 1 second or greater
- Using MULTI/EXEC pipelines
We are seeing the issue a lot less often after
but are not confident that these changes constitute a permanent fix; the improvement may instead be a consequence of there being fewer potential resources to overload an individual ECRedis cluster node.
Details About Fuzzing Idle Connection Cleanup
We opted to set IdleTimeout and IdleCheckFrequency on the redis.ClusterOptions passed to our client to -1 and added a custom NewClient implementation:
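The actual implementation isn't reproduced in this issue; purely as an illustration, a NewClient override of this general shape can be used to jitter idle connection cleanup per child client (the jitter values below are made up):

// assumes: import ("math/rand"; "time"; "github.com/go-redis/redis/v8")
opts := &redis.ClusterOptions{
	Addrs:              seedAddrs, // hypothetical seed address list
	IdleTimeout:        -1,        // disable the shared idle-connection settings...
	IdleCheckFrequency: -1,
	NewClient: func(opt *redis.Options) *redis.Client {
		// ...and re-enable them per child client with jittered values so the
		// per-node pools don't all reap idle connections at the same moment.
		opt.IdleTimeout = 5*time.Minute + time.Duration(rand.Intn(60))*time.Second
		opt.IdleCheckFrequency = time.Minute + time.Duration(rand.Intn(30))*time.Second
		return redis.NewClient(opt)
	},
}
client := redis.NewClusterClient(opts)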
Resetting the System to a Good State
When this issue happens, we are generally able to reset it to a good state by telling producers to stop producing new work to the system, which in turn means our consumer goroutines running the algorithm detailed later in this issue stop seeing new work items and their algorithm essentially just becomes SRANDMEMBER pending 10 -> sleep 10ms -> repeat. We can then resume producer writes and the system goes back into a good state without connections being constantly reestablished. This is an improvement from our solution in the previous issue: completely stopping all services and then turning them back on after some period of time. However, it's not a tenable solution to this issue if we are to continue onboarding more work to the system in production.
Steps to Reproduce
The description of our environment/service implementation below, as well as the snippet of our NewClusterClient call at the beginning of this issue, provide a fairly complete summary of how we're using both go-redis and ECRedis. We've not been able to consistently trigger this issue since it often happens when we're not load testing. I'd be interested to know if the other commenters on the original issue, @klakekent and @pedrocunha, have had any luck reproducing the issue. It sounds like it's been difficult to reproduce for them as well.
Context (Environment)
We're running a service that has a simple algorithm for claiming work from a Redis set, doing something (unrelated to Redis) with it, and then cleaning it up from Redis. In a nutshell, the algorithm is as follows (a rough code sketch follows the list):
- SRANDMEMBER pending 10 - grab up to 10 random items from the pool of available work
- ZADD in_progress <current_timestamp> <grabbed_item> for each of our items we got in the previous step
- If any of the items we ZADD have been claimed by some other instance of the service, skip them
- SREM pending <grabbed_item>
- ZREMRANGEBYSCORE in_progress -inf <5_seconds_ago> so that claimed items aren't claimed forever, but items can only be claimed exactly once during a 5 second window
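For concreteness, a rough go-redis sketch of one iteration of the consumer loop above (illustrative only: the key names come from the steps listed, ZAddNX is just one way to express the skip-if-already-claimed step, and error handling is simplified):

// assumes: import ("context"; "strconv"; "time"; "github.com/go-redis/redis/v8")
func claimOnce(ctx context.Context, rdb *redis.ClusterClient) error {
	// Grab up to 10 random items from the pool of available work.
	items, err := rdb.SRandMemberN(ctx, "pending", 10).Result()
	if err != nil {
		return err
	}
	now := float64(time.Now().Unix())
	for _, item := range items {
		// Claim the item; with NX, ZADD returns 0 if another instance already claimed it.
		added, err := rdb.ZAddNX(ctx, "in_progress", &redis.Z{Score: now, Member: item}).Result()
		if err != nil || added == 0 {
			continue // claimed elsewhere (or errored): skip it
		}
		// ... do the actual (non-Redis) work with the item here ...
		rdb.SRem(ctx, "pending", item) // clean the item up from the pending set
	}
	// Drop claims older than 5 seconds so items aren't claimed forever.
	fiveSecondsAgo := strconv.FormatInt(time.Now().Add(-5*time.Second).Unix(), 10)
	return rdb.ZRemRangeByScore(ctx, "in_progress", "-inf", fiveSecondsAgo).Err()
}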
Work item producers simply SADD pending <work_item>, but the load they produce is miniscule compared to the consumers running the algorithm outlined above.
Currently we run this algorithm on 22 EC2 instances, each running one service. We've configured our ClusterClient to use a PoolSize of 10 (reduced from the default of 20 on these instances) for each child Client it creates. Each service has 25 goroutines performing this algorithm, and each goroutine sleeps 10ms between each invocation of the algorithm.
At a steady state with no load on the system (just a handful of heartbeat jobs being added to pending every minute) we see a maximum of ~8% EngineCPUUtilization on each Redis shard, and 1-5 new connections/minute. Overall, pretty relaxed. When this issue has triggered recently, it's happened from this steady state, not during load tests.
Our service is running on EC2 instances running Ubuntu 18.04 (Bionic), and we have tried using github.com/go-redis/redis/v8 v8.0.0, github.com/go-redis/redis/v8 v8.11.2, and github.com/go-redis/redis/v8 v8.11.4 - all 3 have run into this issue. We are now using a fork of v8.11.4 (see added commits here) with the following changes (some suggested in the previous issue, some thought of by us):
- ClusterClient.process method for logging/metrics around various anomalies during processing commands (MOVED/ASK responses, errors getting nodes, node failures). We do not see these while the issue is happening.
We're currently running with a 24-shard ElastiCache Redis cluster with TLS enabled, where each shard is a primary/replica pair of cache.r6g.large instances.
Detailed Description
N/A
Possible Implementation
N/A