mediocregopher / radix.v2

Redis client for Go
http://godoc.org/github.com/mediocregopher/radix.v2
MIT License

Optional rate limiting of new connections to help prevent churn #79

Closed jameshartig closed 6 years ago

jameshartig commented 6 years ago

We've been having an issue in production where thousands of connections are created, closed, and recreated within a few seconds; the service runs out of file descriptors (because the old connections are left in TIME_WAIT) and ends up in an unusable state. This PR adds a few things:

First, a "reserve pool" was added to Pool with a capacity of 10*size. If the main pool is full, this reserve pool temporarily holds onto a client in case it is immediately needed again. This should help prevent connection churn when more than size connections are needed temporarily. Over a period of 100 seconds the reserve pool is drained, re-using the ping timer.
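
The two-tier put/get path described above can be sketched roughly as follows. This is an illustration of the idea, not the PR's actual code: ints stand in for *redis.Client, and the names are made up.

```go
package main

import "fmt"

// pool sketches a main pool plus a reserve with 10x the capacity.
type pool struct {
	main    chan int // holds *redis.Client in the real pool
	reserve chan int
}

func newPool(size int) *pool {
	return &pool{
		main:    make(chan int, size),
		reserve: make(chan int, 10*size),
	}
}

// put tries the main pool first, then the reserve; only when both are
// full is the connection actually discarded (closed).
func (p *pool) put(c int) bool {
	select {
	case p.main <- c:
		return true
	default:
	}
	select {
	case p.reserve <- c:
		return true
	default:
		return false // both full: close the connection
	}
}

// get prefers the main pool and falls back to the reserve before the
// caller resorts to dialing a new connection.
func (p *pool) get() (int, bool) {
	select {
	case c := <-p.main:
		return c, true
	default:
	}
	select {
	case c := <-p.reserve:
		return c, true
	default:
		return 0, false // caller dials a new connection
	}
}

func main() {
	p := newPool(1)
	fmt.Println(p.put(1)) // true: goes into the main pool
	fmt.Println(p.put(2)) // true: main is full, goes into the reserve
	c, ok := p.get()
	fmt.Println(c, ok) // 1 true: main pool is drained first
}
```

Under burst load, clients that would previously have been closed (and left sockets in TIME_WAIT) instead sit in the reserve and get reused.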

Second, instead of potentially creating a new connection just to ping, we only look for a client in the pool.

Third, an optional "limited" pool can be created with a token-bucket system that only allows size clients to be created roughly every 10 seconds; the bucket is refilled at a rate of one token every 10/size seconds, again re-using the ping timer. This defaults to off since technically it's backward-incompatible, but we plan on turning it on in almost all cases. A PoolLimited bool was added to the cluster.Opts struct to enable this functionality.
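
The token-bucket mechanism can be sketched as a buffered channel: creating a connection consumes a token, and a timer adds one token back per interval. A minimal sketch, with made-up names, not the PR's implementation:

```go
package main

import "fmt"

// bucket is a minimal token bucket: take consumes a token, refill adds
// one back (what the shared ping/refill timer would do each interval).
type bucket struct {
	tokens chan struct{}
}

func newBucket(capacity, initial int) *bucket {
	b := &bucket{tokens: make(chan struct{}, capacity)}
	for i := 0; i < initial; i++ {
		b.tokens <- struct{}{}
	}
	return b
}

// take reports whether a creation token was available.
func (b *bucket) take() bool {
	select {
	case <-b.tokens:
		return true
	default:
		return false
	}
}

// refill adds a token unless the bucket is already full.
func (b *bucket) refill() {
	select {
	case b.tokens <- struct{}{}:
	default:
	}
}

func main() {
	// size=10: at most 10 creations per window; the timer would refill
	// one token every 10/size seconds (1s here).
	b := newBucket(10, 10)
	created := 0
	for i := 0; i < 25; i++ {
		if b.take() {
			created++
		}
	}
	fmt.Println("created:", created) // 10: the other 15 attempts are limited

	b.refill() // one timer tick
	fmt.Println("after refill:", b.take()) // true
}
```

This is why the "after (limited)" numbers below grow by roughly one connection per second instead of spiking.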

jameshartig commented 6 years ago

Updated to remove the signature change to pool.NewCustom and instead implemented another goroutine in a pool.NewCustomLimited method.

jameshartig commented 6 years ago

I wrote the following code to test the behavior with and without my changes and with and without a limited pool: https://gist.github.com/fastest963/1976bf549c91998f9f37435663eb66cf

before:

$ ./testpool --parallel 250 --size 10
2018/04/04 13:26:23 num connections 248
2018/04/04 13:26:23 num new connections 652
2018/04/04 13:26:23 num pings 23191
2018/04/04 13:26:23 mean ping duration 9ms
2018/04/04 13:26:24 num connections 250
2018/04/04 13:26:24 num new connections 555
2018/04/04 13:26:24 num pings 23421
2018/04/04 13:26:24 mean ping duration 9ms
2018/04/04 13:26:25 num connections 250
2018/04/04 13:26:25 num new connections 443
2018/04/04 13:26:25 num pings 22872
2018/04/04 13:26:25 mean ping duration 9ms
2018/04/04 13:26:26 num connections 250
2018/04/04 13:26:26 num new connections 309
2018/04/04 13:26:26 num pings 21653
2018/04/04 13:26:26 mean ping duration 10ms
2018/04/04 13:26:27 num connections 251
2018/04/04 13:26:27 num new connections 359
2018/04/04 13:26:27 num pings 23481
2018/04/04 13:26:27 mean ping duration 9ms
2018/04/04 13:26:28 num connections 250
2018/04/04 13:26:28 num new connections 406
2018/04/04 13:26:28 num pings 19325
2018/04/04 13:26:28 mean ping duration 11ms

after (not limited):

$ ./testpool --parallel 250 --size 10
2018/04/04 13:38:44 num connections 250
2018/04/04 13:38:44 num new connections 290
2018/04/04 13:38:44 num pings 34920
2018/04/04 13:38:44 mean ping duration 6ms
2018/04/04 13:38:45 num connections 250
2018/04/04 13:38:45 num new connections 29
2018/04/04 13:38:45 num pings 35537
2018/04/04 13:38:45 mean ping duration 6ms
2018/04/04 13:38:46 num connections 250
2018/04/04 13:38:46 num new connections 18
2018/04/04 13:38:46 num pings 35723
2018/04/04 13:38:46 mean ping duration 6ms
2018/04/04 13:38:47 num connections 251
2018/04/04 13:38:47 num new connections 1
2018/04/04 13:38:47 num pings 35136
2018/04/04 13:38:47 mean ping duration 6ms
2018/04/04 13:38:48 num connections 251
2018/04/04 13:38:48 num new connections 0
2018/04/04 13:38:48 num pings 37500
2018/04/04 13:38:48 mean ping duration 6ms
2018/04/04 13:38:49 num connections 250
2018/04/04 13:38:49 num new connections 17
2018/04/04 13:38:49 num pings 38082
2018/04/04 13:38:49 mean ping duration 5ms
2018/04/04 13:38:50 num connections 250
2018/04/04 13:38:50 num new connections 1
2018/04/04 13:38:50 num pings 37271
2018/04/04 13:38:50 mean ping duration 6ms

after (limited):

$ ./testpool --limited --parallel 250 --size 10
2018/04/04 13:23:29 num connections 20
2018/04/04 13:23:29 num new connections 20
2018/04/04 13:23:29 num pings 38896
2018/04/04 13:23:29 mean ping duration 5ms
2018/04/04 13:23:30 num connections 21
2018/04/04 13:23:30 num new connections 1
2018/04/04 13:23:30 num pings 38749
2018/04/04 13:23:30 mean ping duration 5ms
2018/04/04 13:23:31 num connections 22
2018/04/04 13:23:31 num new connections 1
2018/04/04 13:23:31 num pings 37484
2018/04/04 13:23:31 mean ping duration 6ms
2018/04/04 13:23:32 num connections 23
2018/04/04 13:23:32 num new connections 1
2018/04/04 13:23:32 num pings 36209
2018/04/04 13:23:32 mean ping duration 6ms
2018/04/04 13:23:33 num connections 24
2018/04/04 13:23:33 num new connections 1
2018/04/04 13:23:33 num pings 36280
2018/04/04 13:23:33 mean ping duration 6ms
2018/04/04 13:23:34 num connections 25
2018/04/04 13:23:34 num new connections 1
2018/04/04 13:23:34 num pings 39232
2018/04/04 13:23:34 mean ping duration 5ms

To summarize: without the changes the pool churned through hundreds of new connections per second; with the reserve pool that drops to near zero once warmed up, and the limited pool additionally caps how fast new connections can be created.

mediocregopher commented 6 years ago

I implemented some of your changes in the pool in radix.v3: https://github.com/mediocregopher/radix.v3/commit/93d26b06385ad735e6f95a64b65d65cfa8e07686

I think the option pattern is a good fit here, especially since radix.v2 is split up into separate packages, so it's even a little cleaner. By making NewCustom take a variadic options argument we could avoid a backwards-incompatible change. And ClusterOpts could take a []pool.Opt as a field maybe? Not sure there...

jameshartig commented 6 years ago

I implemented the Opt pattern much like you did in v3. I also added a headroom to the limit, because you'd probably want a buffer so you can burst above the pool size and only start limiting once things get out of hand. The headroom can be 0 if you don't want that, but we definitely do.
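
The headroom idea amounts to initializing the token bucket with size+headroom tokens, so short bursts above the pool size are allowed and limiting only kicks in past that buffer. A rough sketch with illustrative names (not the actual Opt API):

```go
package main

import "fmt"

// newLimiter returns a token channel seeded with size+headroom tokens;
// the refill timer (not shown) would top it back up at the configured
// interval. Names here are illustrative, not the PR's API.
func newLimiter(size, headroom int) chan struct{} {
	capacity := size + headroom
	limited := make(chan struct{}, capacity)
	for i := 0; i < capacity; i++ {
		limited <- struct{}{}
	}
	return limited
}

// tryCreate consumes a creation token if one is available.
func tryCreate(limited chan struct{}) bool {
	select {
	case <-limited:
		return true
	default:
		return false
	}
}

func main() {
	limited := newLimiter(10, 10)
	created := 0
	for i := 0; i < 30; i++ {
		if tryCreate(limited) {
			created++
		}
	}
	// 20 creations allowed up front (size + headroom); the rest wait for
	// the refill timer.
	fmt.Println("created:", created) // 20
}
```

This matches the numbers below: with headroom 10 and a 100ms interval, the pool bursts to roughly size+headroom connections and then grows by ~10 per second.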

jameshartig commented 6 years ago

Here are some new numbers from the test:

after (buffer size: 10*poolSize) (no creation limit)

2018/05/04 12:54:28 num connections 250
2018/05/04 12:54:28 num new connections 250
2018/05/04 12:54:28 num available 2
2018/05/04 12:54:28 num pings 28707
2018/05/04 12:54:28 mean ping duration 7ms
2018/05/04 12:54:29 num connections 250
2018/05/04 12:54:29 num new connections 0
2018/05/04 12:54:29 num available 0
2018/05/04 12:54:29 num pings 30229
2018/05/04 12:54:29 mean ping duration 7ms
2018/05/04 12:54:30 num connections 250
2018/05/04 12:54:30 num new connections 0
2018/05/04 12:54:30 num available 0
2018/05/04 12:54:30 num pings 29302
2018/05/04 12:54:30 mean ping duration 7ms
2018/05/04 12:54:31 num connections 250
2018/05/04 12:54:31 num new connections 28
2018/05/04 12:54:31 num available 0
2018/05/04 12:54:31 num pings 29613
2018/05/04 12:54:31 mean ping duration 7ms
2018/05/04 12:54:32 num connections 250
2018/05/04 12:54:32 num new connections 60
2018/05/04 12:54:32 num available 0
2018/05/04 12:54:32 num pings 26975
2018/05/04 12:54:32 mean ping duration 8ms
2018/05/04 12:54:33 num connections 250
2018/05/04 12:54:33 num new connections 1
2018/05/04 12:54:33 num available 0
2018/05/04 12:54:33 num pings 28690
2018/05/04 12:54:33 mean ping duration 8ms

after (buffer size: 10*poolSize) (limit headroom: 10, interval: 100ms)

2018/05/04 13:01:29 num connections 29
2018/05/04 13:01:29 num new connections 29
2018/05/04 13:01:29 num available 0
2018/05/04 13:01:29 num pings 35413
2018/05/04 13:01:29 mean ping duration 6ms
2018/05/04 13:01:30 num connections 39
2018/05/04 13:01:30 num new connections 10
2018/05/04 13:01:30 num available 0
2018/05/04 13:01:30 num pings 36249
2018/05/04 13:01:30 mean ping duration 6ms
2018/05/04 13:01:31 num connections 49
2018/05/04 13:01:31 num new connections 10
2018/05/04 13:01:31 num available 0
2018/05/04 13:01:31 num pings 32829
2018/05/04 13:01:31 mean ping duration 7ms
2018/05/04 13:01:32 num connections 59
2018/05/04 13:01:32 num new connections 10
2018/05/04 13:01:32 num available 0
2018/05/04 13:01:32 num pings 31136
2018/05/04 13:01:32 mean ping duration 7ms
2018/05/04 13:01:33 num connections 69
2018/05/04 13:01:33 num new connections 10
2018/05/04 13:01:33 num available 0
2018/05/04 13:01:33 num pings 30998
2018/05/04 13:01:33 mean ping duration 7ms
2018/05/04 13:01:34 num connections 79
2018/05/04 13:01:34 num new connections 10
2018/05/04 13:01:34 num available 0
2018/05/04 13:01:34 num pings 31285
2018/05/04 13:01:34 mean ping duration 7ms

jameshartig commented 6 years ago

> Some small changes in the main commit. Did you mean to include those other two commits that you'd already submitted PRs for?

Yes, unfortunately I made this PR from master, so I needed to keep it up to date with those changes so that we could keep deploying my fork instead of this branch to A/B test.

When you merge those PRs, I'll rebase this one.

jameshartig commented 6 years ago

I updated the docs file, added a test for the 0-sized pool (with a buffer), and added a test to ensure that reserve-evicted connections are added back to the main pool.

mieczkowski commented 6 years ago

@mediocregopher Is there any chance to merge this soon? It would be helpful in our project and I don't want to change the library that we are using :)

mediocregopher commented 6 years ago

Sorry that took so long, @fastest963. I'm gonna just close those other PRs; it was easier to merge them all here.

mieczkowski commented 6 years ago

@fastest963 @mediocregopher It would be nice to have a wait timeout in CreateLimit(headroom int, createInterval time.Duration, waitTimeout time.Duration). If the request timeout is, for example, 2 seconds, wait at most 2 seconds for a connection and otherwise return a timeout error. So in:

func (p *Pool) Get() (*redis.Client, error)

should be something like:

case <-p.limited:
    return p.df(p.Network, p.Addr)
case <-time.After(p.po.createWaitTimeout):
    return nil, errors.New("pool timeout")

What do you think about that? Or maybe context.Context support with a deadline/timeout? But that would require more changes to the code...

jameshartig commented 6 years ago

@mediocregopher did something similar in v3 and I can add that as a separate option function, but it's probably better to instead add a GetContext method that accepts a context.

mediocregopher commented 6 years ago

Might as well add that third argument. I'd rather break the API now, when it's only been a few days since merging and probably not that many people are using it, than add an ugly extra method.

I don't think messing around with contexts is a good idea, as nothing else in this project uses them.

jameshartig commented 6 years ago

@mieczkowski @mediocregopher added a new option here: https://github.com/mediocregopher/radix.v2/pull/88