Thanks a bunch for your review. I know this kind of code is really complex to reason about so I'm really glad to have another pair of eyes on it! Let me know if the above makes sense or if I just sound crazy...
The test failure in TestStreamsNextNoEventTimeout (only on Go 1.9 / MongoDB 3.6) looks unrelated...
Hi @KJTsanaktsidis,
This is a great improvement to the limit code, it seems a lot of people use the pool limit so this will definitely be a big win!
I'll try and get something together to drive this code path properly - all my existing benchmark/verification setup limits concurrency at the application layer - so it might take a couple of days, I hope you understand.
Thanks for taking the time to improve this - it's really appreciated - I'll get back to you!
Dom
No worries, thanks for your effort picking up this project and maintaining this important piece of infrastructure for Go/Mongo users!
Looks like I got some conflicts from #116 - I'll push up a fix shortly. I'm still hoping to get this into production for our product in the next couple of weeks and give you a report on how it went.
@domodwyer I added a couple of commits to this PR to report metrics around how long goroutines spend waiting for the pool limit. In order to deploy this to production, we want to be able to report pool wait times into a Datadog histogram. Unfortunately, the stats mechanism in mgo is only really good for counters and gauges.
In the last commit I added a mechanism for mgo to send event-based information to the library user over a set of channels; the user of the library can optionally watch these channels and do something with the events (we are sending them to statsd for histogram/percentile estimation).
We need this information in some form to know whether the connection limit is going to have an adverse effect, but I'm totally flexible on the API design if you have any better ideas. The channel solution has the benefit of being low-overhead, optional, and unopinionated about what to do with event data. It also shouldn't be a breaking API change to add extra channels to the struct to publish different kinds of events. I did consider doing histogram bucketing in the stats code, but that raises lots of questions about what buckets to use, what percentiles to export, etc.
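To make the idea concrete, here's a rough sketch of the shape I have in mind (names like PoolEvents and SocketAcquired are illustrative only, not the exact API in the commit):

```go
package main

import (
	"log"
	"time"
)

// Illustrative sketch only. The idea: the driver does non-blocking sends of
// wait durations on buffered channels, and the application drains them and
// forwards the values to a histogram (statsd/Datadog in our case).
type PoolEvents struct {
	SocketAcquired chan time.Duration // time spent waiting for pool headroom
	PoolTimedOut   chan time.Duration // waits that gave up after the timeout
}

func main() {
	events := PoolEvents{
		SocketAcquired: make(chan time.Duration, 1024),
		PoolTimedOut:   make(chan time.Duration, 1024),
	}

	// Application-side consumer; real code would feed a statsd histogram.
	go func() {
		for d := range events.SocketAcquired {
			log.Printf("pool wait: %v", d)
		}
	}()

	// The library-side send would be non-blocking, so a slow consumer can
	// never stall query execution; the event is dropped instead.
	select {
	case events.SocketAcquired <- 3 * time.Millisecond:
	default:
	}

	time.Sleep(10 * time.Millisecond) // let the consumer drain in this toy example
}
```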
Hi @KJTsanaktsidis. I like the metrics idea. Here is a suggestion though. What do you think about making the Stats an interface? Something like
type MgoMetrics interface {
    OnSocketAcquired(m SocketAcquireMetric)
    OnPoolTimeout(m SocketAcquireMetric)
}
You could modify your current implementation to send data on the channel on every call of the interface, and provide a default no-op implementation for library users who don't want stats.
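To illustrate (the names below are made up, and SocketAcquireMetric stands in for whatever payload the PR actually defines), the no-op default and a channel-backed implementation could look roughly like this:

```go
package metrics

import "time"

// SocketAcquireMetric stands in for whatever event payload the PR defines;
// assume it carries the time spent waiting for the pool.
type SocketAcquireMetric struct {
	Wait time.Duration
}

// MgoMetrics is the suggested interface from above.
type MgoMetrics interface {
	OnSocketAcquired(m SocketAcquireMetric)
	OnPoolTimeout(m SocketAcquireMetric)
}

// NopMetrics is the default used when the library user doesn't want stats.
type NopMetrics struct{}

func (NopMetrics) OnSocketAcquired(SocketAcquireMetric) {}
func (NopMetrics) OnPoolTimeout(SocketAcquireMetric)    {}

// ChanMetrics keeps the channel-based behaviour as one possible
// implementation; the user picks the buffer size when creating the channel.
type ChanMetrics struct {
	Acquired chan SocketAcquireMetric
}

func (c ChanMetrics) OnSocketAcquired(m SocketAcquireMetric) {
	select {
	case c.Acquired <- m:
	default: // drop rather than block the driver
	}
}

func (c ChanMetrics) OnPoolTimeout(SocketAcquireMetric) {}
```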
Your current implementation could stay as the reference one; it's perfectly valid. I am suggesting this because:
- Setting an arbitrary chan size rubs me the wrong way. With the interface approach, users could decide for themselves what channel sizes to use.
- I don't think the metrics (while very useful!) are an integral part of the database API. Using the interface approach is IMHO cleaner, and the reference metrics implementation could even live in another package, somewhere under mgo/metrics?
- Users would have an easier time integrating the metrics into their own systems if they wanted a single struct that gathers and processes metrics from the whole system and implements the mgo metrics interface among other things.
I did consider the interface approach - unfortunately, if we did that, adding a new kind of event to it would be a breaking change since people's structs would no longer match the expanded interface. In a similar vein, I considered a struct with a whole bunch of callback-members you could assign, like this:
type StatsEventsCallbacks struct {
    OnSocketAcquired func(time.Duration)
    OnPoolTimeout    func(time.Duration)
}

// library user code
callbacks := &StatsEventsCallbacks{
    OnSocketAcquired: func(t time.Duration) {
        // something
    },
}
mgo.SetStatsEventsCallbacks(callbacks)
That way, if new events were added, users could just not fill them in on their structs, so it wouldn't be a breaking change. Ultimately the channel sends just felt like more idiomatic Go to me, but if you prefer the callback approach I'm more than happy to roll with that instead (and maybe just wrap it into a set of channels in my application code).
If the only thing that concerns you about the channel approach is the arbitrary channel size, I'm also happy to let the caller specify that in some kind of SetStatsEvents(enabled bool, bufferSize int) call, and to separate it out of the SetStats/ResetStats mechanism entirely.
Thanks again for the feedback by the way :)
FYI - I have shipped this change into production for my team today. I will report back any findings in maybe a week or so after it's baked for a bit.
Hi @KJTsanaktsidis
Again, sorry for disappearing as I was out of the country, but it seems @szank has been keeping this moving along 👍
I'm a little concerned this has grown the scope of the PR a fair bit - I understand you needed the metrics to ensure the fix worked as expected, but as you correctly point out there's quite a lot of complexity around the existing and proposed metrics (how to collect them, whose job it is to aggregate, if/how to handle metric drops, etc.).
You're totally right, though: it's time the stats were improved. I think it might serve the public mgo better to tackle that as a separate PR, however, maybe in a different branch until we're happy with the design, so we can get this connection pooling change merged sooner rather than later. I might have an idea for a lockless solution that should be backwards compatible, but I'm currently trying to stave off jet lag by keeping myself busy, so I might not be fit for the job ;)
Thanks again for all your hard work and solid solutions.
Dom
Yeah, I totally understand the metrics raise some broader design questions about how that should be done. Sounds like I should back that bit out tomorrow; it's not a huge patch for us to carry out of tree, and it won't really stop us switching to an upstream solution once that's baked.
Hopefully this metrics stuff has been a good spike to explore some ideas :)
Other than letting this bake in our production environment for a bit, is there anything else I should do here do you think?
FWIW I’d be keen to hear your thoughts on the metrics when you’ve put them together too :) I’ll keep a look out here!
No, I think the connections bit of the PR is all good! No problems with a merge - I'll run a test during the normal release process, and it should be good to go for a release.
Regarding the metrics I'd love it if you helped collaborate on the design/implementation/whatever-you-like - I'll open an issue for the improvement design in the coming days - even just a "this would work for us" or a "we also need X" would be most appreciated!
Dom
Alright, I've done some rebasing. I've left in "Add stats for connection pool timings", which simply adds some counters to the existing stats struct, but taken out "send events through a channel", which added the new API surface for receiving individual events. If you prefer I can remove the counters too, but they don't add much in the way of new API surface, so I don't think they create much tech debt, and they're useful enough to leave in.
Honestly, this is great work - thanks @KJTsanaktsidis.
Dom
Thank you for your help!
Just an update on how this is going in our production usage. The pool limit kicked in a couple of times this week with our app, and it seems my change had the desired effect - the wait time hit a max of ~3ms, whereas with the old code it would have been at least 100ms due to the sleep-polling. This can be seen in the attached Datadog graph (sorry about the redactions); when the app goes to open a whole bunch of new connections, the limit kicks in and queues everything up for a few ms:
All in all, I'm pretty happy with how this is going. (Subsequent to this graph, I turned on maxIdleTimeMs as well, which makes the surge in connections drop off after a couple of minutes - thanks @gnawux!)
That's cool! Thanks for sharing (and thanks @gnawux too!).
I'll look at getting our staging environment up and running to check for any performance regressions, but I don't expect to find any - after that I'll cut a release.
Dom
The current behaviour when the poolLimit is reached and a new connection is required is to poll every 100ms to see whether there is now headroom to make a new connection. This adds tremendous latency to the limit-hit path.
This commit changes the checkout behaviour to watch on a condition variable for connections to become available, and the checkin behaviour to signal this variable. This should allow waiters to use connections immediately after they become available.
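In sketch form, the pattern is roughly this (a simplified stand-in to show the idea, not the actual diff):

```go
package main

import "sync"

// checkout blocks on a condition variable until there is headroom, and
// checkin signals it, so waiters wake as soon as a connection frees up
// instead of polling every 100ms.
type pool struct {
	mu        sync.Mutex
	available *sync.Cond
	inUse     int
	limit     int
}

func newPool(limit int) *pool {
	p := &pool{limit: limit}
	p.available = sync.NewCond(&p.mu)
	return p
}

func (p *pool) checkout() {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.inUse >= p.limit {
		p.available.Wait() // releases p.mu while waiting
	}
	p.inUse++
}

func (p *pool) checkin() {
	p.mu.Lock()
	p.inUse--
	p.mu.Unlock()
	p.available.Signal() // wake one waiter immediately
}

func main() {
	p := newPool(2)
	p.checkout()
	p.checkout()
	go p.checkin()
	p.checkout() // returns as soon as checkin signals, no polling delay
}
```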
A new parameter, PoolTimeout, is also added to DialInfo; it is the maximum time that clients will wait for connection headroom to become available. By default this is unlimited.
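Usage would look something like the following (sketch only; the field names as described above and the globalsign/mgo import path are assumptions to double-check against the final API):

```go
package main

import (
	"log"
	"time"

	"github.com/globalsign/mgo"
)

func main() {
	info := &mgo.DialInfo{
		Addrs:       []string{"localhost:27017"},
		Timeout:     10 * time.Second,
		PoolLimit:   64,              // existing per-server socket limit
		PoolTimeout: 5 * time.Second, // new: max wait for pool headroom; zero means wait forever
	}
	session, err := mgo.DialWithInfo(info)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()
}
```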
Full disclosure - I haven't actually tried this in our production environment yet. The only testing it has so far is that it passes the regression tests. I'm pull-requesting this early to get some feedback on the design of this change and to see whether it would be useful for other people :). I intend to try and get this into production sometime in the next couple of weeks.