riaken / riaken-core

Go Protocol Buffer driver for the Riak distributed database
Other
31 stars 7 forks source link

Adding a Ping call to the session check loop #3

Closed bfosberry closed 9 years ago

bfosberry commented 9 years ago

This code fixes an issue were tcp keepalive does not keep a connection open above the ELB timeout, and tcp connections are dropped. The next request against that session will fail, mark the session as unavailable, and the session will reconnect for the next request, however a number of requests do fail in this time (equal to the pool size).

An alternaitve to this change would be to work some retry logic into the request cycle of the package to retry should the session be discovered to be unavailable when the request fails.

Review on Reviewable

mpbsbr commented 9 years ago

OK, I'm not quite getting this, and it's surely easier to just ask: a quick scan and it looks like s.check() ultimately does an s.ping() as well. We never get to s.check(), something in s.check() doesn't lead to the s.ping()?

bfosberry commented 9 years ago

While s.check does ultimately do a ping, c.check only calls s.check if it has already been marked as bad. An alternative to this PR would be to ALWAYS call s.check inside c.check however there is a lot more logic in s.check than just ping. Calling Ping generates traffic which performs a manual keepalive us, but I think you're right, a better solution would be something like setting s.active to s.Ping inside the c.check loop.

boj commented 9 years ago

@bfosberry Maybe it would be more sane to set http://golang.org/pkg/net/#TCPConn.SetKeepAlivePeriod to some configurable value inside Session#Dial instead?

bfosberry commented 9 years ago

We considered that, however could not find what the default value was so never got round to testing it. I'll check it tomorrow and see if it works.

boj commented 9 years ago

I was poking around the 'net about ELB timeouts. Seems poorly documented, but 75s looks like a decent number according to the lower part of this blog post http://engineering.chartbeat.com/2014/02/12/part-2-lessons-learned-tuning-tcp-and-nginx-in-ec2/ or you can apparently open a support ticket and ask for a custom value.

boj commented 9 years ago

@bfosberry Off topic, but are you guys using this in production yet?

bfosberry commented 9 years ago

As of about 20 mins ago, yes On Nov 13, 2014 6:04 PM, "Brian Jones" notifications@github.com wrote:

@bfosberry https://github.com/bfosberry Off topic, but are you guys using this in production yet?

— Reply to this email directly or view it on GitHub https://github.com/riaken/riaken-core/pull/3#issuecomment-62989770.

boj commented 9 years ago

@bfosberry haha, wow. You are officially the first production case I know of then. Good luck!

bfosberry commented 9 years ago

We are able to configure elb's timeout, and were able to match it to this issue, but for some reason tcp keep alive isnt working with it. We will try setting a timeout value tomorrow.

So we compared this project to a number of other riak clients, including our previous one, and decided it was a great fit for our needs and also seemed to be well written compared to the other options. We plan on using the yokozuma features when they are complete also.

We think that the reason we had not seen this issue before is because our previous client was traffic heavy, constantly re-dialing, and so connections did not remain open long enough to time out. As such when we switched to this client we noticed a performance boost also.

boj commented 9 years ago

Glad to hear you guys like this driver. Yokozuna is on my soon-to-do list. CRDTs are already in HEAD.

I'm curious about the TCP timeout though. By default this driver pings all the nodes every 5s in the background which seems fairly reasonable given a standard timeout of 60s. Is your ELB timing connections out faster than 5s? Is there anything else between your application and databases?

bfosberry commented 9 years ago

Well we realized riaken does not ping every 5 seconds, only reconnects if a failed request puts the session into a failed state, thus the added ping. Why keepalive is not working is beyond me, but I'll test that today

On Nov 14, 2014 12:33 AM, "Brian Jones" notifications@github.com wrote:

Glad to hear you guys like this driver. Yokozuna is on my soon-to-do list. CRDTs are already in HEAD.

I'm curious about the TCP timeout though. By default this driver pings all the nodes every 5s in the background which seems fairly reasonable given a standard timeout of 60s. Is your ELB timing connections out faster than 5s? Is there anything else between your application and databases?

— Reply to this email directly or view it on GitHub.

boj commented 9 years ago

Oops, for some reason I thought I was doing a consistent background ping every 5s this whole time. I'll merge this in since it actually adds the correct behaviour.

boj commented 9 years ago

There, that changes it to s.active = s.Ping()

bfosberry commented 9 years ago

Thank @boj!