Closed: tamird closed this 9 years ago
@mperham looks like travis isn't set up for this repo :(
That's nuts. Timeouts bypass ensures???
Yup, nuts is an understatement
Btw I never recommend using Timeout.timeout in a multithreaded system. Quirks like this are exactly why. Use socket timeouts only.
That's sound advice, but Timeout.timeout is sometimes used to wrap multiple operations where individual socket timeouts are insufficiently expressive.
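To illustrate the distinction with plain sockets (a sketch only; it assumes a Redis-style server listening locally on 6379):

require 'socket'
require 'timeout'

# Per-operation socket timeouts: each call is individually bounded and the
# thread is never interrupted at an arbitrary point.
sock = Socket.tcp('127.0.0.1', 6379, connect_timeout: 1)
sock.write("PING\r\n")
raise 'read timed out' unless IO.select([sock], nil, nil, 1)
puts sock.gets

# What Timeout.timeout gets reached for: a single deadline spanning several
# dependent operations, at the cost of the async interrupt discussed here.
Timeout.timeout(2) do
  sock.write("GET a\r\n")
  puts sock.gets
  sock.write("GET b\r\n")
  puts sock.gets
end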
On another note, this fix is insufficient because the connection being checked back in can still be corrupted -- another commit on the way.
@mperham PTAL
This takes a different approach now, and simply discards any connection which is deemed unsafe to reuse.
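Roughly, the shape of the change is the following (an illustration only, not the actual diff; discard_current_connection is a made-up name):

# Illustration: only check the connection back in if the block ran to
# completion; otherwise drop it so the pool builds a fresh one next checkout.
def with(options = {})
  conn = checkout(options)
  completed = false
  begin
    result = yield conn
    completed = true
    result
  ensure
    completed ? checkin : discard_current_connection # hypothetical helper
  end
end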
Ugh, I hate this type of issue. Near impossible to test or understand. Is the best I can do to cut a new release and wait for issues?
Here's a script that reliably reproduces the issue: https://gist.github.com/ggilder/b531bdceea44a449c840
Against current connection_pool (or, of course, switched to just use one Redis connection), it will consistently return mismatched responses within 1-3 timeouts. With @tamird's patch, it throws away the connection after nearly every timeout, but never returns a bad response.
@mperham It's not too bad to test, but it does make the test suite more complicated.
We do it through acceptance tests against an HTTP server, but you can do the same thing with a basic TCPServer that sleeps and echoes. Quick dry-coded example:
require 'socket'
require 'connection_pool'

# Slow echo server: each reply is delayed long enough to trip a client timeout.
server_thread = Thread.new do
  server = TCPServer.new("127.0.0.1", 10_000)
  loop do
    client = server.accept
    Thread.new(client) do |c|
      while (line = c.gets)
        sleep 4
        c.puts(line)
      end
    end
  end
end
sleep 0.1 # give the server a moment to bind

pool = ConnectionPool.new { TCPSocket.new("127.0.0.1", 10_000) }

# First checkout: the thread is killed before the slow server replies,
# leaving the "1" response unread on the socket.
thread = Thread.new do
  pool.with do |conn|
    conn.puts("1")
    conn.readline
  end
end
thread.join(1)
thread.kill

# Second checkout (inside an RSpec example): if the stale connection were
# reused, we'd read back "1" here instead of "2".
pool.with do |conn|
  conn.puts("2")
  expect(conn.readline.chomp).to eq("2")
end
If it's handled properly, you'll get back "2" rather than "1".
Thanks @mperham! Could you set up Travis for this repo? It seems that some of the tests have rotted under JRuby...
Who wants commit bit for this? You guys can set it up yourselves. @tamird?
yup, sounds good
Done, please stick with PRs rather than committing directly to master.
You got it. Thanks for the fast triage here!
@mperham turns out I can't set up travis because I need administrative access: http://docs.travis-ci.com/user/getting-started/#Step-two%3A-Activate-GitHub-Webhook
Oy vey, activated.
BTW I'll cut a new release on Monday unless y'all find something wrong.
I'd appreciate it if someone could check my work since I'm super jet-lagged, but here is a script that will make the connection pool fail to correctly check the connection back in.
require 'connection_pool'
require 'thread'
require 'monitor'

# Add a latch class so we can manipulate thread order
class Latch
  def initialize count = 1
    @count = count
    @lock = Monitor.new
    @cv = @lock.new_cond
  end

  def release
    @lock.synchronize do
      @count -= 1 if @count > 0
      @cv.broadcast if @count.zero?
    end
  end

  def await
    @lock.synchronize do
      @cv.wait_while { @count > 0 }
    end
  end
end

memcached = Class.new(ConnectionPool) {
  def checkin
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end

  def pop_connection
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end
}.new(size: 5, timeout: 5) { Object.new }

Thread.abort_on_exception = true
queue = SizedQueue.new 2

2.times.map {
  Thread.new {
    loop do
      job_start_latch  = Latch.new
      job_finish_latch = Latch.new
      worker = Thread.new do
        # This latch is to ensure the thread is up and running before we have
        # a different thread attempt to kill it.
        job_start_latch.await
        memcached.with do |connection|
          # This latch is to notify the "killer" thread that the work is done,
          # and we're ready to be killed
          job_finish_latch.release
        end
      end
      queue << [job_start_latch, job_finish_latch, worker]
    end
  }
}

2.times.map {
  Thread.new {
    loop do
      start, finish, thread = *queue.pop
      start.release # let the target thread know that we're ready to kill it
      finish.await  # wait on the target thread to tell us it's time
      thread.kill   # kill the target thread
    end
  }
}.each(&:join)
You may notice the weird subclass. I added a puts so that I could give the scheduler a chance to switch to a different thread, which would increase the odds of showing off the bug. Without the puts, I couldn't get the scheduler to switch, so I never saw the error.
If you run this against master, or against #70, you'll see a timeout error on checkout from timed_stack. I think adding the puts to encourage thread switching is valid, but I'm not 100% sure.
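If the output is too noisy, Thread.pass would be the explicit way to yield the scheduler at the same points; this is only a sketch and I haven't verified it reproduces the interleaving as reliably as the puts does:

memcached = Class.new(ConnectionPool) {
  def checkin
    Thread.pass # explicitly offer another thread a chance to run
    super
  end

  def pop_connection
    Thread.pass
    super
  end
}.new(size: 5, timeout: 5) { Object.new }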
@tenderlove @tamird's work certainly helps a lot. I've gotten half a dozen reports from Sidekiq users about really weird Redis connection behavior (here's the latest) over the last 6 months. I suspect this fix will solve their problems in all but the rarest of edge cases, i.e. maybe it gets us from 99% to 99.99% reliability. I do want to get to 100% but this is still forward progress.
If anyone has ideas on what the right solution is long-term, please chime in on #70. I want to avoid splitting the conversation across two issues.
I've had several Sidekiq users complain that this change is leading to connection exhaustion. I believe the issue might be due to discard!. The connection pool does not close the connections it discards, so they rely on Ruby's GC to run and close the socket, or on the socket naturally timing out. I'm going to put in a bit of a hack to work around the issue.
As a suggestion, discard! already calls shutdown. Any reason to not just pass a shutdown block in Sidekiq that does the same thing? Avoids having to do a special case.
@zanker The shutdown block cannot be passed right now. It's only passed when shutting down the entire pool.
@mperham Ah! Right, sorry forgot that's only set on shutdown not all the time.
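For reference, the whole-pool form being discussed looks roughly like this (a sketch; the TCPSocket is just a stand-in for a real Redis client):

require 'connection_pool'
require 'socket'

pool = ConnectionPool.new(size: 5, timeout: 5) { TCPSocket.new('127.0.0.1', 6379) }

# The shutdown block is only invoked when the whole pool is torn down,
# not when a single connection is discarded after an interrupted checkout.
pool.shutdown { |conn| conn.close }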
We've seen this a lot in production, where a well-intentioned Timeout.timeout will interrupt ensure blocks and cause serious problems. See: http://headius.blogspot.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
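A minimal, self-contained sketch of that failure mode (stdlib only, nothing connection_pool-specific):

require 'timeout'

# The async Timeout::Error can land while the ensure block itself is still
# running, so the cleanup never finishes.
def use_connection
  conn = 'connection'
  sleep 1          # the work outlives the deadline
ensure
  sleep 1          # cleanup (think: checking the connection back in) takes time
  puts "checked #{conn} back in"  # skipped when the raise lands in the sleep above
end

begin
  Timeout.timeout(1.5) { use_connection }
rescue Timeout::Error
  puts 'timed out; the cleanup was cut short'
end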