Closed: tamird closed this 9 years ago
@mperham looks like travis isn't set up for this repo :(
That's nuts. Timeouts bypass ensures???
Yup, nuts is an understatement
Btw I never recommend using Timeout.timeout in a multithreaded system. Quirks like this are exactly why. Use socket timeouts only.
That's sound advice, but Timeout.timeout is sometimes used to wrap multiple operations where individual socket timeouts are insufficiently expressive.
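To illustrate the distinction with plain sockets (a sketch only; it assumes a Redis-style server listening locally on 6379):

require 'socket'
require 'timeout'

# Per-operation socket timeouts: each call is individually bounded and the
# thread is never interrupted at an arbitrary point.
sock = Socket.tcp('127.0.0.1', 6379, connect_timeout: 1)
sock.write("PING\r\n")
raise 'read timed out' unless IO.select([sock], nil, nil, 1)
puts sock.gets

# What Timeout.timeout gets reached for: a single deadline spanning several
# dependent operations, at the cost of the async interrupt discussed here.
Timeout.timeout(2) do
  sock.write("GET a\r\n")
  puts sock.gets
  sock.write("GET b\r\n")
  puts sock.gets
end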
On another note, this fix is insufficient because the connection being checked back in can still be corrupted -- another commit on the way.
@mperham PTAL
This takes a different approach now, and simply discards any connection which is deemed unsafe to reuse.
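Roughly, the shape of the change is the following (an illustration only, not the actual diff; discard_current_connection is a made-up name):

# Illustration: only check the connection back in if the block ran to
# completion; otherwise drop it so the pool builds a fresh one next checkout.
def with(options = {})
  conn = checkout(options)
  completed = false
  begin
    result = yield conn
    completed = true
    result
  ensure
    completed ? checkin : discard_current_connection # hypothetical helper
  end
end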
Ugh, I hate this type of issue. Near impossible to test or understand. Is the best I can do to cut a new release and wait for issues?
Here's a script that reliably reproduces the issue: https://gist.github.com/ggilder/b531bdceea44a449c840
Against current connection_pool (or, of course, switched to just use one Redis connection), it will consistently return mismatched responses within 1-3 timeouts. With @tamird's patch, it throws away the connection after nearly every timeout, but never returns a bad response.
@mperham It's not too bad to test, but it does make the test suite more complicated.
We do it through acceptance tests against an HTTP server, but you can do the same thing with a basic TCPServer that sleeps and echoes. Quick dry-coded example:
require 'socket'
require 'connection_pool'

# Slow echo server: each reply is delayed long enough to trip a client timeout.
server_thread = Thread.new do
  server = TCPServer.new("127.0.0.1", 10_000)
  loop do
    client = server.accept
    Thread.new(client) do |c|
      while (line = c.gets)
        sleep 4
        c.puts(line)
      end
    end
  end
end
sleep 0.1 # give the server a moment to bind

pool = ConnectionPool.new { TCPSocket.new("127.0.0.1", 10_000) }

# First checkout: the thread is killed before the slow server replies,
# leaving the "1" response unread on the socket.
thread = Thread.new do
  pool.with do |conn|
    conn.puts("1")
    conn.readline
  end
end
thread.join(1)
thread.kill

# Second checkout (inside an RSpec example): if the stale connection were
# reused, we'd read back "1" here instead of "2".
pool.with do |conn|
  conn.puts("2")
  expect(conn.readline.chomp).to eq("2")
end
If it's handled properly, you'll get back "2" rather than "1".
Thanks @mperham! Could you set up Travis for this repo? It seems that some of the tests have rotted under JRuby...
Who wants commit bit for this? You guys can set it up yourselves. @tamird?
yup, sounds good
Done, please stick with PRs rather than committing directly to master.
You got it. Thanks for the fast triage here!
@mperham turns out I can't set up travis because I need administrative access: http://docs.travis-ci.com/user/getting-started/#Step-two%3A-Activate-GitHub-Webhook
Oy vey, activated.
BTW I'll cut a new release on Monday unless y'all find something wrong.
I'd appreciate it if someone could check my work since I'm super jet-lagged, but here is a script that will make the connection pool fail to correctly check the connection back in.
require 'connection_pool'
require 'thread'
require 'monitor'

# Add a latch class so we can manipulate thread order
class Latch
  def initialize count = 1
    @count = count
    @lock = Monitor.new
    @cv = @lock.new_cond
  end

  def release
    @lock.synchronize do
      @count -= 1 if @count > 0
      @cv.broadcast if @count.zero?
    end
  end

  def await
    @lock.synchronize do
      @cv.wait_while { @count > 0 }
    end
  end
end

memcached = Class.new(ConnectionPool) {
  def checkin
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end

  def pop_connection
    # add a puts so we can give the scheduler a chance to switch
    puts __method__
    super
  end
}.new(size: 5, timeout: 5) { Object.new }

Thread.abort_on_exception = true
queue = SizedQueue.new 2

2.times.map {
  Thread.new {
    loop do
      job_start_latch  = Latch.new
      job_finish_latch = Latch.new
      worker = Thread.new do
        # This latch is to ensure the thread is up and running before we have
        # a different thread attempt to kill it.
        job_start_latch.await
        memcached.with do |connection|
          # This latch is to notify the "killer" thread that the work is done,
          # and we're ready to be killed
          job_finish_latch.release
        end
      end
      queue << [job_start_latch, job_finish_latch, worker]
    end
  }
}

2.times.map {
  Thread.new {
    loop do
      start, finish, thread = *queue.pop
      start.release # let the target thread know that we're ready to kill it
      finish.await  # wait on the target thread to tell us it's time
      thread.kill   # kill the target thread
    end
  }
}.each(&:join)
You may notice the weird subclass. I added a puts so that I could give the scheduler a chance to switch to a different thread, which would increase the odds of showing off the bug. Without the puts, I couldn't get the scheduler to switch, so I never saw the error.
If you run this against master, or against #70, you'll see a timeout error on checkout from timed_stack. I think adding the puts to encourage thread switching is valid, but I'm not 100% sure.
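If the output is too noisy, Thread.pass would be the explicit way to yield the scheduler at the same points; this is only a sketch and I haven't verified it reproduces the interleaving as reliably as the puts does:

memcached = Class.new(ConnectionPool) {
  def checkin
    Thread.pass # explicitly offer another thread a chance to run
    super
  end

  def pop_connection
    Thread.pass
    super
  end
}.new(size: 5, timeout: 5) { Object.new }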
@tenderlove @tamird's work certainly helps a lot. I've gotten half a dozen reports from Sidekiq users about really weird Redis connection behavior (here's the latest) over the last 6 months. I suspect this fix will solve their problems in all but the rarest of edge cases, i.e. maybe it gets us from 99% to 99.99% reliability. I do want to get to 100% but this is still forward progress.
If anyone has ideas on what the right solution is long-term, please chime in on #70. I want to avoid splitting the conversation across two issues.
I've had several Sidekiq users complain that this change is leading to connection exhaustion. I believe the issue might be due to discard!. The connection pool does not close the connections it discards, so they rely on Ruby's GC to run and close the socket, or on the socket naturally timing out. I'm going to put in a bit of a hack to work around the issue.
As a suggestion, discard! already calls shutdown. Any reason to not just pass a shutdown block in Sidekiq that does the same thing? Avoids having to do a special case.
@zanker The shutdown block cannot be passed right now. It's only passed when shutting down the entire pool.
@mperham Ah! Right, sorry forgot that's only set on shutdown not all the time.
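For reference, the whole-pool form being discussed looks roughly like this (a sketch; the TCPSocket is just a stand-in for a real Redis client):

require 'connection_pool'
require 'socket'

pool = ConnectionPool.new(size: 5, timeout: 5) { TCPSocket.new('127.0.0.1', 6379) }

# The shutdown block is only invoked when the whole pool is torn down,
# not when a single connection is discarded after an interrupted checkout.
pool.shutdown { |conn| conn.close }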
We've seen this a lot in production, where a well-intentioned Timeout.timeout will interrupt ensure blocks and cause serious problems. See: http://headius.blogspot.com/2008/02/rubys-threadraise-threadkill-timeoutrb.html
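A minimal, self-contained sketch of that failure mode (stdlib only, nothing connection_pool-specific):

require 'timeout'

# The async Timeout::Error can land while the ensure block itself is still
# running, so the cleanup never finishes.
def use_connection
  conn = 'connection'
  sleep 1          # the work outlives the deadline
ensure
  sleep 1          # cleanup (think: checking the connection back in) takes time
  puts "checked #{conn} back in"  # skipped when the raise lands in the sleep above
end

begin
  Timeout.timeout(1.5) { use_connection }
rescue Timeout::Error
  puts 'timed out; the cleanup was cut short'
end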