iconara / cql-rb

Cassandra CQL 3 binary protocol driver for Ruby

(ThreadError) "Attempt to unlock a mutex which is not locked" #68

Closed rajofchennai closed 10 years ago

rajofchennai commented 10 years ago

I am getting this error randomly from my application. I am using Ruby 2.0.0. It happens when I use cql-rb in a threaded environment (Sidekiq).

A ThreadError occurred in background at 2014-01-10 07:26:44 UTC :

  Attempt to unlock a mutex which is not locked
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'

  -------------------------------
Backtrace:
-------------------------------

  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `block in pop'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `handle_interrupt'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `pop'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/future.rb:344:in `value'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `block in execute'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:8:in `synchronous_backtrace'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `execute'
iconara commented 10 years ago

I'm not sure this is a bug in cql-rb; it looks like it happens entirely inside Queue#pop. The mutex that complains is owned entirely by the Queue instance. Stranger still, the error is raised by Mutex#synchronize, which should guarantee that locking and unlocking are done safely.
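For reference, the ThreadError itself is easy to produce with the stdlib alone; unlocking a Mutex that is not held raises exactly this message, no cql-rb involved:

```ruby
mutex = Mutex.new
begin
  mutex.unlock               # nobody holds the lock
rescue ThreadError => e
  puts e.message             # "Attempt to unlock a mutex which is not locked"
end
```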

Is there any part of this application that forks? Does it run in a preforking server like Unicorn or Passenger? Forking could explain the issue, but if your server forks you should have seen issues with closed sockets too.

iconara commented 10 years ago

(these are just some notes that may be useful for tracking this down later)

This is the line that breaks: https://github.com/ruby/ruby/blob/v2_0_0_247/lib/thread.rb#L188

Mutex#synchronize: https://github.com/ruby/ruby/blob/v2_0_0_247/thread.c#L4517
Mutex#sleep: https://github.com/ruby/ruby/blob/v2_0_0_247/thread.c#L4468

The chain of calls going to the mutex should look like: Queue#pop → Mutex#synchronize → ConditionVariable#wait → Mutex#sleep.
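As a sketch of that chain, here is a condensed pure-Ruby version of what Ruby 2.0's lib/thread.rb Queue does (class and variable names are simplified; the real implementation also does waiter bookkeeping and interrupt handling):

```ruby
# Condensed sketch of the Queue#pop path in Ruby 2.0's lib/thread.rb:
# pop -> Mutex#synchronize -> ConditionVariable#wait -> Mutex#sleep
# (on Ruby < 2.5, require 'thread' first; these are core classes since 2.5)
class MiniQueue
  def initialize
    @items = []
    @mutex = Mutex.new
    @cond  = ConditionVariable.new
  end

  def push(obj)
    @mutex.synchronize do
      @items << obj
      @cond.signal                            # wake one thread sleeping in pop
    end
  end

  def pop
    @mutex.synchronize do                     # the synchronize at thread.rb:188
      @cond.wait(@mutex) while @items.empty?  # unlock, sleep, relock on wake
      @items.shift
    end
  end
end
```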

The error could be raised from one of two places: either when the condition variable calls Mutex#sleep, or when Mutex#synchronize exits. It looks to me like it's the latter, since the error comes from the line that calls #synchronize and not from the line that calls #sleep.

There is a window of opportunity for race conditions in that rb_thread_run can be called (from another thread) when the thread is in rb_mutex_sleep and has released the lock, but before it has called rb_sleep_forever. On the other hand, wouldn't the ensure handler (i.e. rb_mutex_lock) be called regardless? The sleep would be a no-op, but the ensure would still lock the mutex, so why would the mutex remain unlocked?
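The rb_thread_run side of that race can be observed from plain Ruby: Thread#run wakes a thread parked in a bare sleep, and the sleep simply returns early:

```ruby
# Thread#run (rb_thread_run in C) wakes a thread parked in a plain sleep,
# analogous to interrupting rb_sleep_forever; execution then continues.
t = Thread.new do
  sleep            # park "forever"
  :woken
end
sleep 0.01 until t.status == 'sleep'  # wait until it is actually parked
t.run                                 # wake it from another thread
puts t.value                          # prints "woken"
```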

bashcoder commented 10 years ago

Yes, Resque does fork. The main process watches Redis for queued jobs and forks a child process for each job it runs. That has caused thread-safety issues for me in the past, and I have to make sure that forks don't try to reuse any inherited TCP connections. This is one of the reasons some choose Sidekiq instead, since it uses threads rather than forks.

iconara commented 10 years ago

@bashcoder Resque? He's using Sidekiq.

bashcoder commented 10 years ago

Apparently I've been drinking! Wow... :-)

rajofchennai commented 10 years ago

Hey I am still trying to debug this. I get a different sort of error.

A ThreadError occurred in background at 2014-01-08 12:30:56 UTC :

  Attempt to unlock a mutex which is locked by another thread
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'

  -------------------------------
Backtrace:
-------------------------------

  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `block in pop'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `handle_interrupt'
  /home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `pop'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/future.rb:344:in `value'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `block in execute'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:8:in `synchronous_backtrace'
  /home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `execute'

I thought that if the mutex is locked by another thread, the current thread should block, not throw an error.
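That expectation is right for the normal path; contending on a Mutex held by another thread blocks the caller rather than raising:

```ruby
# When a Mutex is held by another thread, #lock (and #synchronize)
# blocks the caller until the lock is released; it does not raise.
m = Mutex.new
m.lock                                   # main thread takes the lock
t = Thread.new { m.synchronize { :ok } } # this thread parks, waiting
sleep 0.1                                # let it reach the contended lock
m.unlock                                 # now the waiter can proceed
puts t.value                             # prints "ok"
```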

iconara commented 10 years ago

Yeah, there's not much that's making sense with this. At this point I think there's another error that is being masked by the mutex logic. One possible way for this to happen is that an error is raised when the thread attempts to take the lock when it is woken up. This error is then masked by the error you see. Since this happens deep down in the C code it will be very hard to debug without knowing your way around MRI (which I don't).

Here's a rough translation of what I think happens, translated into pseudo-Ruby:

begin
  # Queue#pop calls Mutex#synchronize
  # the synchronize block takes the lock
  lock_mutex
  # then yields back to Queue#pop again, which does its work
  # and eventually calls ConditionVariable#wait
  # which unlocks the mutex
  unlock_mutex
  begin
    # and puts the thread to sleep
    sleep
    # when we wake up (because of Queue#push -> ConditionVariable#signal -> Thread#run)
  ensure
    # we end up in the ensure block, trying to take the lock again
    lock_mutex
    # and this is where I think it goes wrong
    # if an error is raised here
  end
  # code here will not run
ensure
  # and we'll end up in Mutex#synchronize's ensure block
  # where the lock is not held by this thread
  unlock_mutex
  # and the ThreadError is raised, masking the real error
end

I can't say for sure this is what happens, but it looks to me like a possibility. I don't know what error could be raised in the first ensure (out of memory is one possibility, but there must be others), but I think at least that the scenario I've described is possible. I'll try to see if I can get any tips on how to discover if this is what actually happens, and if so what the underlying error is.
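The masking itself is plain Ruby semantics: an exception raised inside an ensure block replaces the exception that was already propagating. A minimal illustration (the error classes here are just stand-ins):

```ruby
def flaky
  raise IOError, 'the real problem'
ensure
  # if this raises, it replaces the IOError that was propagating
  raise ThreadError, 'Attempt to unlock a mutex which is not locked'
end

begin
  flaky
rescue => e
  puts e.class          # ThreadError -- the IOError is hidden
  puts e.cause.class    # IOError (Exception#cause, Ruby 2.1+)
end
```

On Ruby 2.0, where this issue was reported, Exception#cause does not exist yet, so the original error is lost entirely.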

iconara commented 10 years ago

I haven't been able to figure anything out, apart from what I said in the last comment. I think there's another exception that is being masked by the mutex. The error could be an interrupt, out of memory, or some other kind of system level error. That is unless you're forking the process, in which case that is the problem.

The best advice I can give at this point, unless you're willing to spend more time debugging and trying to determine the underlying error that is masked, is to retry the operations when the error happens.
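A retry wrapper along these lines is one way to do that (execute_with_retry and its attempts parameter are illustrative, not part of cql-rb's API):

```ruby
# Retry the synchronous call a few times when the spurious
# ThreadError surfaces; re-raise once the attempts are used up.
def execute_with_retry(client, cql, attempts = 3)
  client.execute(cql)
rescue ThreadError => e
  attempts -= 1
  retry if attempts > 0
  raise e
end
```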

Another way is to use the experimental asynchronous API, which bypasses the mutex that is causing the error. To use the async API you can just add async. before execute. You'll get a future back, and not a value, so you need to attach a listener to the future to get the value. You'll also need to do locking and synchronization yourself:

future = client.async.execute('SELECT * FROM foo')
future.on_value do |result|
  # result is the same as what the non-async call would return
  # this block is called asynchronously, so you'll need to
  # make sure you pass the result in a thread safe manner
  # also: you must absolutely not do any blocking operations
  # because that will block the IO reactor and no data will be
  # sent to or received from Cassandra
  # USE AT YOUR OWN RISK AND READ THE DOCS IN THE CODE
end
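One thread-safe way to hand the value back to a calling thread is to push it onto a Queue inside the callback and pop outside the reactor. The StubFuture below is a hypothetical stand-in for the future returned by client.async.execute, only so the sketch runs on its own:

```ruby
# Hypothetical stand-in for the future returned by client.async.execute;
# it just invokes the callback immediately with a canned result.
class StubFuture
  def on_value
    yield [{'id' => 1}]
  end
end

future = StubFuture.new

# Hand the result off through a Queue (which is thread safe) instead of
# touching shared state from inside the reactor callback:
result_queue = Queue.new
future.on_value { |result| result_queue << result }  # non-blocking handoff
rows = result_queue.pop    # the calling thread waits here for the value
puts rows.first['id']      # prints 1
```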
iconara commented 10 years ago

I'm closing this due to lack of input; I can't reproduce it.

sodabrew commented 9 years ago

The same problem has been reported upstream. If you have any more information that could help the upstream Ruby developers get a handle on it, please post to: https://bugs.ruby-lang.org/issues/9132 https://bugs.ruby-lang.org/issues/9432