Closed rajofchennai closed 10 years ago
I'm not sure this is a bug in cql-rb, it looks like it's completely inside Queue#pop
. The mutex that complains is completely owned by the Queue
instance. And even more strangely the error is raised by Mutex#synchronize
, which should guarantee that the locking and unlocking is done safely.
Is there any part of this application that forks? Does it run in a preforking server like Unicorn or Passenger? Forking could explain the issue, but if your server forks you should have seen issues with closed sockets too.
(this is just some notes that can be useful tracking this down later)
This is the line that breaks: https://github.com/ruby/ruby/blob/v2_0_0_247/lib/thread.rb#L188
Mutex#synchronize
: https://github.com/ruby/ruby/blob/v2_0_0_247/thread.c#L4517
Mutex#sleep
: https://github.com/ruby/ruby/blob/v2_0_0_247/thread.c#L4468
The chain of calls going to the mutex should look like:
Queue#pop
calls #synchronize
which calls rb_mutex_lock
and yieldsConditionVariable#wait
, which calls rb_mutex_sleep
rb_mutex_sleep
calls rb_mutex_unlock
and then rb_mutex_sleep_forever
(and registers rb_mutex_lock
to be called when it wakes up), which calls native_sleep
(which sleeps and handles spurious wakeups from the looks of it)#pop
is asleepQueue#push
, which calls ConditionVariable#signal
, which calls rb_thread_run
, which I assume makes the thread exit its native_sleep
rb_mutex_lock
is called#synchronize
exits and calls rb_mutex_unlock
The error could be raised form one of two places: either when the condition variable calls Mutex#sleep
, or when Mutex#synchronize
exits. It looks to me like it's the latter since the error comes from the line that calls #synchronize
and not from the line that calls #sleep
.
There is a window of opportunity for race conditions in that rb_thread_run
can be called (from another thread) when the thread is in rb_mutex_sleep
and has released the lock, but before it has called rb_sleep_forever
. On the other hand, wouldn't the ensure handler (i.e. rb_mutex_lock
) be called regardless? The sleep would be a no-op, but it would still lock the mutex why would the mutex remain unlocked?
Yes, Resque does fork. The main process watches Redis for queued jobs and forks a child process for each job it runs. That has caused thread-safety issues for me in the past, and I have to make sure that forks don't try to reuse any inherited TCP connections. This is one of the reasons some choose Sidekiq instead, due to its use of threads instead of forks.
@bashcoder Resque? He's using Sidekiq.
Apparently I've been drinking! Wow... :-)
Hey I am still trying to debug this. I get a different sort of error.
A ThreadError occurred in background at 2014-01-08 12:30:56 UTC :
Attempt to unlock a mutex which is locked by another thread
/home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'
-------------------------------
Backtrace:
-------------------------------
/home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `synchronize'
/home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:188:in `block in pop'
/home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `handle_interrupt'
/home/ubuntu/.rvm/rubies/ruby-2.0.0-p247/lib/ruby/2.0.0/thread.rb:187:in `pop'
/home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/future.rb:344:in `value'
/home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `block in execute'
/home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:8:in `synchronous_backtrace'
/home/ubuntu/.rvm/gems/ruby-2.0.0-p247@spaghetti/ruby/2.0.0/gems/cql-rb-1.1.3/lib/cql/client/synchronous_client.rb:54:in `execute'
I thought If the semaphore is locked by another thread, the current thread should block and not throw an error.
Yeah, there's not much that's making sense with this. At this point I think there's another error that is being masked by the mutex logic. One possible way for this to happen is that an error is raised when the thread attempts to take the lock when it is woken up. This error is then masked by the error you see. Since this happens deep down in the C code it will be very hard to debug without knowing your way around MRI (which I don't).
Here's a rough translation of what I think happens, translated into pseudo-Ruby:
begin
# Queue#pop calls Mutex#synchronize
# the synchronize block takes the lock
lock_mutex
# then yields back to Queue#pop again, which does it's work
# and eventually calls ConditionVariable#wait
# which unlocks the mutex
unlock_mutex
begin
# and puts the thread to sleep
sleep
# when we wake up (because of Queue#push -> ConditionVariable#signal -> Thread#run)
ensure
# we end up in the ensure block, trying to take the lock again
lock_mutex
# and this is where I tink it goes wrong
# if an error is raised here
end
# code here will not run
ensure
# and we'll en up in Mutex#synchronize's ensure block
# where the lock is not held by this thread
unlock_mutex
# and the ThreadError is raised, masking the real error
end
I can't say for sure this is what happens, but it looks to me like a possibility. I don't know what error could be raised in the first ensure
(out of memory is one possibility, but there must be others), but I think at least that the scenario I've described is possible. I'll try to see if I can get any tips on how to discover if this is what actually happens, and if so what the underlying error is.
I haven't been able to figure anything out, apart from what I said in the last comment. I think there's another exception that is being masked by the mutex. The error could be an interrupt, out of memory, or some other kind of system level error. That is unless you're forking the process, in which case that is the problem.
The best advice I can give at this point, unless you're willing to spend more time debugging and trying to determine the underlying error that is masked, is to retry the operations when the error happens.
Another way is to use the experimental asynchronous API, which bypasses the mutex that is causing the error. To use the async API you can just add async.
before execute
. You'll get a future back, and not a value, so you need to attach a listener to the future to get the value. You'll also need to do locking and synchronization yourself:
future = client.async.execute('SELECT * FROM foo')
future.on_value do |result|
# result is the same as what the non-async call would return
# this block is called asynchronously, so you'll need to
# make sure you pass the result in a thread safe manner
# also: you must absolutely not do any blocking operations
# because that will block the IO reactor and no data will be
# sent to or received from Cassandra
# USE AT YOUR OWN RISK AND READ THE DOCS IN THE CODE
end
I'm closing this because of lack of input. I can't reproduce.
Same problem has been reported upstream. If you have any more information that can help upstream Ruby devs get a handle on the problem, please post to: https://bugs.ruby-lang.org/issues/9132 https://bugs.ruby-lang.org/issues/9432
I am getting this error randomly from my application. I am using ruby2.0.0. This happens when I use cql-rb in a threaded environment (sidekiq).