Closed ioquatix closed 1 year ago
This is since Ruby 3.0 isn't it?
Not massively looking to working through the significance of this change...
Would you be willing to start by working with me to redefine what you think the specs should be now?
IIRC, different versions of Ruby did different things and 3.0 standardised the behaviour. On some platforms where fibers were backed by threads, mutex was already per-fiber.
I checked the implementation a little bit and it looks okay to me, you are using Mutex
correctly.
Maybe it's a genuine usage error on the part of rails, let me check their code.
I've updated the original discussion of this issue, because I'm no longer sure that the issue is with this code.
This seems to reproduce the original issue, at least the error is the same:
require 'concurrent'
lock = Concurrent::ReentrantReadWriteLock.new
lock.acquire_read_lock
puts "Lockity lock!"
Thread.new do
lock.release_read_lock
end.join
The question is, is this what's happening?
Cannot release a read lock which is not held
That thread doesn't hold the lock so can't release it.
Yes, I understand that, but why is it not happening on the same thread?
Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?
Ok I think we're on the same page now. Yeah sounds like an application bug.
You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.
Sorry about my lack of clarity, I don't really know what's going on fully yet.
Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?
Yes, I'm trying to reproduce possible scenarios which could create the behaviour I'm seeing so I know what to look for.
Ok I think we're on the same page now. Yeah sounds like an application bug.
Yes, maybe in Rails or maybe the way Rails is being invoked by Falcon.
You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.
Yes, that's the plan. I don't have a full repro, but I might try hacking something together.
Okay, I reviewed all the code and was able to make what appears to be a repro:
#!/usr/bin/env ruby
require 'concurrent'
require 'async'
lock = Concurrent::ReentrantReadWriteLock.new
Q = 0.01
watcher = Thread.new do
loop do
lock.with_write_lock do
sleep Q
end
sleep Q
end
end
Async do |task|
2.times do |i|
task.async do
loop do
lock.with_read_lock do
sleep Q
end
sleep Q
end
end
end
end
Results in:
0>1>0-1- 0.0s warn: Async::Task [oid=0xa0] [ec=0xb4] [pid=857111] [2022-10-18 13:43:20 +1300]
| Task may have ended with unhandled exception.
| Concurrent::IllegalOperationError: Cannot release a read lock which is not held
| → /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:244 in `release_read_lock'
| /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:130 in `with_read_lock'
| bug.rb:23 in `block (4 levels) in <main>'
| bug.rb:21 in `loop'
| bug.rb:21 in `block (3 levels) in <main>'
| /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:107 in `block in run'
| /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:243 in `block in schedule'
After that the lock is totally jammed up.
Okay, I think I have an idea of the problem.
The thread local variable is per-thread, while the synchronisation is per fiber. I think this must be leading to some kind of invalid state.
Okay, that appears to be the problem. The fix in this case is to use fiber local variables:
require 'thread'
require 'concurrent/atomic/abstract_thread_local_var'
module Concurrent
# @!visibility private
# @!macro internal_implementation_note
class RubyThreadLocalVar < AbstractThreadLocalVar
def initialize(...)
super
@key = :"concurrent-ruby-#{object_id}"
end
def value
Thread.current[@key] || default
end
def value=(value)
Thread.current[@key] = value
end
def allocate_storage
# No-op.
end
end
end
I just monkey patched this into my local gem and it fixed the issue.
The current implementation of thread locals looks a bit... over engineered? Maybe it was that way for backwards compatibility?
In any case, the locking is per fiber, and the state used for tracking that locking is per-thread, so any time you mix the two, it's bound to eventually end in disaster. The solution is to use per-fiber state for the lock implementation.
@chrisseaton any thoughts on the next steps?
I'm a bit worried about how deep this goes.
Yes likely complexity is about backwards-compatibility. I'd like to add a simple core set of primitives to MRI for the long-term but that's another discussion...
the locking is per fiber, and the state used for tracking that locking is per-thread
That's probably the central thing to think about. I'd like to be able to write a spec that we want to meet. Would you be able to help me with that? Even just the plain-English description of the behaviour you want.
Even just the plain-English description of the behaviour you want.
(It's not what I want, it's what's defined by CRuby).
Mutex locking is per-fiber.
Therefore, if you are storing state "per lock" you need to store it per fiber.
That's what the above monkey patch does.
@eregon can I please pull your expertise into this discussion.
I think for ReentrantReadWriteLock and others which use ThreadLocalVar, we should use FiberLocalVar (to be added, https://github.com/ruby-concurrency/concurrent-ruby/pull/376#issuecomment-126322710) if Mutex is per Fiber. That can be easily detected with
m = Mutex.new
mutex_owned_per_thread = m.synchronize { Fiber.new { m.owned? }.resume }
I think then the compatibility concerns are non-existent because using a ThreadLocalVar for per-Mutex-state on 3.0+ is just a bug.
I don't have any objection to this fix if that's what you're thinking by the way. Just moving cautiously on a subject that is usually more complex than it appears.
Thanks!
Hey folks, we just came across an issue with v1.2.0 in JRuby that seems to be related to this. Our code uses ReentrantReadWriteLock; it worked fined under v1.1.10, but in v1.2.0 we're now seeing this error:
undefined method `current' for Fiber:Class
Did you mean? __current__
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:180:in `locals!'"
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:102:in `set',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/fiber_local_var.rb:77:in `value=',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:205:in `acquire_read_lock',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:128:in `with_read_lock',
...
At first glance it looks like this could be mitigated by switching from Fiber.current
to Thread.current
with some sort of environment evaluation, ie
current_ctx = (RUBY_PLATFORM == 'java' ? Thread.current : Fiber.current)
ObjectSpace.define_finalizer(current_ctx, thread_fiber_finalizer(locals.object_id))
However, I have no idea how much deeper this goes, if this would need to be swapped out in multiple places, etc.
The above PR should fix this issue.
Somehow totally missed that, thank you!
The missing require only affects Ruby <= 3.0, 3.1+ always have Fiber.current
defined.
And the reason we missed this is rake spec:isolated
is only run 3.2 currently in CI, I'll also run it on the oldest Ruby we support to help catch this.
Ruby mutex implementation is per-fiber and generally speaking, all locking mechanisms should be per-fiber for the purpose of mutual exclusion.
As such, it appears at least some parts of concurrent ruby might not be correct, according to this specification.
I've been looking at https://github.com/ruby-concurrency/concurrent-ruby/blob/385156117c110808875e1e87d5267147f696f796/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb#L51 because we had some reports here https://github.com/socketry/falcon/issues/166
Specifically
It appears that multi-fiber re-entrancy is not supported or handled correctly, or used incorrectly.
I'm not sure what the correct solution is here, but at least we can start having a discussion about how it should work and what the solution is.