ruby-concurrency / concurrent-ruby

Modern concurrency tools including agents, futures, promises, thread pools, supervisors, and more. Inspired by Erlang, Clojure, Scala, Go, Java, JavaScript, and classic concurrency patterns.
https://ruby-concurrency.github.io/concurrent-ruby/
Other
5.68k stars 418 forks source link

Ruby locking is per-fiber, not per-thread. #962

Closed ioquatix closed 1 year ago

ioquatix commented 1 year ago

Ruby mutex implementation is per-fiber and generally speaking, all locking mechanisms should be per-fiber for the purpose of mutual exclusion.

As such, it appears at least some parts of concurrent ruby might not be correct, according to this specification.

I've been looking at https://github.com/ruby-concurrency/concurrent-ruby/blob/385156117c110808875e1e87d5267147f696f796/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb#L51 because we had some reports here https://github.com/socketry/falcon/issues/166

Specifically

      "kind":"Concurrent::IllegalOperationError",
      "message":"Cannot release a read lock which is not held",

It appears that multi-fiber re-entrancy is not supported or handled correctly, or used incorrectly.

I'm not sure what the correct solution is here, but at least we can start having a discussion about how it should work and what the solution is.

chrisseaton commented 1 year ago

This is since Ruby 3.0 isn't it?

Not massively looking to working through the significance of this change...

Would you be willing to start by working with me to redefine what you think the specs should be now?

ioquatix commented 1 year ago

IIRC, different versions of Ruby did different things and 3.0 standardised the behaviour. On some platforms where fibers were backed by threads, mutex was already per-fiber.

I checked the implementation a little bit and it looks okay to me, you are using Mutex correctly.

Maybe it's a genuine usage error on the part of rails, let me check their code.

ioquatix commented 1 year ago

I've updated the original discussion of this issue, because I'm no longer sure that the issue is with this code.

ioquatix commented 1 year ago

This seems to reproduce the original issue, at least the error is the same:

require 'concurrent'

lock = Concurrent::ReentrantReadWriteLock.new
lock.acquire_read_lock

puts "Lockity lock!"

Thread.new do
  lock.release_read_lock
end.join

The question is, is this what's happening?

chrisseaton commented 1 year ago

Cannot release a read lock which is not held

That thread doesn't hold the lock so can't release it.

ioquatix commented 1 year ago

Yes, I understand that, but why is it not happening on the same thread?

chrisseaton commented 1 year ago

Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?

Ok I think we're on the same page now. Yeah sounds like an application bug.

You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.

ioquatix commented 1 year ago

Sorry about my lack of clarity, I don't really know what's going on fully yet.

Are you saying you're aware that the last code snippet you posted is broken - you're saying that something else in some other code is making the same mistake as this example code, and that's what you're trying to fix?

Yes, I'm trying to reproduce possible scenarios which could create the behaviour I'm seeing so I know what to look for.

Ok I think we're on the same page now. Yeah sounds like an application bug.

Yes, maybe in Rails or maybe the way Rails is being invoked by Falcon.

You could probably add some error-handling code to say which thread is holding it and which is trying to release it (might take some extra book-keeping) - I'd try that.

Yes, that's the plan. I don't have a full repro, but I might try hacking something together.

ioquatix commented 1 year ago

Okay, I reviewed all the code and was able to make what appears to be a repro:

#!/usr/bin/env ruby

require 'concurrent'
require 'async'

lock = Concurrent::ReentrantReadWriteLock.new
Q = 0.01

watcher = Thread.new do
  loop do
    lock.with_write_lock do
      sleep Q
    end
    sleep Q
  end
end

Async do |task|
  2.times do |i|
    task.async do
      loop do
        lock.with_read_lock do
          sleep Q
        end
        sleep Q
      end
    end
  end
end

Results in:

0>1>0-1-  0.0s     warn: Async::Task [oid=0xa0] [ec=0xb4] [pid=857111] [2022-10-18 13:43:20 +1300]
               | Task may have ended with unhandled exception.
               |   Concurrent::IllegalOperationError: Cannot release a read lock which is not held
               |   → /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:244 in `release_read_lock'
               |     /home/samuel/.gem/ruby/3.1.2/gems/concurrent-ruby-1.1.10/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:130 in `with_read_lock'
               |     bug.rb:23 in `block (4 levels) in <main>'
               |     bug.rb:21 in `loop'
               |     bug.rb:21 in `block (3 levels) in <main>'
               |     /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:107 in `block in run'
               |     /home/samuel/.gem/ruby/3.1.2/gems/async-2.2.1/lib/async/task.rb:243 in `block in schedule'

After that the lock is totally jammed up.

ioquatix commented 1 year ago

Okay, I think I have an idea of the problem.

The thread local variable is per-thread, while the synchronisation is per fiber. I think this must be leading to some kind of invalid state.

ioquatix commented 1 year ago

Okay, that appears to be the problem. The fix in this case is to use fiber local variables:

require 'thread'
require 'concurrent/atomic/abstract_thread_local_var'

module Concurrent

  # @!visibility private
  # @!macro internal_implementation_note
  class RubyThreadLocalVar < AbstractThreadLocalVar
    def initialize(...)
      super
      @key = :"concurrent-ruby-#{object_id}"
    end

    def value
      Thread.current[@key] || default
    end

    def value=(value)
      Thread.current[@key] = value
    end

    def allocate_storage
      # No-op.
    end
  end
end

I just monkey patched this into my local gem and it fixed the issue.

The current implementation of thread locals looks a bit... over engineered? Maybe it was that way for backwards compatibility?

In any case, the locking is per fiber, and the state used for tracking that locking is per-thread, so any time you mix the two, it's bound to eventually end in disaster. The solution is to use per-fiber state for the lock implementation.

ioquatix commented 1 year ago

@chrisseaton any thoughts on the next steps?

chrisseaton commented 1 year ago

I'm a bit worried about how deep this goes.

Yes likely complexity is about backwards-compatibility. I'd like to add a simple core set of primitives to MRI for the long-term but that's another discussion...

the locking is per fiber, and the state used for tracking that locking is per-thread

That's probably the central thing to think about. I'd like to be able to write a spec that we want to meet. Would you be able to help me with that? Even just the plain-English description of the behaviour you want.

ioquatix commented 1 year ago

Even just the plain-English description of the behaviour you want.

(It's not what I want, it's what's defined by CRuby).

Mutex locking is per-fiber.

Therefore, if you are storing state "per lock" you need to store it per fiber.

That's what the above monkey patch does.

ioquatix commented 1 year ago

@eregon can I please pull your expertise into this discussion.

eregon commented 1 year ago

I think for ReentrantReadWriteLock and others which use ThreadLocalVar, we should use FiberLocalVar (to be added, https://github.com/ruby-concurrency/concurrent-ruby/pull/376#issuecomment-126322710) if Mutex is per Fiber. That can be easily detected with

m = Mutex.new
mutex_owned_per_thread = m.synchronize { Fiber.new { m.owned? }.resume }

I think then the compatibility concerns are non-existent because using a ThreadLocalVar for per-Mutex-state on 3.0+ is just a bug.

chrisseaton commented 1 year ago

I don't have any objection to this fix if that's what you're thinking by the way. Just moving cautiously on a subject that is usually more complex than it appears.

eregon commented 1 year ago

https://github.com/ruby-concurrency/concurrent-ruby/releases/tag/v1.2.0

ioquatix commented 1 year ago

Thanks!

hoprocker commented 1 year ago

Hey folks, we just came across an issue with v1.2.0 in JRuby that seems to be related to this. Our code uses ReentrantReadWriteLock; it worked fined under v1.1.10, but in v1.2.0 we're now seeing this error:

undefined method `current' for Fiber:Class
Did you mean?  __current__

rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:180:in `locals!'"
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/locals.rb:102:in `set',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/fiber_local_var.rb:77:in `value=',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:205:in `acquire_read_lock',
rubygems/concurrent-ruby-1.2.0/lib/concurrent-ruby/concurrent/atomic/reentrant_read_write_lock.rb:128:in `with_read_lock',
...

link

At first glance it looks like this could be mitigated by switching from Fiber.current to Thread.current with some sort of environment evaluation, ie

current_ctx = (RUBY_PLATFORM == 'java' ? Thread.current : Fiber.current)
ObjectSpace.define_finalizer(current_ctx, thread_fiber_finalizer(locals.object_id))

However, I have no idea how much deeper this goes, if this would need to be swapped out in multiple places, etc.

ioquatix commented 1 year ago

The above PR should fix this issue.

hoprocker commented 1 year ago

Somehow totally missed that, thank you!

eregon commented 1 year ago

The missing require only affects Ruby <= 3.0, 3.1+ always have Fiber.current defined. And the reason we missed this is rake spec:isolated is only run 3.2 currently in CI, I'll also run it on the oldest Ruby we support to help catch this.

eregon commented 1 year ago

Released in https://github.com/ruby-concurrency/concurrent-ruby/releases/tag/v1.2.1