rspec / rspec-support

Common code needed by the other RSpec gems. Not intended for direct use.
https://rspec.info
MIT License
96 stars 104 forks source link

ThreadError: Attempt to unlock a mutex which is locked by another thread/fiber #501

Closed ioquatix closed 3 years ago

ioquatix commented 3 years ago

Subject of the issue

image

Your environment

Steps to reproduce

This is a tricky issue to reproduce.

Essentially it looks like some of the assumptions made by reentrant_mutex.rb are no longer true in Ruby 3.0. I will review the code and try to give an update, but essentially it looks like trying to lock a ReentrantMutex from a different fiber on the same thread might be a problem.

ioquatix commented 3 years ago

On Ruby 3 here is an example of how to produce the issue:

def mutex_failure
    m = Mutex.new

    f1 = Fiber.new do
        m.lock
    end

    f2 = Fiber.new do
        m.unlock
    end

    f1.resume
    f2.resume
end

mutex_failure

# ./test.rb:39:in `unlock': Attempt to unlock a mutex which is locked by another thread/fiber (ThreadError)
#   from ./test.rb:39:in `block in mutex_failure'
ioquatix commented 3 years ago

Okay, with that basic semantic, I managed to figure out the sequence of events which cause this to fail:

#!/usr/bin/env ruby

require 'rspec/support/reentrant_mutex'

def rspec_failure
    m = RSpec::Support::ReentrantMutex.new

    f1 = Fiber.new do
        m.synchronize do
            puts "f1 A"
            Fiber.yield
            puts "f1 B"
        end
    end

    f2 = Fiber.new do
        m.synchronize do
            puts "f2 A"
            # Fiber.yield
            f1.resume
            puts "f2 B"
        end
    end

    f1.resume
    f2.resume
    f2.resume
end

rspec_failure

gives

f1 A
f2 A
f1 B
f2 B
/Users/samuel/.gem/ruby/3.0.0/gems/rspec-support-3.10.2/lib/rspec/support/reentrant_mutex.rb:40:in `unlock': Attempt to unlock a mutex which is locked by another thread/fiber (ThreadError)
    from /Users/samuel/.gem/ruby/3.0.0/gems/rspec-support-3.10.2/lib/rspec/support/reentrant_mutex.rb:40:in `exit'
    from /Users/samuel/.gem/ruby/3.0.0/gems/rspec-support-3.10.2/lib/rspec/support/reentrant_mutex.rb:25:in `synchronize'
    from ./test.rb:17:in `block in rspec_failure'
ioquatix commented 3 years ago

cc @eregon I believe we discussed that there should be no use case where a mutex is locked from one fiber and released on another, but maybe this is a valid use case? Does this mean we should relax the condition, or can you see some other solution?

ioquatix commented 3 years ago
#!/usr/bin/env ruby

require 'monitor'

def monitor_failure
    m = Monitor.new

    f1 = Fiber.new do
        m.synchronize do
            puts "f1 A"
            Fiber.yield
            puts "f1 B"
        end
    end

    f2 = Fiber.new do
        m.synchronize do
            puts "f2 A"
            # Fiber.yield
            f1.resume
            puts "f2 B"
        end
    end

    f1.resume
    f2.resume
    f2.resume
end

monitor_failure

Maybe not a general bug with RSpec, we should consider moving to bugs.ruby-lang.org.

eregon commented 3 years ago

The m.synchronize should probably block (forever in this example) because f1 has the monitor. Might a bug of the monitor stdlib. if you replace Monitor.new with Mutex.new then CRuby detects a deadlock (correct behavior), and would otherwise wait if it does not detect the deadlock.

The example code here does have an intrinsic problem: a mutex/monitor section is supposed to be atomic but here the atomicity is broken by switching fibers (specifically f1 A happens without f1 B, a data structure could become corrupted if those puts are replaced by mutating state/ivars).

eregon commented 3 years ago

Fiber.yield inside synchronize is probably best to avoid in general, but I can imagine that's not always possible.

eregon commented 3 years ago

After more thoughts, this example does not acquire the lock in a reentrant way (=on the same execution stack, fibers have different stacks), so it should behave exactly the same as Mutex, and the difference is a bug of Monitor (and ReentrantMutex).

So it should raise a deadlock error, or if there is a Fiber scheduler active then schedule another Fiber.

eregon commented 3 years ago

@ioquatix Could you create an issue on bugs.ruby-lang.org?

ioquatix commented 3 years ago

@eregon https://bugs.ruby-lang.org/issues/17827

ioquatix commented 3 years ago

The solution here is to change Thread.current to Fiber.current right?

ioquatix commented 3 years ago

Okay so here is the initial problem:

image

I tried adding the following code as a hack:

require 'rspec/support/reentrant_mutex'
require 'fiber'

module RSpec
    module Support
        class ReentrantMutex
            def enter
                @mutex.lock if @owner != Fiber.current
                @owner = Fiber.current
                @count += 1
            end
        end
    end
end

Now I get other issues I need to figure out.

ioquatix commented 3 years ago

Okay, I actually think using Fiber.current works correctly.

pirj commented 3 years ago

By the way, on Ruby 2.7.2 I'm getting a dead fiber called (FiberError) instead of Attempt to unlock a mutex which is locked by another thread/fiber. This is exactly what I would expect from the second f2.resume call. Is the second call necessary? On Ruby 3.0 it fails without the second f2.resume.

And I don't see that this situation changes if we replace Thread.current with Fiber.current in ReentrantMutex.

To illustrate this:

require 'fiber'

class ReentrantMutex
  def initialize
    @owner = nil
    @count = 0
    @mutex = Mutex.new
  end

  def synchronize
    enter
    yield
  ensure
    exit
  end

private

  def enter
    @mutex.lock if @owner != Fiber.current
    @owner = Fiber.current
    @count += 1
  end

  def exit
    @count -= 1
    return unless @count == 0
    @owner = nil
    @mutex.unlock
  end
end

m = ReentrantMutex.new

f1 = Fiber.new do
  m.synchronize do
    puts "f1 A"
    Fiber.yield
    puts "f1 B"
  end
  puts "f1 finish"
end

f2 = Fiber.new do
  m.synchronize do
    puts "f2 A"
    # Fiber.yield
    f1.resume
    puts "f2 B"
  end
  puts "f2 finish"
end

f1.resume
f2.resume

With Fiber.current:

$ ruby --version
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin18]
$ ruby 1.rb
f1 A
1.rb:29:in `unlock': Attempt to unlock a mutex which is locked by another thread/fiber (ThreadError)
        from 1.rb:29:in `exit'
        from 1.rb:14:in `ensure in synchronize'
        from 1.rb:14:in `synchronize'
        from 1.rb:45:in `block in <main>'
1.rb:20:in `lock': No live threads left. Deadlock? (fatal)
1 threads, 1 sleeps current:0x00007fec1b409480 main thread:0x00007fec1b409480
* #<Thread:0x00007fec1c07c0f0 sleep_forever>
   rb_thread_t:0x00007fec1b409480 native:0x000000010ca9f5c0 int:0

        from 1.rb:20:in `enter'
        from 1.rb:11:in `synchronize'
        from 1.rb:45:in `block in <main>'
$ ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
$ ruby 1.rb
f1 A
Traceback (most recent call last):
        3: from 1.rb:45:in `block in <main>'
        2: from 1.rb:11:in `synchronize'
        1: from 1.rb:20:in `enter'
1.rb:20:in `lock': deadlock; recursive locking (ThreadError)

With Thread.current:

$ ruby --version
ruby 3.0.0p0 (2020-12-25 revision 95aff21468) [x86_64-darwin18]
$ ruby 1.rb
f1 A
f2 A
f1 B
f1 finish
f2 B
1.rb:29:in `unlock': Attempt to unlock a mutex which is locked by another thread/fiber (ThreadError)
        from 1.rb:29:in `exit'
        from 1.rb:14:in `synchronize'
        from 1.rb:45:in `block in <main>'
$ ruby --version
ruby 2.7.1p83 (2020-03-31 revision a0c7c23c9c) [x86_64-darwin18]
$ ruby 1.rb
f1 A
f2 A
f1 B
f1 finish
f2 B
f2 finish

So from my point of view, replacing Thread.current with Fiber.current:

Am I missing something?

ioquatix commented 3 years ago

The piece you are missing is that with a fiber scheduler, locking a mutex on one fiber will allow other fibers to be scheduled.

pirj commented 3 years ago

Good point. I struggle to understand it from rdoc what is the default behaviour for the default fiber scheduler, is it non-blocking or blocking?

ioquatix commented 3 years ago

There is no default fiber scheduler, trying to lock a mutex, transferring to a different fiber and then trying to take the lock again is an error without a scheduler in place.

pirj commented 3 years ago

Also, speaking of re-entrance, it's puzzling why we nullify @owner. In a scenario like:

m.synchronize do
  m.synchronize { ... } # ok, re-entry of the same fiber
  Fiber.yield # other fiber locks the mutex. if the scheduled and the fiber are non-blocking, we proceed to the next line
  m.synchronize { ... } # BOOM, we lock again, because the `@owner` is the other fiber now
end
pirj commented 3 years ago

transferring to a different fiber and then trying to take the lock again is an error without a scheduler in place

We don't set_scheduler in the snippet above. Is Attempt to unlock a mutex which is locked by another thread/fiber an expected error in this case?

The sequence is:

Abstracting out from ReentrantMutex implementation, I would expect:

  1. for non-blocking scheduler/fibers the control to be returned to f1. When in releases the lock, f2.resume would resume it. No errors.
  2. for blocking, this is a deadlock, and that highlights the point that it's weird to call yield inside synchronize
eregon commented 3 years ago

The owner is only reset if count == 0, so it's not reset in your example: https://github.com/rspec/rspec-support/blob/d63133f478408c1d965e673b96ad10ef5a5d183f/lib/rspec/support/reentrant_mutex.rb#L38-L39

Your example is probably just a variant of @ioquatix's, Attempt to unlock a mutex which is locked by another thread/fiber is a bug of ReentrantMutex is lock/unlock are paired (or synchronize is used).

https://github.com/rspec/rspec-support/pull/502#issuecomment-827075412 is the proper fix.

ioquatix commented 3 years ago

Just to follow up with everyone here, https://bugs.ruby-lang.org/issues/17827 was back ported and should be in the next 3.0.x release.

eregon commented 1 year ago

@ioquatix See https://github.com/rspec/rspec-support/pull/553, maybe you could help to write a test or describe how to that represents the Fiber scheduler case + ReentrantLock? The current test doesn't use a Fiber scheduler and so doesn't really test anything interesting on Ruby 3.2+.

ioquatix commented 1 year ago

Thanks, I can take a look.