rspec / rspec-core

RSpec runner and formatters
http://rspec.info
MIT License

Threadsafe configuration can cause deadlocks #2064

Open 375gnu opened 9 years ago

375gnu commented 9 years ago

Hi.

I'm working on the Debian package for berkshelf-api. Debian currently ships rspec-* 3.3.0, and I can't run the test suite because it hangs (on this line: https://github.com/berkshelf/berkshelf-api/blob/master/spec/unit/berkshelf/api/cache_builder/worker/github_spec.rb#L51, https://github.com/berkshelf/berkshelf-api/blob/master/spec/support/human_reaable.rb#L3), while it works perfectly with 2.99.0.

I started to bisect and found that the hangs are caused by rspec-core commit ffe00a1d4.

myronmarston commented 9 years ago

Thanks for the clear bug report -- having it bisected down to one commit is a huge help!

@JoshCheek -- this report makes it sound like there may be a deadlock condition in the thread safety stuff you added. Are you able to take a look at this?

JoshCheek commented 9 years ago

Checking it out now. I have it reproduced locally. Difficult to tell what's going on, though, b/c Celluloid is doing some sort of metaprogramming that causes objects to not necessarily be what you think they are, and methods to disappear depending on where you're looking from. I'll spend another hour or two, but really hope I don't have to read all of the Celluloid source code. ...I should probably figure out what this gem actually does :P

JoshCheek commented 9 years ago

CCing @tarcieri, b/c it's not improbable that more people are going to hit a similar thing and report an issue like this.

Summary

The main thread acquires the mutex lock, but the code it runs instantiates a Celluloid actor (I think that's the right word); the actor call puts the main thread to sleep, and the actor's thread then winds up trying to acquire the same lock.
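
Boiled down to plain threads and a mutex (no RSpec or Celluloid involved), the shape of the deadlock is roughly:

```ruby
mutex  = Mutex.new
result = :did_not_finish

outer = Thread.new do
  mutex.synchronize do        # "main" thread takes the memoization lock...
    worker = Thread.new do    # ...then hands work off to another thread...
      mutex.synchronize { }   # ...which needs the very same lock
    end
    worker.join               # outer sleeps waiting for worker: deadlock
    result = :finished
  end
end

outer.join(0.5)               # give up waiting after half a second
result                        # still :did_not_finish; both threads are stuck
```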

High level explanation

(note: low level explanation is down below)

Kinda shitty, but easy solutions

Detailed description of the problem

From the main thread, we enter the spec here, which calls subject

it "does not start with Ruby object notation" do
  expect(subject.to_s).to_not start_with("#<")
end

We enter the definition of subject here, (it was overridden here), which calls __memoized.fetch_or_store(:subject)

define_method(name) { __memoized.fetch_or_store(name) { super(&nil) } }

We enter RSpec::Core::MemoizedHelpers::ThreadsafeMemoized#fetch_or_store here, which checks to see if the hash has a value for :subject. It doesn't as this is the first call, so it calls @mutex.synchronize, acquires the lock, checks again to see if it has been calculated (in case it got locked out and slept while some other thread calculated the value), and then yields to the block.

def fetch_or_store(key)
  @memoized.fetch(key) do # only first access pays for synchronization
    @mutex.synchronize do
      @memoized.fetch(key) { @memoized[key] = yield }
    end
  end
end

This takes us back to the block above, which calls super and hits the method that was defined from the block's body, taking us into the subject block here, which mocks out Octokit::Client.new to call connection, and then calls Berkshelf::API::CacheBuilder::Worker::Github.new.

subject do
  expect(Octokit::Client).to receive(:new) { connection }
  described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end

This takes it into Celluloid::ClassMethods#new here, because it was extended onto the class here, when Celluloid was included here, and the Worker::Github inherited this behaviour here. Inside Celluloid::ClassMethods#new, it gets an instance of Celluloid::Proxy::Cell here, and sends it :initialize.

def new(*args, &block)
  proxy = Cell.new(allocate, behavior_options, actor_options).proxy
  proxy._send_(:initialize, *args, &block)
  proxy
end

It delegates initialize to method_missing here, which puts the work of calling the real initialize method into some sort of queue and then asks for the value here, which ultimately puts the main thread to sleep here.

def response
  Celluloid.suspend(:callwait, self)
end

def value
  response.value
end

I think a pool eventually pulls it out here or maybe here (I had followed it for a while, but wound up getting lost again, and don't feel like going through it again). At this point, it is in one of Celluloid's threads.

thread = Thread.new do
  while proc = queue.pop
    proc.call
    # ...
  end
end

This eventually winds up calling into Berkshelf::API::CacheBuilder::Worker::Github#initialize here

def initialize(options = {})
  @connection   = Octokit::Client.new(access_token: options[:access_token], auto_paginate: true,
    api_endpoint: options[:api_endpoint], web_endpoint: options[:web_endpoint],
    connection_options: {ssl: {verify: options[:ssl_verify].nil? ? true : options[:ssl_verify]}})

  # ...
end

That goes through some layers of RSpec::Mocks, until it eventually finds its way back to the block defined in the spec, which calls connection, here

subject do
  expect(Octokit::Client).to receive(:new) { connection }
  described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end

The call to connection goes through the same path as the call to subject, until it eventually hits the memoized helper, tries to synchronize, and is put to sleep because the mutex is locked by the main thread, which is still sleeping, waiting for the current thread to return with the value of subject.

def fetch_or_store(key)
  @memoized.fetch(key) do # only first access pays for synchronization
    @mutex.synchronize do
      @memoized.fetch(key) { @memoized[key] = yield }
    end
  end
end

Sidestep the problem by changing design

Frankly, this code is much better than the vast majority of code that I see. But I get quite excited about testing and design, so I'll share the things I thought about as I considered how you could solve this issue.

There are certain design issues that will make tests painful. In some regard, there will always be pain (if you want to do anything interesting), but the trick is to segregate the painful thing from the rest of the code.

In this case, the pain comes from... well, a lot of things, TBH, but within code you control, it comes from having to mock out singleton methods on global objects. There are any number of ways to get out of having to do this, but the one that seems apparent to me right now is passing the connection in rather than taking all of its arguments and constructing it via the constant:

[screenshot 2015-08-20 00 29 24: suggested code passing the connection into the constructor]
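
The screenshot itself isn't preserved here, but the change being described, taking the connection as an argument instead of building it from options, might look roughly like this (the class shape and names are illustrative, not the real berkshelf-api code):

```ruby
# Illustrative sketch only: the worker receives its connection as a
# collaborator rather than constructing an Octokit::Client itself.
class Worker
  attr_reader :connection, :organization

  def initialize(connection:, organization:)
    @connection   = connection
    @organization = organization
  end
end

# In a test, any object can now stand in for the real client,
# with no stubbing of singleton methods on Octokit::Client:
fake_client = Object.new
worker = Worker.new(connection: fake_client, organization: "opscode-cookbooks")
```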

This is better because it absolves the code from having to know contextual details it can't always know. It lets the user choose how the connection is instantiated, so if they need to give it more or different options, they can easily do so. It also allows them to choose what your code will collaborate with... eg, to introduce a mock connection in a test, or talk to GH using some lib other than Octokit, if it is causing some issue that can't be worked around (we would create a wrapper that exposed the interface this code uses, but implemented it using the other lib).

Now, people generally have this opinion that the lib should have some slim interface, and thus dislike that a user might have to configure Octokit. I generally disagree, and would rather have access to the wiring than an initialization method that does all the work for me (I find such code usually has a high cost and low value when I write it, and usually prevents me from accomplishing my goal when I use other people's code that does this, as it makes incorrect assumptions about my context). But sometimes there is value in convenience like this. Fortunately, it's very easy to provide both; we just move that convenience code into a different constructor:

[screenshot 2015-08-20 00 27 43: suggested convenience constructor that does the Octokit wiring]
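
Again the screenshot is gone, but the "both" version described here, with the convenience wiring moved into a separate factory method, might look something like this (OctokitClientStandIn stands in for Octokit::Client so the sketch runs without the gem; all names are hypothetical):

```ruby
# Stand-in for Octokit::Client, purely so this sketch is self-contained.
OctokitClientStandIn = Struct.new(:access_token, keyword_init: true)

class Worker
  attr_reader :connection, :organization

  def initialize(connection:, organization:)
    @connection   = connection
    @organization = organization
  end

  # Convenience factory: does the client wiring for the common case,
  # while .new stays open for callers who want to supply their own.
  def self.from_options(options = {})
    client = OctokitClientStandIn.new(access_token: options[:access_token])
    new(connection: client, organization: options[:organization])
  end
end

worker = Worker.from_options(access_token: "asdf", organization: "opscode-cookbooks")
```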

Interestingly, this one change fixes the issue, as well, because the Celluloid thread is not created until .new is called, which is now separated from the Octokit instantiation.

The core insight to all of this is that when there is unit test pain, it should be looked at as an indication that there are issues with the design. And the solution is almost always the same: take the cause of the pain and push it up the callstack. This was one of probably 3 big realizations that took me forever to discover, but ultimately made testing easy, valuable, and compelling.

Once we push this pain up the callstack, it becomes the test's responsibility to provide the connection, which makes it trivial to mock out without needing to stub methods on singleton objects.

Integration code

Another thing to think about is that this is a wrapper around a 3rd party lib (Octokit), and is right at the boundary of the system. This is a big reason why we wind up with pain of this nature. It is good in that we segregated it in this one worker, so that the rest of the codebase doesn't have to deal with it, but notice how annoying it is to test. Literally every method we call on it is mocked out.

We have very low confidence that it actually works, because we don't control Octokit, but any change to Octokit could break berkshelf while our tests still pass, because our mocked methods still behave as expected. This is why people often say to avoid mocking code you don't own. Ideally, Octokit would provide us a mock client we could use, but I glanced through their code and didn't see one (looks like they use VCR and webmock).

My general experience is that code this close to the boundaries is best tested at the integration level. There are simply too many reasons for it to change, and we control very few of them. We could introduce more abstraction, but there doesn't appear to be much value at this point: the problem code is already pulled out of the rest of the system and into this worker. The code in the worker deals almost entirely with wiring, not with business logic.

So, probably something like VCR would be a good choice to increase the confidence in the integration code and reduce the volatility.

I probably wouldn't put in the effort to do this right now, given that you can also get it passing by just adding a ! to a let block, but if it winds up being flaky and a total headache, then it's probably worth reworking their tests and setup.

myronmarston commented 9 years ago

Wow, amazing sleuthing, @JoshCheek!

I think a simpler fix is to change:

expect(Octokit::Client).to receive(:new) { connection }

...to:

expect(Octokit::Client).to receive(:new).and_return(connection)

This should cause connection to be eagerly evaluated, before the celluloid actor puts the thread to sleep.
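
The difference between the two forms comes down to which thread evaluates connection: .and_return(connection) evaluates it immediately while the stub is being set up, whereas the block form defers it until the stub fires, possibly on another thread. A minimal illustration of that distinction, with no RSpec involved:

```ruby
# Eager: the argument expression runs right now, on the current thread
# (analogous to `.and_return(connection)`).
eager_thread = nil
eager_value  = begin
  eager_thread = Thread.current
  :connection
end

# Lazy: the block is stored and only runs when called later, here from
# a different thread (analogous to `receive(:new) { connection }` firing
# inside a Celluloid worker thread).
lazy_thread = nil
lazy_block  = -> { lazy_thread = Thread.current; :connection }
Thread.new { lazy_block.call }.join
```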

Is there something we can do in RSpec to either work around this issue or detect it and warn the user or something?

tarcieri commented 9 years ago

cc @digitalextremist @ioquatix

JoshCheek commented 9 years ago

Wow, amazing sleuthing, @JoshCheek!

I think a simpler fix is to change:

expect(Octokit::Client).to receive(:new) { connection }

...to:

expect(Octokit::Client).to receive(:new).and_return(connection)

This should cause connection to be eagerly evaluated, before the celluloid actor puts the thread to sleep.

Aye, good call.

Is there something we can do in RSpec to either work around this issue or detect it and warn the user or something?

Maybe. I doubt we could tell them what the right thing to do is, but it might be possible to detect the recursive deadlock and give them enough information to resolve it themselves. If the thread is going to ask for the lock from the ReentrantMutex, and the lock is held by one of the thread's ancestors, then we can at least know the deadlock will occur, and possibly give them some useful info about the context.

It's then a question of whether we can know a thread's ancestor threads (I don't see any way to get this info), how expensive it is to track / query, and whether we can get enough information to be useful (ie get the callstacks of the relevant threads).
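
As a sketch of what that detection might look like, assuming we track thread ancestry ourselves (Ruby doesn't expose a thread's parent, so the spawn helper and the whole mutex class here are hypothetical):

```ruby
# Hypothetical helper: spawn threads while recording their ancestry.
module TrackedThread
  def self.spawn(&block)
    parent = Thread.current
    Thread.new do
      Thread.current[:ancestors] = [parent, *(parent[:ancestors] || [])]
      block.call
    end
  end
end

# A mutex that raises a descriptive error, instead of sleeping forever,
# when the lock is already held by one of the current thread's ancestors.
class DeadlockAwareMutex
  def initialize
    @mutex = Mutex.new
    @owner = nil
  end

  def synchronize
    ancestors = Thread.current[:ancestors] || []
    if @owner && ancestors.include?(@owner)
      raise ThreadError, "deadlock: lock held by ancestor thread #{@owner.inspect}"
    end
    @mutex.synchronize do
      @owner = Thread.current
      begin
        yield
      ensure
        @owner = nil
      end
    end
  end
end

mutex = DeadlockAwareMutex.new
error = nil
mutex.synchronize do
  worker = TrackedThread.spawn do
    begin
      mutex.synchronize { }   # would deadlock; raises instead
    rescue ThreadError => e
      error = e
    end
  end
  worker.join
end
```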

Interestingly, a deadlock situation like this would normally raise fatal, which we could catch and then try to piece the situation together at that time (rather than every time a let block is calculated). I'm not really sure why this code doesn't, I assume Celluloid has another thread running somewhere.

totallymike commented 9 years ago

Is this a ticket that should still be open?

myronmarston commented 9 years ago

Is this a ticket that should still be open?

I don't believe we've solved it yet.

JoshCheek commented 9 years ago

I guess it depends on your definition of solved. I think we can put some amount of work in to inform a user when they hit this situation, but IDK if it is worth it, unless lots of users wind up hitting this. While I have a desire to get huffy and think "write better code", that doesn't really help anyone or negate the fact that it's systemic. Alternatively, maybe it's not systemic, and showing them how is fine. Anyway, if it pops up a lot, then probably trying to automate the conveyance of useful debugging information is worthwhile. If not, then pointers here and there are probably fine. Ping me next time it happens, I'll evaluate how effortful it would be to automate the deadlock detection (though I have a general opinion that communicating nuanced info like this is very difficult).

jaredjenkins commented 9 years ago

@tarcieri @JoshCheek I think I am running into a similar problem here: https://github.com/square/sqeduler/commit/12babd93e6148e7b1874c219790658f6252694c0. Threads deadlocking if I don't eagerly load the inputs which are provided via let blocks.

JoshCheek commented 9 years ago

Yeah, subject is made threadsafe, so when you call it, it locks other threads out while calculating the subject. However, within the subject calculation, a new thread is made, and then joined, blocking the subject thread, which has the lock. Within that new thread, it accesses wait_time, which needs the lock to be calculated. Thus, they deadlock (subject waiting for the thread to finish, while the thread waits for subject to release the lock).

# Here is a minimal reproduction
# note that `subject` here is a red herring, it is really the same as a let-block
RSpec.describe do
  subject { Thread.new { b }.join }
  let(:b) { }
  example { subject }
end

In your example, only the wait_time needs the eager calculation. Probably a more coherent solution is to not do these calculations in subject, which doesn't serve any obvious purpose in my mind: I would expect the subject to be a worker that we're asserting against, but really it's an array of threads, because no one ever uses the return value. So really, subject is just being used as a method to run some code. So maybe give it a more descriptive name and use it like a normal method. I gave it a shot here: https://github.com/JoshCheek/sqeduler/commit/4f40857f587d487e1de255c58191ce83aab7b9b7

jaredjenkins commented 9 years ago

Thanks @JoshCheek! This makes a ton of sense after looking over the code last night. Thanks for explanation.

xaviershay commented 7 years ago

So I'm thinking:

I don't want to leave this ticket open as is - it's not really clear what we'd fix inside RSpec.

Thoughts?

JoshCheek commented 7 years ago

I'm on board.

Ideally https://github.com/rspec/rspec-core/pull/2340 would have worked, providing the user with some feedback about why they're hitting the issue. However, it's definitely a complex domain, so I'm unsure if it's possible ("complex" here means that small changes in my understanding lead to wildly different conclusions).

JoshCheek commented 7 years ago

Oh, I think @myronmarston added the config, not me. Guess I should clarify whether we're talking about the configuration option or threadsafety as a feature?

myronmarston commented 7 years ago

Oh, I think @myronmarston added the config, not me.

I requested it, but you actually added it in #1858. At the time, the point of the config was to allow users to opt-out of threadsafe let because the locking adds some overhead.

Guess I should clarify whether we're talking about the configuration option or threadsafety as a feature?

My read of @xaviershay's comment is that he's suggesting we deprecate threadsafe let as a feature.

The config is another matter; if we plan to remove the threadsafe let feature, it won't make sense to support the config option at that time. But for now, we can't really deprecate it because the config option is the only way to opt-in to what would be the standard (and only) behavior in the future (e.g. via config.threadsafe = false).

Anyhow, I too appreciate the great work @JoshCheek did on this but agree with @xaviershay's conclusion.

JoshCheek commented 7 years ago

I opened an issue for this discussion so that we don't wind up hijacking this issue.

mttkay commented 2 years ago

Note that this problem is much more far-reaching with Ruby 3, because Ruby 3 changed locking semantics to be per-fiber instead of per-thread with this commit: https://github.com/ruby/ruby/commit/3a3b19b2bba49e5d6f1cf13764eb6dd701397be9

This means that if threadsafe is true, any test subject that is based on Fibers and calls back into a memoized property (let, subject) will trigger a deadlock. I came up with what I think is a minimal executable test case:

# frozen_string_literal: true

RSpec.describe 'Ruby 3 dead locks' do
  describe 'with fibers' do
    let(:value) { 42 }

    subject do
      # At this point, the test thread (and fiber) hold the lock to memoize subject.
      Fiber.new do
        value # since the inner Fiber is now `current`, it will try to memoize the let block on the same lock
      end.resume
    end

    it 'locks up' do
      subject
    end
  end
end

Running this with Ruby 2 works:

$ rbenv shell 2.7.5
$ rspec ruby3_fiber_deadlock.rb
.

Finished in 0.00238 seconds (files took 0.06749 seconds to load)
1 example, 0 failures

With Ruby 3.0.4+, it fails:

$ rbenv shell 3.0.4
$ rspec ruby3_fiber_deadlock.rb
F

Failures:

  1) Ruby 3 dead locks with fibers locks up
     Failure/Error: value

     ThreadError:
       Attempt to unlock a mutex which is locked by another thread/fiber
     # ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'
     # ------------------
     # --- Caused by: ---
     # fatal:
     #   No live threads left. Deadlock?
     #   1 threads, 1 sleeps current:0x00000000011fb080 main thread:0x00000000011fb080
     #   * #<Thread:0x0000000001266ff0 sleep_forever>
     #      rb_thread_t:0x00000000011fb080 native:0x00007f2a62680740 int:0
     #      
     #   ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'

Finished in 0.10269 seconds (files took 0.05068 seconds to load)
1 example, 1 failure

Failed examples:

rspec ./ruby3_fiber_deadlock.rb:14 # Ruby 3 dead locks with fibers locks up

From a Ruby perspective, this is intended behavior; if a co-routine reads a value from memory and then suspends, then even on the same thread without locking primitives another co-routine may resume and mutate that value, invalidating assumptions the first co-routine may have about that value. So even in the absence of multi-threading this can be an issue.
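
A tiny illustration of that point: even with a single thread, a fiber that suspends mid read-modify-write can lose an update made by whatever code runs in between (plain Ruby, no RSpec):

```ruby
counter = 0

fiber = Fiber.new do
  snapshot = counter        # read...
  Fiber.yield               # ...suspend mid read-modify-write...
  counter = snapshot + 1    # ...write back based on a stale value
end

fiber.resume                # fiber reads 0, then suspends
counter += 1                # main "co-routine" mutates the shared value
fiber.resume                # fiber writes 0 + 1, clobbering the increment

counter                     # => 1, not 2: the main fiber's update was lost
```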

I am not sure what RSpec should do about that; I agree that the threadsafe switch may not have enough utility to warrant users running into these problems, as they are notoriously difficult to debug. Note as well that Ruby uses fibers in stdlib types such as Enumerator, which relies on them to produce lazy streams of values. This is how we discovered this problem: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/102625

mttkay commented 2 years ago

@xaviershay Aside from the consideration to deprecate threadsafe, should we also default it to false? It currently defaults to true, and considering that Ruby 2 will reach EOL next March, it sounds like the benefits of threadsafe might not outweigh its drawbacks.

pirj commented 2 years ago

threadsafe currently defaults to true

Ruby 2 will reach EOL next March

You just surprised me twice, @mttkay, and both are valid statements ([1], [2]).

The upcoming RSpec 4.0 is a perfect time to change the default.

However, without rushing with this decision, what are the implications of making this option opt-in instead of opt-out by default? There is another amazing explanation from Josh. I may be reading this code incorrectly, but it appears to me that the difference in counter.count can be explained by different threads using different instances of Counter, in which case the total count would be off by a multiple of 1000.

In theory, an accessor defined by a non-threadsafe let could detect if it was called from somewhere outside the main thread/fiber. That would negate its performance benefits, but would let users know to add a :threadsafe flag on those examples that need it.

Another thing that doesn't seem related is parallel testing, as most of the current and WIP tools resort to sub-processes, not threads, for obvious reasons of mocking.

mttkay commented 2 years ago

The upcoming RSpec 4.0 is a perfect time to change the default.

Yep, agree!

However, without rushing with this decision, what are the implications of making this option opt-in instead of opt-out by default? There is another amazing explanation from Josh: https://github.com/rspec/rspec-core/pull/1858.

This example, and discussion, is just another instance of the old pro-GIL and contra-GIL argument. MRI takes an opinionated approach and says that you should free developers from thinking about these problems and not require a lock in this case. Others disagree and say that providing mutual exclusion in the presence of concurrency is not the VM's job, but the developer's job (or: a library's job; there are many ways to ensure thread-safety that do not require you to use locking primitives directly, such as atomic CAS, actors, reactive streams, async/await, etc.)

So, changing the default for threadsafe is somewhat equivalent to rspec-core taking a side in this argument. I am personally with camp-let-developers-decide, but that is just my personal opinion. Spinning out threads in test code sounds like an edge case; most Ruby based test suites will not do that. Of course some will, and for these people the implication will be that their tests will become flaky and they will need to ensure access to shared memory is handled correctly in these tests.

Another way to look at this is to apply the pareto principle: which use case is rspec trying to optimize for? The 0.1% or so who write multi-threaded test implementations, or the 99.9% who don't?

Locking also creates overhead; for very large test suites like the one GitLab runs, there will be hundreds of thousands of locking operations due to synchronized let and subject blocks. I am still looking into how threadsafe affects our pipeline durations, but this almost certainly has an impact on performance.

Another thing that doesn't seem related is parallel testing, as most of the current and WIP tools resort to sub-processes, not threads, for obvious reasons of mocking.

Hm, is this even affected? My understanding of parallel testing is that a test suite is subdivided and distributed across worker threads, but let and subject memoization only applies at the per-test level. So you can still distribute test methods across multiple threads and CPUs without running into integrity issues; or are there cases where rspec memoizes variables that will in fact be read across multiple test methods? Happy to be corrected here since I am not familiar with rspec internals, I just stumbled into this by working on Ruby 3 compatibility.

JonRowe commented 2 years ago

We are not changing the default to false; we actually should remove the false option entirely in RSpec 4, per the original plan for its implementation. It was originally introduced as opt-in and, if my memory is correct, became the default in RSpec 3.

Without threadsafe mode, what happens is you can end up with different values memoized in different threads (and thus presumably fibers) depending on race conditions.

We should fix the implementation to better detect or prevent deadlocks.
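
The race JonRowe describes can be forced with a bare memo hash, roughly the shape of a non-threadsafe fetch_or_store (the sleep just widens the window between check and store so the race fires reliably):

```ruby
# Non-threadsafe memoization: check the hash, compute, store.
memo  = {}
fetch = lambda do |key, &block|
  memo.fetch(key) { memo[key] = block.call }
end

ids     = []
barrier = Queue.new
threads = 2.times.map do
  Thread.new do
    barrier.pop                  # line both threads up at the start
    value = fetch.call(:value) do
      sleep 0.05                 # widen the check-then-store window
      Object.new
    end
    ids << value.object_id
  end
end
2.times { barrier.push(true) }
threads.each(&:join)

ids.uniq.size                    # 2: each thread memoized its own object
```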

mttkay commented 2 years ago

Without threadsafe mode, what happens is you can end up with different values memoized in different threads (and thus presumably fibers) depending on race conditions.

Do you have an example of how that could happen when multiple fibers but only a single thread are involved? If that is true, then I agree it may be safer to maintain the threadsafe toggle. Otherwise, you're essentially optimizing for the minority of users and use cases that use multi-threaded test implementations and/or Ruby VMs that do not use a GIL.

mttkay commented 2 years ago

@JonRowe Here is an executable test case that should demonstrate that the sample code posted in the original addition of threadsafe operates correctly at the fiber level:

I updated the example to interleave the increments, which goes to show that even when switching between fibers on each call to increment, there are no integrity problems in the absence of synchronized let blocks:

# frozen_string_literal: true
require 'fiber'

RSpec.configure do |config|
  config.threadsafe = false
end

class Counter
  def initialize
    @counter = 0
  end

  def increment
    puts "#{Fiber.current}: incr"
    @counter += 1
  end

  def count
    @counter
  end
end

RSpec.describe 'Consistent reads' do
  describe 'with fibers' do
    let(:counter) { Counter.new }

    it 'increments counter consistently' do
      fibers = 10.times.map do
        Fiber.new do
          1000.times do
            counter.increment
            Fiber.yield
          end
        end
      end

      while fibers.any?(&:alive?) do
        fibers.each(&:resume)
      end

      expect(counter.count).to eq 10_000
    end
  end
end

This test passes for me in both Ruby 2 and Ruby 3 (MRI). This is because all fibers are serialized in the current test thread and execute sequentially. There are no race conditions.

The problem described in the original MR is, as far as I understand, specific to multi-threaded test code in Rubies that do not use a GIL. In other words, deciding to enable threadsafe by default means rspec is taking a stance to optimize for these implementations by default. That is certainly a valid opinion to have, but it seems to be at odds with how the vast majority of developers use and deploy Ruby apps?

mttkay commented 2 years ago

Side-note: this may of course change if someone uses a different FiberScheduler. I have not tested this with community-provided schedulers like https://brunosutic.com/blog/ruby-fiber-scheduler. None of these are currently part of MRI, however, so again this is a choice a developer makes. In that case, the developer should make sure their test code is properly synchronized.