Open 375gnu opened 9 years ago
Thanks for the clear bug report -- having it bisected down to one commit is a huge help!
@JoshCheek -- this report makes it sound like there may be a deadlock condition in the thread safety stuff you added. Are you able to take a look at this?
Checking it out now. I have it reproduced locally. Difficult to tell what's going on, though, b/c Celluloid is doing some sort of metaprogramming that causes objects to not necessarily be what you think they are, and methods to disappear depending on where you're looking from. I'll spend another hour or two, but really hope I don't have to read all of the Celluloid source code. ...I should probably figure out what this gem actually does :P
CCing @tarcieri, b/c it's not improbable that more people are going to hit a similar thing and report an issue like this.
The main thread acquires the mutex lock, but the code it is running instantiates a Celluloid actor (think that's the right word) which puts it to sleep and then winds up trying to acquire the same lock.
(note: low level explanation is down below)
1. The test calls subject, which locks a mutex while it calculates the value (to facilitate lazy but threadsafe initialization).
2. Within the subject block, it instantiates the class it is testing (it's the subject, after all!).
3. This causes Celluloid to put the main thread to sleep while it is initializing the object (actor?).
4. The initialization runs on a Celluloid thread, which calls back into another memoized helper (meanwhile, the main thread is still asleep inside subject, waiting for it to return a value).
5. That helper tries to acquire the same mutex and sleeps, waiting for main to release the lock. Deadlock.
To work around it, eager load the connection so that it doesn't go into the mutex from the Celluloid thread. This can be done in one of two ways:
Pre-fetch the value (placing it first, b/c it's the best of the easy solutions)
let!(:connection) do
double('connection')
end
Use a local variable instead of calling the lazy let block.
subject do
c = connection
expect(Octokit::Client).to receive(:new) { c }
described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end
A third option is to turn off threadsafe mode globally:
config.threadsafe = false
Unfortunately, there is not a way to do this on a per-spec basis.
Low level explanation: from the main thread, we enter the spec here, which calls subject
it "does not start with Ruby object notation" do
expect(subject.to_s).to_not start_with("#<")
end
We enter the definition of subject here (it was overridden here), which calls __memoized.fetch_or_store(:subject)
define_method(name) { __memoized.fetch_or_store(name) { super(&nil) } }
We enter RSpec::Core::MemoizedHelpers::ThreadsafeMemoized#fetch_or_store here, which checks to see if the hash has a value for :subject. It doesn't, as this is the first call, so it calls @mutex.synchronize, acquires the lock, checks again to see if it has been calculated (in case it got locked out and slept while some other thread calculated the value), and then yields to the block.
def fetch_or_store(key)
@memoized.fetch(key) do # only first access pays for synchronization
@mutex.synchronize do
@memoized.fetch(key) { @memoized[key] = yield }
end
end
end
This takes us back to the block above, which calls super and hits the method that was defined from the block's body, taking it into the subject block here. That block mocks out Octokit::Client.new to call connection, then calls Berkshelf::API::CacheBuilder::Worker::Github.new.
subject do
expect(Octokit::Client).to receive(:new) { connection }
described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end
This takes it into Celluloid::ClassMethods#new here, because it was extended onto the class here when Celluloid was included here, and Worker::Github inherited this behaviour here. Inside Celluloid::ClassMethods#new, it gets an instance of Celluloid::Proxy::Cell here and sends it :initialize.
def new(*args, &block)
proxy = Cell.new(allocate, behavior_options, actor_options).proxy
proxy._send_(:initialize, *args, &block)
proxy
end
It delegates initialize to method_missing here, which puts the work of calling the real initialize method into some sort of queue and then asks for the value here, which ultimately puts the main thread to sleep here.
def response
Celluloid.suspend(:callwait, self)
end
def value
response.value
end
I think a pool eventually pulls it out here, or maybe here (I had followed it for a while, but wound up getting lost again, and don't feel like going through it again). At this point, it is in one of Celluloid's threads.
thread = Thread.new do
while proc = queue.pop
proc.call
# ...
end
end
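A stripped-down model of this call/response pattern (toy class names, not Celluloid's real API) shows why the calling thread sleeps until the worker thread produces a result:

```ruby
# Toy actor proxy: method calls become messages on a queue, executed by a
# background thread, while the caller blocks waiting for the response.
class ToyActorProxy
  def initialize(target)
    @target = target
    @mailbox = Queue.new
    @worker = Thread.new do
      while (message = @mailbox.pop)
        name, args, response = message
        response << @target.send(name, *args) # the real method runs here
      end
    end
  end

  def method_missing(name, *args)
    response = Queue.new
    @mailbox << [name, args, response]
    response.pop # the caller sleeps here, like Celluloid.suspend(:callwait, ...)
  end
end

class Greeter
  def greet(name)
    "hi #{name}"
  end
end

puts ToyActorProxy.new(Greeter.new).greet("there") # => "hi there"
```

The key point for this bug: anything the target's method does (such as calling back into a memoized helper) runs on the worker thread, not on the thread that made the call.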
This eventually winds up calling into Berkshelf::API::CacheBuilder::Worker::Github#initialize here:
def initialize(options = {})
@connection = Octokit::Client.new(access_token: options[:access_token], auto_paginate: true,
api_endpoint: options[:api_endpoint], web_endpoint: options[:web_endpoint],
connection_options: {ssl: {verify: options[:ssl_verify].nil? ? true : options[:ssl_verify]}})
# ...
end
That goes through some layers of RSpec::Mocks, until it eventually finds its way back to the block defined in the spec, which calls connection, here:
subject do
expect(Octokit::Client).to receive(:new) { connection }
described_class.new(organization: "opscode-cookbooks", access_token: "asdf")
end
The call to connection goes through the same path as the call to subject, until it eventually hits the memoized helper, tries to synchronize, and is put to sleep, because the mutex is locked by the main thread, which is itself still sleeping, waiting for the current thread to return with the value of subject.
def fetch_or_store(key)
@memoized.fetch(key) do # only first access pays for synchronization
@mutex.synchronize do
@memoized.fetch(key) { @memoized[key] = yield }
end
end
end
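The whole cycle can be condensed into a self-contained sketch (a toy re-implementation of the memoizer, not RSpec's actual classes); a Timeout is used here so the script reports the deadlock instead of hanging:

```ruby
require 'timeout'

# Toy re-implementation of ThreadsafeMemoized#fetch_or_store
# (double-checked locking around a plain Hash).
class Memoized
  def initialize
    @memoized = {}
    @mutex = Mutex.new
  end

  def fetch_or_store(key)
    @memoized.fetch(key) do # only first access pays for synchronization
      @mutex.synchronize do
        @memoized.fetch(key) { @memoized[key] = yield }
      end
    end
  end
end

m = Memoized.new
begin
  Timeout.timeout(1) do
    # The main thread holds the mutex while computing :subject; the spawned
    # thread needs the same mutex for :connection, so both sleep forever.
    m.fetch_or_store(:subject) do
      Thread.new { m.fetch_or_store(:connection) { :conn } }.join
    end
  end
rescue Timeout::Error
  puts "deadlocked, as expected"
end
```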
Frankly, this code is much better than the vast majority of code that I see. But I get quite excited about testing and design, so I'll share the things I thought about as I considered how you could solve this issue.
There are certain design issues that will make tests painful. In some regard, there will always be pain (if you want to do anything interesting), but the trick is to segregate the painful thing from the rest of the code.
In this case, the pain comes from... well, a lot of things, TBH, but within code you control, it comes from having to mock out singleton methods on global objects. There are any number of ways to get out of having to do this, but the one that seems apparent to me right now is passing the connection in rather than taking all of its arguments and constructing it via the constant:
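A sketch of what that might look like (class and parameter names here are hypothetical, not Berkshelf's actual API):

```ruby
# Hypothetical worker that receives its connection instead of building one
# from Octokit::Client inside #initialize.
class GithubWorker
  attr_reader :connection

  def initialize(connection:, organization:)
    @connection = connection # injected: the caller decides how it is built
    @organization = organization
  end
end

# A test can now pass any stand-in object; no stubbing of singleton methods
# on a global constant is needed.
fake_connection = Object.new
worker = GithubWorker.new(connection: fake_connection,
                          organization: "opscode-cookbooks")
```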
This is better because it absolves the code from having to know contextual details that it can't always know. It lets the user choose how the connection is instantiated, so if they need to give it more or different options, they can easily do so. It also allows them to choose what your code collaborates with, e.g. to introduce a mock connection in a test, or to talk to GitHub using some lib other than Octokit, if it is causing some issue that can't be worked around (we would create a wrapper that exposed the interface this code uses, but implemented it using the other lib).
Now, people generally have this opinion that the lib should have some slim interface, and thus dislike that a user might have to configure Octokit. I generally disagree, and would rather have access to the wiring than an initialization method that does all the work for me (I find such code usually has a high cost and low value when I write it, and usually prevents me from accomplishing my goal when I use other people's code that does this, as it makes incorrect assumptions about my context). But sometimes there is value in convenience like this. Fortunately, it's very easy to provide both; we just move that convenience code into a different constructor:
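A sketch of that idea, again with hypothetical names; the stand-in Octokit module is defined here only so the snippet runs without the real gem:

```ruby
# Stand-in for Octokit::Client, purely so this sketch is self-contained.
module Octokit
  class Client
    def initialize(access_token:)
      @access_token = access_token
    end
  end
end

class GithubWorker
  attr_reader :connection

  # Plain constructor: the caller supplies the connection (easy to test).
  def initialize(connection:, organization:)
    @connection = connection
    @organization = organization
  end

  # Convenience factory: does the Octokit wiring for callers who would
  # rather not think about it.
  def self.with_new_connection(organization:, access_token:)
    new(connection: Octokit::Client.new(access_token: access_token),
        organization: organization)
  end
end
```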
Interestingly, this one change fixes the issue as well, because the Celluloid thread is not created until .new is called, which is now separated from the Octokit instantiation.
The core insight to all of this is that unit test pain should be looked at as an indication that there are issues with the design. And the solution is almost always the same: take the cause of the pain and push it up the callstack. This was one of probably 3 big realizations that took me forever to discover, but ultimately made testing easy, valuable, and compelling.
Once we push this pain up the callstack, it becomes the test's responsibility to provide the connection, which makes it trivial to mock out without needing to stub methods on singleton objects.
Another thing to think about is that this is a wrapper around a 3rd party lib (Octokit), and is right at the boundary of the system. This is a big reason why we wind up with pain of this nature. It is good that we segregated it in this one worker, so that the rest of the codebase doesn't have to deal with it, but notice how annoying it is to test: literally every method we call on it is mocked out.
We have very low confidence that it actually works, because we don't control Octokit, and any change to Octokit could break berkshelf while our tests still pass, because our mocked methods still behave as expected. This is why people often say to avoid mocking code you don't own. Ideally, Octokit would provide us a mock client we could use, but I glanced through their code and didn't see one (looks like they use VCR and webmock).
My general experience is that code this close to the boundaries is best tested at the integration level. There are simply too many reasons for it to change, and we control very few of them. We could introduce more abstraction, but there doesn't appear to be much value at this point: the problem code is already pulled out of the rest of the system and into this worker. The code in the worker deals almost entirely with wiring, not with business logic.
So, probably something like VCR would be a good choice to increase the confidence in the integration code and reduce the volatility.
I probably wouldn't put in the effort to do this right now, given that you can also get it working by just adding a ! to a let block, but if it winds up being flaky and a total headache, then it's probably worth reworking their tests and setup.
Wow, amazing sleuthing @JoshCheek!
I think a simpler fix is to change:
expect(Octokit::Client).to receive(:new) { connection }
...to:
expect(Octokit::Client).to receive(:new).and_return(connection)
This should cause connection to be eagerly evaluated, before the celluloid actor puts the thread to sleep.
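The difference is when connection gets evaluated: a block implementation runs at call time, on whichever thread ends up invoking the mocked method, while an argument to and_return is evaluated immediately, on the thread that sets up the expectation. A plain-Ruby sketch of that distinction (without rspec-mocks itself):

```ruby
main = Thread.current

# Eager (like `.and_return(connection)`): the argument expression is
# evaluated right away, on the thread that sets up the expectation.
eager_thread = Thread.current

# Lazy (like `receive(:new) { connection }`): the block body runs only when
# the mocked method is finally called -- here, on a background thread.
lazy_thread = nil
implementation = -> { lazy_thread = Thread.current }
Thread.new { implementation.call }.join

puts eager_thread == main # => true
puts lazy_thread == main  # => false
```

That is exactly why the eager form sidesteps the deadlock: the memoized helper is computed while the main thread still holds (or can take) the lock.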
Is there something we can do in RSpec to either work around this issue or detect it and warn the user or something?
cc @digitalextremist @ioquatix
Wow, amazing sleuthing @JoshCheek!
I think a simpler fix is to change:
expect(Octokit::Client).to receive(:new) { connection }
...to:
expect(Octokit::Client).to receive(:new).and_return(connection)
This should cause connection to be eagerly evaluated, before the celluloid actor puts the thread to sleep.
Aye, good call.
Is there something we can do in RSpec to either work around this issue or detect it and warn the user or something?
Maybe. I doubt we could tell them what the right thing to do is, but it might be possible to detect the recursive deadlock and give them enough information to resolve it themselves. If the thread is going to ask for the lock from the ReentrantMutex, and the lock is held by one of the thread's ancestors, then we can at least know the deadlock will occur, and possibly give them some useful info about the context.
It's then a question of whether we can know a thread's ancestor threads (I don't see any way to get this info), how expensive it is to track / query, and whether we can get enough information to be useful (ie get the callstacks of the relevant threads).
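For illustration, here is one possible (entirely hypothetical) shape for that detection idea, assuming threads record their parent at spawn time, since Ruby doesn't expose ancestor threads for us:

```ruby
# Hypothetical lock that refuses to block when an ancestor thread already
# holds it, raising instead of silently deadlocking.
class TrackedMutex
  def initialize
    @mutex = Mutex.new
    @owner = nil # note: read without synchronization; fine for a sketch
  end

  def synchronize
    ancestor = Thread.current
    while ancestor
      raise ThreadError, "deadlock: ancestor thread holds this lock" if ancestor == @owner
      ancestor = ancestor.thread_variable_get(:parent)
    end
    @mutex.synchronize do
      @owner = Thread.current
      begin
        yield
      ensure
        @owner = nil
      end
    end
  end
end

# Threads must record their parent at spawn time -- the part Ruby doesn't
# give us out of the box.
def spawn_tracked(&block)
  parent = Thread.current
  Thread.new do
    Thread.current.thread_variable_set(:parent, parent)
    block.call
  end
end
```

With this, the child thread in the deadlocking scenario would raise a ThreadError (with a backtrace pointing at the re-entrant helper call) instead of sleeping forever.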
Interestingly, a deadlock situation like this would normally raise fatal, which we could catch and then try to piece the situation together at that time (rather than every time a let block is calculated). I'm not really sure why this code doesn't; I assume Celluloid has another thread running somewhere.
Is this a ticket that should still be open?
Is this a ticket that should still be open?
I don't believe we've solved it yet.
I guess it depends on your definition of solved. I think we can put some amount of work in to inform a user when they hit this situation, but IDK if it is worth it, unless lots of users wind up hitting this. While I have a desire to get huffy and think "write better code", that doesn't really help anyone or negate the fact that it's systemic. Alternatively, maybe it's not systemic, and showing them how is fine. Anyway, if it pops up a lot, then probably trying to automate the conveyance of useful debugging information is worthwhile. If not, then pointers here and there are probably fine. Ping me next time it happens, I'll evaluate how effortful it would be to automate the deadlock detection (though I have a general opinion that communicating nuanced info like this is very difficult).
@tarcieri @JoshCheek I think I am running into a similar problem here: https://github.com/square/sqeduler/commit/12babd93e6148e7b1874c219790658f6252694c0. Threads deadlock if I don't eagerly load the inputs which are provided via let blocks.
Yeah, subject is made threadsafe, so when you call it, it locks other threads out while calculating the subject. However, within the subject calculation, a new thread is made and then joined, blocking the subject thread, which has the lock. Within that new thread, it accesses wait_time, which needs the lock to be calculated. Thus, they deadlock (subject waiting for the thread to finish, while the thread waits for subject to release the lock).
# Here is a minimal reproduction
# note that `subject` here is a red herring, it is really the same as a let-block
RSpec.describe do
subject { Thread.new { b }.join }
let(:b) { }
example { subject }
end
In your example, only wait_time needs the eager calculation. Probably a more coherent solution is to not do these calculations in subject, which doesn't serve any obvious purpose in my mind: I would expect the subject to be a worker that we're asserting against, but really it's an array of threads, because no one ever uses the return value. So really, subject is just being used as a method to run some code. Maybe give it a more descriptive name and use it like a normal method. I gave it a shot here: https://github.com/JoshCheek/sqeduler/commit/4f40857f587d487e1de255c58191ce83aab7b9b7
Thanks @JoshCheek! This makes a ton of sense after looking over the code last night. Thanks for the explanation.
So I'm thinking:
1. Update the documentation for threadsafe to explicitly call out the risk of deadlocks.
2. Remove the threadsafe config altogether next time we get the opportunity, and deprecate it now.
While I'm impressed and grateful for all the detailed work @JoshCheek put in to it (seriously - amazing!), the motivation doesn't seem quite strong enough to justify the added complexities? When I look at the counter example, just using let! or other non-lazy initialization seems pretty straightforward to me.
I don't want to leave this ticket open as is - it's not really clear what we'd fix inside RSpec.
Thoughts?
I'm on board.
Ideally https://github.com/rspec/rspec-core/pull/2340 would have worked, providing the user with some feedback about why they're hitting the issue. However, it's definitely a complex domain, so I'm unsure if it's possible ("complex" here means that small changes in my understanding lead to wildly different conclusions).
Oh, I think @myronmarston added the config, not me. Guess I should clarify whether we're talking about the configuration option or threadsafety as a feature?
Oh, I think @myronmarston added the config, not me.
I requested it, but you actually added it in #1858. At the time, the point of the config was to allow users to opt out of threadsafe let because the locking adds some overhead.
Guess I should clarify whether we're talking about the configuration option or threadsafety as a feature?
My read of @xaviershay's comment is that he's suggesting we deprecate threadsafe let as a feature.
The config is another matter; if we plan to remove the threadsafe let feature, it won't make sense to support the config option at that time. But for now, we can't really deprecate it, because the config option is the only way to opt in to what would be the standard (and only) behavior in the future (e.g. via config.threadsafe = false).
Anyhow, I too appreciate the great work @JoshCheek did on this but agree with @xaviershay's conclusion.
I opened an issue for this discussion so that we don't wind up hijacking this issue.
Note that this problem is much more far-reaching with Ruby 3, because Ruby 3 changed locking semantics to be per-fiber instead of per-thread with this commit: https://github.com/ruby/ruby/commit/3a3b19b2bba49e5d6f1cf13764eb6dd701397be9
This means that if threadsafe is true, any test subject that is based on Fibers and calls back into a memoized property (let, subject) will trigger a deadlock. I came up with what I think is a minimal executable test case:
# frozen_string_literal: true
RSpec.describe 'Ruby 3 dead locks' do
describe 'with fibers' do
let(:value) { 42 }
subject do
# At this point, the test thread (and fiber) hold the lock to memoize subject.
Fiber.new do
value # since the inner Fiber is now `current`, it will try to memoize the let block on the same lock
end.resume
end
it 'locks up' do
subject
end
end
end
Running this with Ruby 2 works:
$ rbenv shell 2.7.5
$ rspec ruby3_fiber_deadlock.rb
.
Finished in 0.00238 seconds (files took 0.06749 seconds to load)
1 example, 0 failures
With Ruby 3.0.4+, it fails:
$ rbenv shell 3.0.4
$ rspec ruby3_fiber_deadlock.rb
F
Failures:
1) Ruby 3 dead locks with fibers locks up
Failure/Error: value
ThreadError:
Attempt to unlock a mutex which is locked by another thread/fiber
# ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'
# ------------------
# --- Caused by: ---
# fatal:
# No live threads left. Deadlock?
# 1 threads, 1 sleeps current:0x00000000011fb080 main thread:0x00000000011fb080
# * #<Thread:0x0000000001266ff0 sleep_forever>
# rb_thread_t:0x00000000011fb080 native:0x00007f2a62680740 int:0
#
# ./ruby3_fiber_deadlock.rb:10:in `block (4 levels) in <top (required)>'
Finished in 0.10269 seconds (files took 0.05068 seconds to load)
1 example, 1 failure
Failed examples:
rspec ./ruby3_fiber_deadlock.rb:14 # Ruby 3 dead locks with fibers locks up
From a Ruby perspective, this is intended behavior; if a co-routine reads a value from memory and then suspends, then even on the same thread without locking primitives another co-routine may resume and mutate that value, invalidating assumptions the first co-routine may have about that value. So even in the absence of multi-threading this can be an issue.
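A single-threaded sketch of that point: a fiber reads shared state, suspends, and resumes with a stale view, with no threads involved at all:

```ruby
# One thread, two co-routines: the fiber's snapshot goes stale while it is
# suspended, because the main "co-routine" mutates the shared value.
counter = 0
reader = Fiber.new do
  snapshot = counter # read the value...
  Fiber.yield        # ...then suspend; other code runs now
  snapshot           # resume later: the snapshot is stale
end

reader.resume      # runs the fiber up to Fiber.yield
counter += 1       # mutate the value while the fiber is suspended
puts reader.resume # => 0, even though counter is now 1
```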
I am not sure what RSpec should do about that; I agree that the threadsafe switch may not have enough utility to warrant users running into these problems, as they are notoriously difficult to debug. Note as well that Ruby uses fibers in stdlib types such as Enumerator, which uses fibers to produce lazy streams of values. This is how we discovered this problem: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/102625
@xaviershay Aside from the consideration to deprecate threadsafe, should we also default it to false? It currently defaults to true, and considering that Ruby 2 will reach EOL next March, it sounds like the benefits of threadsafe might not outweigh its drawbacks.
threadsafe currently defaults to true
Ruby 2 will reach EOL next March
You just surprised me twice, @mttkay, and both are valid statements ([1], [2]).
The upcoming RSpec 4.0 is a perfect time to change the default.
However, without rushing with this decision, what are the implications of making this option opt-in instead of opt-out by default?
There is another amazing explanation by Josh.
I may be reading this code incorrectly, but it appears to me that the difference in counter.count can be explained by different threads using different instances of Counter, and in this case, it would make the total count off by a multiple of 1000.
In theory, an accessor defined by a non-threadsafe let could detect if it was called from somewhere outside the main thread/fiber. That would negate its performance benefits, but would let users know to add a :threadsafe flag on those examples that need it.
Another thing that doesn't seem related is parallel testing, as most of the current and WIP tools resort to sub-processes, not threads, for obvious reasons of mocking.
The upcoming RSpec 4.0 is a perfect time to change the default.
Yep, agree!
However, without rushing with this decision, what are the implications of making this option opt-in instead of opt-out by default? There is another amazing explanation from Josh in https://github.com/rspec/rspec-core/pull/1858.
This example, and discussion, is just another instance of the old pro-GIL and contra-GIL argument. MRI takes an opinionated approach and says that you should free developers from thinking about these problems and not require a lock in this case. Others disagree and say that providing mutual exclusion under presence of concurrency is not the VM's job, but the developer's job (or: a library's job -- there are many ways to ensure thread-safety that do not require you to use locking primitives directly. Examples are atomic CAS, actors, reactive streams, async/await etc.)
So, changing the default for threadsafe is somewhat equivalent to rspec-core taking a side in this argument. I am personally with camp-let-developers-decide, but that is just my personal opinion. Spinning out threads in test code sounds like an edge case; most Ruby-based test suites will not do that. Of course some will, and for these people the implication will be that their tests will become flaky and they will need to ensure access to shared memory is handled correctly in these tests.
Another way to look at this is to apply the pareto principle: which use case is rspec trying to optimize for? The 0.1% or so who write multi-threaded test implementations, or the 99.9% who don't?
Locking also creates overhead; for very large test suites like the one GitLab runs, there will be hundreds of thousands of locking operations due to synchronized let and subject blocks. I am still looking into how threadsafe affects our pipeline durations, but this almost certainly has an impact on performance.
Another thing that doesn't seem related is parallel testing, as most of the current and WIP tools resort to sub-processes, not threads, for obvious reasons of mocking.
Hm, is this even affected? My understanding of parallel testing is that a test suite is subdivided and distributed across worker threads, but let and subject memoization only applies at the per-test level. So you can still distribute test methods across multiple threads and CPUs without running into integrity issues; or are there cases where rspec memoizes variables that will in fact be read across multiple test methods? Happy to be corrected here since I am not familiar with rspec internals, I just stumbled into this by working on Ruby 3 compatibility.
We are not changing the default to false; we actually should remove the false option entirely in RSpec 4, per the original plan for its implementation. It was introduced originally as opt-in and, if my memory is correct, became the default in RSpec 3.
Without threadsafe mode, what happens is you can end up with different values memoized in different threads (and thus presumably fibers) depending on race conditions.
We should fix the implementation to better detect or prevent deadlocks.
Without threadsafe mode, what happens is you can end up with different values memoized in different threads (and thus presumably fibers) depending on race conditions.
Do you have an example of how that could happen when multiple fibers but only a single thread are involved? If that is true, then I agree it may be safer to maintain the threadsafe toggle. Otherwise, you're essentially optimizing for the minority of users and use cases that use multi-threaded test implementations and/or Ruby VMs that do not use a GIL.
@JonRowe Here is an executable test case that should demonstrate that the sample code posted in the original addition of threadsafe operates correctly at the Fiber level:
Updated the example to interleave the counter increments across fibers, which goes to show that even when switching between fibers for each call to increment, there are no integrity problems in the absence of synchronized let blocks:
# frozen_string_literal: true
require 'fiber'
RSpec.configure do |config|
config.threadsafe = false
end
class Counter
def initialize
@counter = 0
end
def increment
puts "#{Fiber.current}: incr"
@counter += 1
end
def count
@counter
end
end
RSpec.describe 'Consistent reads' do
describe 'with fibers' do
let(:counter) { Counter.new }
it 'increments counter consistently' do
fibers = 10.times.map do
Fiber.new do
1000.times do
counter.increment
Fiber.yield
end
end
end
while fibers.any?(&:alive?) do
fibers.each(&:resume)
end
expect(counter.count).to eq 10_000
end
end
end
This test passes for me in both Ruby 2 and Ruby 3 (MRI). This is because all fibers are serialized in the current test thread and execute sequentially. There are no race conditions.
The problem described in the original MR is, as far as I understand, specific to multi-threaded test code in Rubies that do not use a GIL. In other words, deciding to enable threadsafe by default means rspec is taking a stance to optimize for these implementations by default. That is certainly a valid opinion to have, but it seems to be at odds with how the vast majority of developers use and deploy Ruby apps?
Side note: this may of course change if someone uses a different FiberScheduler. I have not tested this with community-provided schedulers like https://brunosutic.com/blog/ruby-fiber-scheduler. None of these are currently part of MRI, however, so again this is a choice a developer makes. In that case, the developer should make sure their test code is properly synchronized.
Hi.
I am working on the Debian package for berkshelf-api. In Debian we have rspec-* 3.3.0, and I have a problem running the test suite because it hangs (on this line: https://github.com/berkshelf/berkshelf-api/blob/master/spec/unit/berkshelf/api/cache_builder/worker/github_spec.rb#L51, https://github.com/berkshelf/berkshelf-api/blob/master/spec/support/human_reaable.rb#L3), while it works perfectly with 2.99.0.
I started to bisect and found that the hangs are caused by rspec-core commit ffe00a1d4.