oracle / truffleruby

A high performance implementation of the Ruby programming language, built on GraalVM.
https://www.graalvm.org/ruby/
Other
3k stars 183 forks source link

Autoloader racing on multiple threads #3040

Open itarato opened 1 year ago

itarato commented 1 year ago

Autoloader use with multiple threads cause premature use of the class.

Example

When executing mutliple Rails tests with the debugger: jt --use jvm-ce ruby --jdebug -S ./bin/rails test test/folder the execution fails with:

#<Thread:0x390d8 @active_support_execution_state={:active_support_execution_context=>{}} /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/minitest-5.18.0/lib/minitest/parallel.rb:28 run> terminated with exception (report_on_exception is true):
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:54:in `block (2 levels) in <class:Railtie>': undefined method `reset_all' for ActiveSupport::CurrentAttributes:Class (NoMethodError)
Did you mean?  reset_callbacks
    from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `instance_exec'
    from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:445:in `block in make_lambda'
    from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:199:in `block (2 levels) in halting'
    from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:687:in `block (2 levels) in default_terminator'
    from <internal:core> core/throw_catch.rb:36:in `catch'
    from /home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/callbacks.rb:686:in `block in default_terminator'

The test executor spawns threads to spread the test execution, and most threads fail with this error (not all). What's happening is the very first thread is referring to ActiveSupport::CurrentAttributes, an unknown constant. The autoload definition in https://github.com/rails/rails/blob/2bfb6565a4996a19065e03300d9061e9eb265d92/activesupport/lib/active_support.rb#L42 delegates the resolution to Zeitwerk and it starts loading the class (snippet from the first thread load backtrace):

/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `/'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/current_attributes.rb:4:in `<top (required)>'
<internal:core> core/kernel.rb:234:in `gem_original_require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/zeitwerk-2.6.8/lib/zeitwerk/kernel.rb:38:in `require'
/home/itarato/truffleruby-ws/graal/sdk/mxbuild/linux-amd64/GRAALVM_A33290E019_JAVA17/graalvm-a33290e019-java17-23.1.0-dev/languages/ruby/lib/gems/gems/activesupport-7.0.4.3/lib/active_support/railtie.rb:49:in `block (2 levels) in <class:Railtie>'

Meanwhile the other threads will just use the class as if it's already loaded - which is not the case, the class's Ruby code is still executing (function definitions) so some threads won't have all the methods available: hence the error. This was confirmed by logging file execution start/end.

Reproduce

Minimum example

Unfortunately reproducing it with a minimum example was tricky. It seems using Module#autolad vs ActiveSupport::Autoload#autoload was the same. I couldn't replicate it with --jvm-ce but saw something similar with --native.

Code

# Bait file.

class Foo
    sleep(0.1)

    def bar; end
end
require("thread")

autoload("Foo", "./foo.rb")
thread = [nil] * 8

thread[0] = Thread.new { Foo.new.bar }
thread[1] = Thread.new { Foo.new.bar }
thread[2] = Thread.new { Foo.new.bar }
thread[3] = Thread.new { Foo.new.bar }
thread[4] = Thread.new { Foo.new.bar }
thread[5] = Thread.new { Foo.new.bar }
thread[6] = Thread.new { Foo.new.bar }
thread[7] = Thread.new { Foo.new.bar }

thread.each(&:join)

Example output (6 failure, 2 success):

#<Thread:0x148 try_break_autoload.rb:26 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:26:in `block in <main>': undefined method `bar' for #<Foo:0x158> (NoMethodError)
#<Thread:0x138 try_break_autoload.rb:25 run> terminated with exception (report_on_exception is true):
try_break_autoload.rb:25:in `block in <main>': undefined method `bar' for #<Foo:0x168> (NoMethodError)

Observation

eregon commented 1 year ago

I think you are seeing https://github.com/oracle/truffleruby/issues/2431. That issue has more details on what's currently done and what's missing. The problem is we do synchronize so only one thread autoloads, but we also publish the constant as soon as it's set which means other threads can see it at that point, before the file is executed. I guess when the autoload constant is set we should consider it set only for the current thread doing the file loading, and then publish it when the current thread finishes loading that file.

Thanks for the detailed issue and reproduction.

itarato commented 1 year ago

Didn't get too far, still got confused by the nodes layout in execution. I've used the following code for debugging (safe enough to avoid racing but still threaded so the same-thread-locking guards can be tested):

require("thread")

autoload("Foo", "./foo.rb")

Foo.new.bar

Thread.new { Foo.new.bar }.join

At the first Foo.new.bar on the main thread it loads LookupConstantWithLexicalScopeNode which calls GetConstantNode. This is where Object has the constant field for Foo with autoload config. Here it starts the require: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L124 This goes down the stack and calls GetConstantNode again (couldn't figure out why it needs a second one). This eventually replaces the constant entry for Object: https://github.com/oracle/truffleruby/blob/c1702b035af3f35b6080e3179d456407865c4291/src/main/java/org/truffleruby/core/module/ModuleFields.java#L487. And now this can return to the parent GetConstantNode and resolve the autoload lock: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L129

Now when we go to the second lookup for Thread.new { Foo.new.bar }.join the LookupConstantWithLexicalScopeNode returns the new constant declaration for Foo which will just use the final Foo: https://github.com/oracle/truffleruby/blob/7da015fbb94d880d19e1aca03d0343f7c1d570c9/src/main/java/org/truffleruby/language/constants/GetConstantNode.java#L74

Only had vague ideas how to approach it:

I'm looking for some guidance on a better approach here.

eregon commented 1 year ago

One idea is for the constant being autoloaded to not set it yet but either store it in a new field of AutoloadConstant or maybe in the RubyConstant#value of the autoload RubyConstant. And then we can probably reuse the existing autoload locks, so as long as the other thread sees it's not set yet it will try to acquire the autoload lock and wait, so that's fine. And we have AutoloadConstant#isAutoloadingThread() to know which thread is allowed to see the value.

eregon commented 1 year ago

A reliable reproducer:

# main.rb
autoload :C, "#{__dir__}/c.rb"

$inside = false

t = Thread.new { p C.new.foo }
Thread.pass until $inside
p C.new.foo

t.join
# c.rb
class C
  def bar
  end

  $inside = true
  sleep 1

  def foo
    42
  end
end
eregon commented 1 year ago

I tried an edge case, what if while loading the file we create another thread which tries to access the being-autoloaded constant? Answer: it deadlocks on CRuby:

# c.rb
class C
  def bar
  end

  Thread.new { p C }.join

  $inside = true
  sleep 1

  def foo
    42
  end
end

So it's simple, we don't need to handle that case, it can just hang.

If we simplify main.rb we can get CRuby to print the deadlock:

# main.rb
autoload :C, "#{__dir__}/c.rb"
p C.new.foo

gives

/home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `join': No live threads left. Deadlock? (fatal)
2 threads, 2 sleeps current:0x0000000001bdfae0 main thread:0x0000000001b03060
* #<Thread:0x00007f26bfe1fcc0 sleep_forever>
   rb_thread_t:0x0000000001b03060 native:0x00007f26bff40740 int:0

* #<Thread:0x00007f26bfdd8eb0 /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5 sleep_forever>
   rb_thread_t:0x0000000001bdfae0 native:0x00007f26af30b6c0 int:0
    depended by: tb_thread_id:0x0000000001b03060

    from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:5:in `<class:C>'
    from /home/eregon/code/truffleruby-ws/truffleruby/autoload/c.rb:1:in `<top (required)>'
    from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
    from <internal:/home/eregon/.rubies/ruby-3.1.3/lib/ruby/3.1.0/rubygems/core_ext/kernel_require.rb>:85:in `require'
    from main.rb:2:in `<main>'
eregon commented 1 year ago

With Fibers on 3.2, with Fiber.new { p C }.resume instead of Thread.new { p C }.join it's:

c.rb:5:in `block in <class:C>': deadlock; lock already owned by another fiber belonging to the same thread (ThreadError)

So just an early error but still a deadlock. On 3.1 it seems to use a per-Ruby-Thread lock and it "works", but let's use 3.2 semantics it's far easier for TruffleRuby.

eregon commented 1 year ago

I have a fix for this in https://github.com/oracle/truffleruby/pull/3078, I'm still polishing it a bit. Notably there is the case of module M; autoload :OpenSSL, "openssl"; p OpenSSL; end I need to review.

I also noticed we still have the notion of "undefined constants", but this should have been removed in CRuby 3.1, so now we should be able to clean that up as well. But probably best done separately, it's already quite a complex change as-is.