This is reproducible on macOS if you set the max heap lower; it will segfault almost every time. I turned off the Rust log so we don't get other failures due to stdout.
make -j12 test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_LOG=none RUST_BACKTRACE=1
I then narrowed it down to two test files by running without parallelism. I can reproduce this with a combination of test/ruby/test_objectspace.rb and test/fiddle/test_c_struct_entry.rb:
make test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk --mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_BACKTRACE=1 RUBY_CODESIGN=1 TESTS="test/ruby/test_objectspace.rb test/fiddle/test_c_struct_entry.rb"
Within those files I've narrowed it down to two tests: TestObjectSpace#test_finalizer_thread_raise and Fiddle::TestCStructEntity#test_free_with_func. I then narrowed that down even further: all the second test needs to do to trigger the crash is call GC.start.

Here's a really small repro script. It no longer requires lowering the max heap to reproduce; ./ruby --mmtk-plan=MarkSweep test.rb is enough.
def test_finalizer_thread_raise
  # create a short-lived thread, then force a GC
  Thread.new do
  end
  GC.start
end

def test_free_with_func
  # a second GC is all this needs to trigger the crash
  GC.start
end

test_finalizer_thread_raise
test_free_with_func
The bug is in the Ruby binding. The return value of ActivePlan::number_of_mutators() is sometimes different from the number of mutators returned by ActivePlan::mutators(). ActivePlan::number_of_mutators() simply returns GET_VM()->ractor.main_ractor->threads.cnt, while ActivePlan::mutators() iterates through all threads in main_ractor->threads.set and visits only threads that have a non-NULL th->mutator. Ruby caches native threads, and threads.set includes both actively running Ruby threads and native threads that have finished running Ruby threads and are waiting to be reused.
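To make the mismatch concrete, here is a small self-contained C model of the bookkeeping described above. The names model_thread, model_vm, threads_cnt, and so on are invented stand-ins for illustration; this is not the binding source.

#include <stdio.h>
#include <stddef.h>

/* Invented model of the binding's thread bookkeeping (illustration only). */
struct model_thread {
    void *mutator;              /* NULL until the thread binds an MMTk mutator */
    struct model_thread *next;  /* stands in for main_ractor->threads.set */
};

struct model_vm {
    size_t threads_cnt;         /* stands in for main_ractor->threads.cnt */
    struct model_thread *threads;
};

/* Mirrors the buggy number_of_mutators(): trust the thread count. */
static size_t number_of_mutators(const struct model_vm *vm)
{
    return vm->threads_cnt;
}

/* Mirrors mutators(): visit only threads with a bound mutator. */
static size_t visit_mutators(const struct model_vm *vm)
{
    size_t visited = 0;
    for (const struct model_thread *th = vm->threads; th != NULL; th = th->next) {
        if (th->mutator != NULL) visited++;
    }
    return visited;
}

int main(void)
{
    int bound = 1;
    /* The main thread has a mutator; the freshly created thread does not yet. */
    struct model_thread fresh = { NULL, NULL };
    struct model_thread main_th = { &bound, &fresh };
    struct model_vm vm = { 2, &main_th };  /* threads.cnt was already bumped to 2 */

    printf("number_of_mutators() = %zu\n", number_of_mutators(&vm)); /* prints 2 */
    printf("mutators() visited   = %zu\n", visit_mutators(&vm));     /* prints 1 */
    return 0;
}

Running the model prints 2 registered threads but only 1 visited mutator, which is exactly the disagreement that the newer mmtk-core assertions mentioned below can detect.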
The function thread_create_core calls rb_ractor_living_threads_insert to increment r->threads.cnt just before creating the native thread, but the newly created native thread will not set th->mutator until thread_start_func_2 calls rb_mmtk_bind_mutator. So if the main thread triggers GC with GC.start in that window, it sees threads.cnt == 2 even though the new thread has not yet acquired a mutator struct from mmtk-core, and that triggers the bug.
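Written out as a rough timeline (simplified; only the calls mentioned above are shown), the window looks like this:

/* main thread (running Thread.new)          new native thread
 * --------------------------------          -----------------
 * thread_create_core()
 *   rb_ractor_living_threads_insert()
 *     threads.cnt: 1 -> 2
 *   create native thread  .............>    thread_start_func_2()
 * GC.start                                    rb_mmtk_bind_mutator()
 *   number_of_mutators() == 2                   th->mutator is set
 *   mutators() yields only 1 mutator
 *   (mismatch: GC runs before the new
 *    thread has bound its mutator)
 */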
The proper fix is to use the same loop that rb_mmtk_get_mutators currently uses for both counting the number of mutators and visiting each mutator.
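In terms of the toy model above, the fix direction looks roughly like this (again just a sketch under the same invented names, not the actual patch): derive the count from the same traversal and condition that the visit loop uses, so the two values can never disagree.

/* Fix sketch: count with the same predicate that visit_mutators() uses. */
static size_t number_of_mutators_fixed(const struct model_vm *vm)
{
    size_t count = 0;
    for (const struct model_thread *th = vm->threads; th != NULL; th = th->next) {
        if (th->mutator != NULL) count++;  /* same condition as the visit loop */
    }
    return count;
}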
@eileencodes The PR https://github.com/mmtk/ruby/pull/84 for the ruby repo fixes the bug. It has been merged into the dev/mmtk-overrides-default branch. I also cherry-picked it onto the mmtk branch of the https://github.com/mmtk/ruby repo so that you can merge or rebase https://github.com/mmtk/ruby/pull/80 with the mmtk branch.
The PR https://github.com/mmtk/mmtk-ruby/pull/92 for the mmtk-ruby repo is not necessary for fixing the bug in ActivePlan::number_of_mutators. It changes the ruby revision the mmtk-core CI tests against and updates to a newer mmtk-core revision that contains assertions able to detect this kind of bug. But we are currently having a problem with random crashes in cargo clippy when running the CI for Darwin (macOS), so I merged the ruby repo PR first. If you have any insights about cargo clippy crashing on macOS, please share your experience with us, because we have been having this kind of problem for a while. See the discussion at: https://mmtk.zulipchat.com/#narrow/stream/262673-mmtk-core/topic/clippy.20failing.20on.20darwin/near/455616980
We observed this assertion error when running MarkSweep: https://github.com/mmtk/mmtk-ruby/actions/runs/9853731634/job/27204907289?pr=83