`abandoned_in_gc` not empty in MarkSweep

We observed this assertion error when running MarkSweep: https://github.com/mmtk/mmtk-ruby/actions/runs/9853731634/job/27204907289?pr=83

  | [2024[2024-07-09T08:43:46Z INFO  mmtk::util::heap::gc_trigger] [POLL] ms: Triggering collection (10987/10559 pag[2024-07-09T08:43:46Z INFO  mmtk::scheduler::scheduler] End of GC (1016/12098 pages, took 132 ms)
  | bootstraptest.test_io.rb_90_693.rb:16:in 'IO.copy_stream': Interrupt
  |     from bootstraptest.test_io.rb_90_693.rb:16:in '<main>'
  |  MMTk GC thread panicked.  This is a bug.
  | panicked at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:150:13:
  | assertion failed: self.available[i].is_empty()
  |    0: mmtk_ruby::handle_gc_thread_panic
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/lib.rs:101:14
  |    1: mmtk_ruby::set_panic_hook::{{closure}}
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/lib.rs:130:13
  |    2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
  |    3: std::panicking::rust_panic_with_hook
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
  |    4: std::panicking::begin_panic_handler::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:651:13
  |    5: std::sys_common::backtrace::__rust_end_short_backtrace
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
  |    6: rust_begin_unwind
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
  |    7: core::panicking::panic_fmt
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
  |    8: core::panicking::panic
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:144:5
  |    9: mmtk::policy::marksweepspace::native_ms::global::AbandonedBlockLists::assert_empty
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:150:13
  |   10: mmtk::policy::marksweepspace::native_ms::global::MarkSweepSpace<VM>::prepare
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/policy/marksweepspace/native_ms/global.rs:341:9
  |   11: <mmtk::plan::marksweep::global::MarkSweep<VM> as mmtk::plan::global::Plan>::prepare
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/plan/marksweep/global.rs:60:9
  |   12: <mmtk::scheduler::gc_work::Prepare<C> as mmtk::scheduler::work::GCWork<<C as mmtk::scheduler::work::GCWorkContext>::VM>>::do_work
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/gc_work.rs:60:9
  |   13: mmtk::scheduler::work::GCWork::do_work_with_stat
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/work.rs:45:9
  |   14: mmtk::scheduler::worker::GCWorker<VM>::run
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/scheduler/worker.rs:244:13
  |   15: mmtk::memory_manager::start_worker
  |              at /home/runner/.cargo/git/checkouts/mmtk-core-3306bdeb8eb4322b/6cae51c/src/memory_manager.rs:492:5
  |   16: <mmtk_ruby::collection::VMCollection as mmtk::vm::collection::Collection<mmtk_ruby::Ruby>>::spawn_gc_thread::{{closure}}
  |              at /home/runner/work/mmtk-ruby/mmtk-ruby/git/mmtk-ruby/mmtk/src/collection.rs:54:21
  |   17: std::sys_common::backtrace::__rust_begin_short_backtrace
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:155:18
  |   18: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:529:17
  |   19: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panic/unwind_safe.rs:272:9
  |   20: std::panicking::try::do_call
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:554:40
  |   21: __rust_try
  |   22: std::panicking::try
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:518:19
  |   23: std::panic::catch_unwind
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panic.rs:142:14
  |   24: std::thread::Builder::spawn_unchecked_::{{closure}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:528:30
  |   25: core::ops::function::FnOnce::call_once{{vtable.shim}}
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ops/function.rs:250:5
  |   26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  |   27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  |   28: std::sys::pal::unix::thread::Thread::new::thread_start
  |              at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys/pal/unix/thread.rs:108:17
  |   29: <unknown>
  |   30: <unknown>
  | 
    megacontent-copy_stream

This is reproducible on macos if you set the max heap lower. If you're on macos this will segv almost every time. I turned off the rust log so we don't get other failures due to stdout.

make -j12 test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_LOG=none RUST_BACKTRACE=1

I then narrowed it down to two tests by running not in parallel. I can reproduce this with a combination of test/ruby/test_objectspace.rb and test/fiddle/test_c_struct_entry.rb

make test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk --seed=65183" RUN_OPTS="--mmtk --mmtk-plan=MarkSweep --mmtk-max-heap=16GiB" RUST_BACKTRACE=1 RUBY_CODESIGN=1 TESTS="test/ruby/test_objectspace.rb test/fiddle/test_c_struct_entry.rb"

I've narrowed this down to two tests those files. TestObjectSpace#test_finalizer_thread_raise and Fiddle::TestCStructEntity#test_free_with_func. I then narrowed that down even more - all the second test needs to do to trigger this crash is call GC.start.

Here's a really small repro script. It doesn't actually require lowering the memory to repro now. ./ruby --mmtk-plan=MarkSweep test.rb is enough.

def test_finalizer_thread_raise
  Thread.new do
  end
  GC.start
end

def test_free_with_func
  GC.start
end

test_finalizer_thread_raise
test_free_with_func

The bug is in the Ruby binding. The return value of ActivePlan::number_of_mutators() is sometimes different from the number of mutators returned from ActivePlan::mutators().

ActivePlan::number_of_mutators() simply returns GET_VM()->ractor.main_ractor->threads.cnt
ActivePlan::mutators() will iterate through all threads in main_ractor->threads.set, and visit only threads that have non-NULL th->mutator.

Ruby caches native threads, and threads->set includes both actively running Ruby threads and threads that finished executing Ruby threads and are waiting to be reused.

Function thread_create_core calls rb_ractor_living_threads_insert to increment r->threads.cnt just before creating the native thread, and the newly created native thread will not set th->mutator until thread_start_func_2 calls rb_mmtk_bind_mutator. So if the main thread triggers GC using GC.start, it will see threads.cnt == 2, but the thread has not acquired a mutator struct from MMTk-core, yet. That'll trigger the bug.

The proper fix is to use the same loop in the current rb_mmtk_get_mutators for both counting the number of mutators and visiting each mutator.

@eileencodes The PR https://github.com/mmtk/ruby/pull/84 for the ruby repo fixes the bug. It has been merged into the dev/mmtk-overrides-default branch. I also cherry-picked it onto the mmtk branch of the https://github.com/mmtk/ruby repo so that you can merge or rebase https://github.com/mmtk/ruby/pull/80 with the mmtk branch.

The PR https://github.com/mmtk/mmtk-ruby/pull/92 for the mmtk-ruby repo is not necessary for fixing the bug in ActivePlan::number_of_mutators. It affects the ruby revision the mmtk-core CI is testing against, and updates to a newer mmtk-core revision that contains assertions that can detect this kind of bug. But currently we are having some problem with a random crash in cargo clippy when running the CI for Darwin (Mac OS). So I merged the ruby repo PR first. If you have any insights about cargo clippy crashing on Mac OS, please share your experiences with us because we have been having this kind of problem for a while. See discussion at: https://mmtk.zulipchat.com/#narrow/stream/262673-mmtk-core/topic/clippy.20failing.20on.20darwin/near/455616980

mmtk / mmtk-ruby

`abandoned_in_gc` not empty in MarkSweep #84