wks commented 1 month ago

Error happens when running the TestAutoload#test_autoload_fork test case in test-all. This is not always reproducible, but is quite often reproducible on the dev/mmtk-overrides-default branch, with the StickyImmix plan.

The bug manifests in multiple ways.

"Object {} does not have HAS_MOVED_GIVTBL flag or original givtbl": This is the most often manifestation.
"[BUG] rb_gc_mark(): unknown data type 0x10(0x0000020100327200) non object": This also happens.

wks commented 1 month ago

There is a very high chance to trigger an error when running the following command with the current master branch of mmtk-ruby and the dev/mmtk-overrides-default branch of ruby, and using StickyImmix and without specifying heap size:

MMTK_THREADS=1 taskset -c 0 '/home/wks/projects/mmtk-github/ruby/build-debug/ruby' '--mmtk-plan=StickyImmix' '-r../tool/lib/_tmpdir' '../test/runner.rb' '--ruby=./miniruby -I../lib -I. -I.ext/common  ../tool/runruby.rb --extout=.ext  -- --mmtk-plan=StickyImmix' '--excludes-dir=../test/.excludes' '--name=!/memory_leak/' '-v' '../test/ruby/test_autoload.rb' --seed=32744

But unlike in the last post (about givtbl or unknown data type), the error manifests as visiting a field pointing to an address that does not have the VO bit set (not allocated, or already recycled).

Destination is not an MMTk object. Src: 0x200ffc06c10 dst: 0x201001d4a78

It won't trigger the bug if I add -n /fork/ to run only test_autoload_fork. It means it requires some data structures to be initialized in some specific ways in order to trigger the bug. It also seems that the bug can only be triggered if GC happens just at the right moment (when auto-loading).

test_autoload_fork creates new threads and forks. We need at least two threads to trigger the bug. That is, replacing the 3.times do in test_autoload_fork with 2.times do will still trigger the bug. But 1.times do won't trigger it.

thrs << Thread.new { AutoloadTest && nil } can be commented out and still trigger the bug.

At least one child process needs to evaluate AutoloadTest. Other child processes only needs to exist (by executing something like sleep 2).

I have currently observed the source object sometimes being iseq and sometimes being T_ICLASS. Here are two stack traces:

panicked at src/scanning.rs:42:13:
Destination is not an MMTk object. Src: 0x200ffca1e58 dst: 0x2010034d3c0
[2024-08-09T03:46:34Z INFO  mmtk::plan::sticky::immix::global] Full heap GC
[2024-08-09T03:46:34Z INFO  mmtk::policy::immix::defrag] Defrag: true
   0: mmtk_ruby::handle_gc_thread_panic
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/lib.rs:102:14
   1: mmtk_ruby::set_panic_hook::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/lib.rs:131:13
   2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
   3: std::panicking::rust_panic_with_hook
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
   4: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
   5: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
   6: rust_begin_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
   7: core::panicking::panic_fmt
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
   8: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/scanning.rs:42:13
   9: mmtk_ruby::abi::ObjectClosure::c_function_registered
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/abi.rs:217:9
  10: rb_mmtk_call_object_closure
             at ./../mmtk_support.c:503:12
  11: rb_mmtk_mark_and_move
             at ./../mmtk_support.c:576:40
  12: rb_gc_impl_mark_and_move
             at ./../gc/default.c:5121:9
  13: rb_gc_mark_and_move
             at ./../gc.c:2456:5
  14: rb_iseq_mark_and_move_each_value
             at ./../iseq.c:278:17
  15: rb_iseq_mark_and_move
             at ./../iseq.c:344:9
  16: rb_imemo_mark_and_move
             at ./../imemo.c:455:9
  17: rb_gc_mark_children
             at ./../gc.c:2588:9
  18: gc_mark_children
             at ./../gc/default.c:5261:5
  19: rb_mmtk_mark_children
             at ./../gc/default.c:10574:5
  20: rb_mmtk_scan_object_ruby_style
             at ./../mmtk_support.c:537:5
  21: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/scanning.rs:59:17
  22: mmtk_ruby::abi::ObjectClosure::set_temporarily_and_run_code
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/abi.rs:201:22
  23: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/scanning.rs:56:9
  24: mmtk::scheduler::gc_work::ScanObjectsWork::do_work_common::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/gc_work.rs:873:21
  25: <mmtk::scheduler::gc_work::ProcessEdgesWorkTracerContext<E> as mmtk::vm::scanning::ObjectTracerContext<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>::VM>>::with_tracer
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/gc_work.rs:303:22
  26: mmtk::scheduler::gc_work::ScanObjectsWork::do_work_common
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/gc_work.rs:869:13
  27: <mmtk::scheduler::gc_work::PlanScanObjects<E,P> as mmtk::scheduler::work::GCWork<<E as mmtk::scheduler::gc_work::ProcessEdgesWork>::VM>>::do_work
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/gc_work.rs:1052:9
  28: mmtk::scheduler::work::GCWork::do_work_with_stat
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/work.rs:45:9
  29: mmtk::scheduler::worker::GCWorker<VM>::run
             at /home/wks/projects/mmtk-github/mmtk-core/src/scheduler/worker.rs:244:13
  30: mmtk::memory_manager::start_worker
             at /home/wks/projects/mmtk-github/mmtk-core/src/memory_manager.rs:492:5
  31: <mmtk_ruby::collection::VMCollection as mmtk::vm::collection::Collection<mmtk_ruby::Ruby>>::spawn_gc_thread::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/collection.rs:54:21
  32: std::sys_common::backtrace::__rust_begin_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:155:18
  33: std::thread::Builder::spawn_unchecked_::{{closure}}::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:529:17
  34: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panic/unwind_safe.rs:272:9
  35: std::panicking::try::do_call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:554:40
  36: __rust_try
  37: std::panicking::try
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:518:19
  38: std::panic::catch_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panic.rs:142:14
  39: std::thread::Builder::spawn_unchecked_::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/thread/mod.rs:528:30
  40: core::ops::function::FnOnce::call_once{{vtable.shim}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/ops/function.rs:250:5
  41: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  42: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2015:9
  43: std::sys::pal::unix::thread::Thread::new::thread_start
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys/pal/unix/thread.rs:108:17
  44: <unknown>
  45: <unknown>

And this (with some frames omitted):

ERROR: An MMTk GC thread panicked.  This is a bug.  PID: 444536
panicked at src/scanning.rs:51:13:
Destination is not an MMTk object. Src: 0x200ffc06c10 dst: 0x20100219440
   0: mmtk_ruby::handle_gc_thread_panic
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/lib.rs:102:14
   1: mmtk_ruby::set_panic_hook::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/lib.rs:131:13
   2: <alloc::boxed::Box<F,A> as core::ops::function::Fn<Args>>::call
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/alloc/src/boxed.rs:2029:9
   3: std::panicking::rust_panic_with_hook
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:785:13
   4: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:659:13
   5: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/sys_common/backtrace.rs:171:18
   6: rust_begin_unwind
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/std/src/panicking.rs:647:5
   7: core::panicking::panic_fmt
             at /rustc/aedd173a2c086e558c2b66d3743b344f977621a7/library/core/src/panicking.rs:72:14
   8: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/scanning.rs:51:13
   9: mmtk_ruby::abi::ObjectClosure::c_function_registered
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/abi.rs:217:9
  10: rb_mmtk_call_object_closure
             at ./../mmtk_support.c:503:12
  11: rb_mmtk_mark
             at ./../mmtk_support.c:519:9
  12: rb_mmtk_mark_movable
             at ./../mmtk_support.c:561:5
  13: gc_mark
             at ./../gc/default.c:5051:9
  14: rb_gc_impl_mark
             at ./../gc/default.c:5147:5
  15: mark_method_entry_i
             at ./../gc.c:2214:5
  16: rb_id_table_foreach_values
             at ./../id_table.c:296:53
  17: mark_m_tbl
             at ./../gc.c:2223:9
  18: rb_gc_mark_children
             at ./../gc.c:2638:9
  19: gc_mark_children
             at ./../gc/default.c:5261:5
  20: rb_mmtk_mark_children
             at ./../gc/default.c:10574:5
  21: rb_mmtk_scan_object_ruby_style
             at ./../mmtk_support.c:537:5
  22: <mmtk_ruby::scanning::VMScanning as mmtk::vm::scanning::Scanning<mmtk_ruby::Ruby>>::scan_object_and_trace_edges::{{closure}}
             at /home/wks/projects/mmtk-github/mmtk-ruby/mmtk/src/scanning.rs:68:17

Some memory dump of the destination object (of the second stack trace):

dest - 8: 28 00 00 00 00 00 00 00   # The hidden size field.  The obj size is 40
dest    : 1a 60 1e 00 00 00 00 00   # 1a: imemo, 6: imemo_ment
dest + 8: 10 6c c0 ff 00 02 00 00
dest +16: 40 16 78 48 12 60 00 00
dest +24: 31 41 00 00 00 00 00 00
dest +32: e0 6a c0 ff 00 02 00 00

Since the header looks like a Ruby flags field, it should be an object that has been considered as dead by the GC, and its space has not been reused.

About forking and threading

When forking, only the current pthread is copied to the child process. That means other threads will die at a random place. mmtk-ruby already informs mmtk-core to let all GC threads to exit, so the problem can be from the other Ruby thread. It could be possible that the other Ruby thread inserted some dangling pointers into the global state while the current thread forked.

The Linux man page states that "in a multi-threaded program, the child can safely call only async-signal-safe functions (see signal-safety(7)) until such time as it calls execve(2)".

wks commented 1 month ago

The "Destination is not an MMTk object" bug is likely caused by not flushing the mod buffers of the threads disappeared at fork(). Since fork() only copies the current thread to the child process, other threads simply vanished, dying at wherever place they are currently running. That gives those threads no chance to flush their mutator-local mod buffers. In a generational GC, the mod buffer records old objects that have young children. Failing to flush mod buffers before the next nursery GC will cause young objects to be erroneously identified as dead while they are actually pointed by old objects.

The following snippet is guaranteed to cause an error when using StickyImmix and automatic heap size. The child process will see "Destination is not an MMTk object" errors in the next full-heap GC.

puts "[p0, t0] Hello world!"

puts "[p0, t0] Creating array..."
ar = Array::new(10, nil)

puts "[p0, t0] Forcing GC so that ar becomes old..."
GC::start

queue = Thread::Queue.new

puts "[p0, t0] Creating thread..."
Thread::new do
  puts "[p0, t1] Adding objects..."
  10.times do |i|
    ar[i] = "Foo #{i}" # This will add ar to the modbuf
  end

  puts "[p0, t1] Telling the main thread that we have done..."
  queue << :done

  puts "[p0, t1] Waiting almost forever..."
  sleep 1000000
end

puts "[p0, t0] Waiting for thread1 to add elements..."
queue.pop

if false # Change this to true and it will not crash
  puts "[p0, t0] Do another GC to make the children of ar old"
  GC::start
end

puts "[p0, t0] Forking..."
p1 = fork do
  puts "[p1, t0] Allocating enough live objects to trigger a full-heap GC..."
  ar2 = []
  10000.times do |i|
    ar2 << "Bar #{i}"
  end

  puts "[p1, t0] I should have crashed!"
  p ar

  exit 0
end

puts "[p0, t0] Joining p1 = #{p1}..."
_, status = Process.waitpid2(p1)
puts "[p0, t0] #{status}"

puts "[p0, t0] Goodbye world!"

The parent process creates a thread t1 to allocate new objects and add references to those objects into a mature array. This will add the array to the modbuf. The thread t1 then waits forever. When the main thread t0 calls fork, t1 simply disappears in the child process. That means the modbuf of t1 will simply be abandoned. The child process then allocate lots of live objects. MMTk will first run a nursery GC and then a full-heap GC when it fails to free up enough memory. The nursery GC will recycle the children of the array, and the full-heap GC will try to reach the dead children from the array and cause "Destination is not MMTk object" error.

If the parent process does a GC before forking (changing if false to if true), the thread t1 will be recognized when GC starts, and its modbuf will be flushed. That will correctly trace the children of the array, keep them alive, and make them old. Then it will not crash after forking.

mmtk / ruby

Error while running TestAutoload#test_autoload_fork #83

About forking and threading