Crash when scanning references

tenderlove commented 1 month ago

We're seeing a crash when scanning references with MMTk. I'm able to reproduce it like this:

$ make -j12 test-all RUBY_TESTOPTS="--excludes-dir=test/.excludes-mmtk" RUN_OPTS="--mmtk-plan=Immix --mmtk-max-heap=16GiB" RUST_LOG=none RUST_BACKTRACE=1

The error is like this:

ERROR: An MMTk GC thread panicked.  This is a bug.

Here is the backtrace:

(lldb) bt 25
* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007f09614789fc libc.so.6`__GI___pthread_kill at pthread_kill.c:44:76
    frame #1: 0x00007f09614789b0 libc.so.6`__GI___pthread_kill [inlined] __pthread_kill_internal(signo=6, threadid=139678026008128) at pthread_kill.c:78:10
    frame #2: 0x00007f09614789b0 libc.so.6`__GI___pthread_kill(threadid=139678026008128, signo=6) at pthread_kill.c:89:10
    frame #3: 0x00007f0961424476 libc.so.6`__GI_raise(sig=6) at raise.c:26:13
    frame #4: 0x00007f096140a7f3 libc.so.6`__GI_abort at abort.c:79:7
    frame #5: 0x00007f0961e4f597 libmmtk_ruby.so`std::sys::unix::abort_internal::h5ea671639b64db50 at mod.rs:376:14
    frame #6: 0x00007f0961820067 libmmtk_ruby.so`std::process::abort::h9bd74f18c581104d at process.rs:2279:5
    frame #7: 0x00007f09619404de libmmtk_ruby.so`mmtk_ruby::handle_gc_thread_panic::hae356fc76122fafe(panic_info=0x00007f09531e6418) at lib.rs:118:5
    frame #8: 0x00007f09618c8aec libmmtk_ruby.so`mmtk_ruby::set_panic_hook::_$u7b$$u7b$closure$u7d$$u7d$::h7a90161210938483(panic_info=0x00007f09531e6418) at lib.rs:130:13
    frame #9: 0x00007f0961e5b50a libmmtk_ruby.so`std::panicking::rust_panic_with_hook::hbcae08ba0ccf4c11 [inlined] _$LT$alloc..boxed..Box$LT$F$C$A$GT$$u20$as$u20$core..ops..function..Fn$LT$Args$GT$$GT$::call::h0cb0cdcc8d325957(self=<unavailable>, args=(&core::panic::panic_info::PanicInfo) @ scalar) at boxed.rs:2021:9
    frame #10: 0x00007f0961e5b4f8 libmmtk_ruby.so`std::panicking::rust_panic_with_hook::hbcae08ba0ccf4c11(payload=&mut dyn core::panic::PanicPayload @ 0x00000000043f7b80, message=<unavailable>, location=0x00007f0962020ac0, can_unwind=true, force_no_backtrace=false) at panicking.rs:783:13
    frame #11: 0x00007f0961e3b63e libmmtk_ruby.so`std::panicking::begin_panic_handler::_$u7b$$u7b$closure$u7d$$u7d$::h9083954869cd1afa at panicking.rs:657:13
    frame #12: 0x00007f0961e3b476 libmmtk_ruby.so`std::sys_common::backtrace::__rust_end_short_backtrace::hb8caade51576a0bc(f=<unavailable>) at backtrace.rs:170:18
    frame #13: 0x00007f0961e5b0c2 libmmtk_ruby.so`rust_begin_unwind(info=<unavailable>) at panicking.rs:645:5
    frame #14: 0x00007f0961821a45 libmmtk_ruby.so`core::panicking::panic_fmt::hef0307862026e6f9(fmt=<unavailable>) at panicking.rs:72:14
    frame #15: 0x00007f09618f21fe libmmtk_ruby.so`_$LT$mmtk_ruby..scanning..VMScanning$u20$as$u20$mmtk..vm..scanning..Scanning$LT$mmtk_ruby..Ruby$GT$$GT$::scan_object_and_trace_edges::_$u7b$$u7b$closure$u7d$$u7d$::h7b36c45ab74d703c(_worker=0x0000561e5303dee0, target_object=ObjectReference @ 0x00007f09531e6598, pin=false) at scanning.rs:42:13
    frame #16: 0x00007f09619f1c7e libmmtk_ruby.so`mmtk_ruby::abi::ObjectClosure::c_function_registered::h128735149f1fdd01(rust_closure=0x00007f09531e6a58, worker=0x0000561e5303dee0, object=ObjectReference @ 0x00007f09531e6858, pin=false) at abi.rs:217:9
    frame #17: 0x0000561e51481404 ruby`rb_mmtk_call_object_closure(object=0x000002011096d958, pin=false) at mmtk_support.c:504:12
    frame #18: 0x0000561e514815f7 ruby`rb_mmtk_maybe_forward(value=0x000002011096d958) at mmtk_support.c:599:23
    frame #19: 0x0000561e5141c089 ruby`gc_ref_update_array(objspace=0x0000561e52df5cb0, v=0x0000020106a6ece0) at gc.c:10497:17
    frame #20: 0x0000561e5141d52b ruby`gc_update_object_references(objspace=0x0000561e52df5cb0, obj=0x0000020106a6ece0) at gc.c:10942:9
    frame #21: 0x0000561e51426926 ruby`rb_mmtk_update_object_references(obj=0x0000020106a6ece0) at gc.c:14914:5
    frame #22: 0x0000561e51481489 ruby`rb_mmtk_scan_object_ruby_style(object=0x0000020106a6ece0) at mmtk_support.c:539:5
    frame #23: 0x00007f09618f456a libmmtk_ruby.so`_$LT$mmtk_ruby..scanning..VMScanning$u20$as$u20$mmtk..vm..scanning..Scanning$LT$mmtk_ruby..Ruby$GT$$GT$::scan_object_and_trace_edges::_$u7b$$u7b$closure$u7d$$u7d$::hb64ebe3b4934749e at scanning.rs:59:17
    frame #24: 0x00007f09619ebfdc libmmtk_ruby.so`mmtk_ruby::abi::ObjectClosure::set_temporarily_and_run_code::h75eee1ae6629753c(self=0x00007f0948001670, visit_object={closure_env#0}<mmtk::scheduler::gc_work::ProcessEdgesWorkTracer<mmtk::scheduler::gc_work::PlanProcessEdges<mmtk_ruby::Ruby, mmtk::plan::immix::global::Immix<mmtk_ruby::Ruby>, 1>>> @ 0x00007f09531e6a58, f={closure_env#1}<mmtk::scheduler::gc_work::ProcessEdgesWorkTracer<mmtk::scheduler::gc_work::PlanProcessEdges<mmtk_ruby::Ruby, mmtk::plan::immix::global::Immix<mmtk_ruby::Ruby>, 1>>> @ 0x00007f09531e6aa8) at abi.rs:201:22

It seems like there might be an issue with the book keeping in MMTk. The Ruby objects look correct, but it seems to have a problem with the particular ObjectReferece.

(lldb) f 15
frame #15: 0x00007f09618f21fe libmmtk_ruby.so`_$LT$mmtk_ruby..scanning..VMScanning$u20$as$u20$mmtk..vm..scanning..Scanning$LT$mmtk_ruby..Ruby$GT$$GT$::scan_object_and_trace_edges::_$u7b$$u7b$closure$u7d$$u7d$::h7b36c45ab74d703c(_worker=0x0000561e5303dee0, target_object=ObjectReference @ 0x00007f09531e6598, pin=false) at scanning.rs:42:13
   39                   target_object,
   40                   if pin { " pin" } else { "" }
   41               );
-> 42               debug_assert!(
   43                   mmtk::memory_manager::is_mmtk_object(target_object.to_raw_address()).is_some(),
   44                   "Destination is not an MMTk object. Src: {object} dst: {target_object}"
   45               );
libmmtk_ruby.so`_$LT$mmtk_ruby..scanning..VMScanning$u20$as$u20$mmtk..vm..scanning..Scanning$LT$mmtk_ruby..Ruby$GT$$GT$::scan_object_and_trace_edges::_$u7b$$u7b$closure$u7d$$u7d$::h7b36c45ab74d703c:
->  0x7f09618f21fe <+942>: ud2    
    0x7f09618f2200 <+944>: mov    rax, qword ptr [rsp + 0x150]
    0x7f09618f2208 <+952>: add    rsp, 0x2d8
    0x7f09618f220f <+959>: ret    
(lldb) p target_object
(mmtk::util::address::ObjectReference) $1 = {
  __0 = (__0 = 2203596544344)
}
(lldb) rp 2203596544344
bits: [     ]
T_STRING: [UTF_8] [7BIT] (const char[5]) $3 = "line1"
(lldb) p object
(mmtk::util::address::ObjectReference) $4 = {
  __0 = (__0 = 2203429825760)
}
(lldb) rp 2203429825760
bits: [     ]
T_ARRAY: len=2 (embed)
(const const VALUE[1]) $6 = ([0] = 0x000002011096d958)
(lldb)

tenderlove commented 1 month ago

cc @eileencodes

wks commented 1 month ago

I can't run that exact command on my machine because it will exhaust the memory and the OS will start killing processes. We don't normally run with such a large heap size (16GiB) except when using NoGC. But when I removed the --mmtk-max-heap=16GiB, other crashes may occur.

From the stack trace, one object was pointing to another object, but the target_object did not have the "valid object bit" (VO bit). It means the target_object had already been already dead since the last GC, but its space had not yet been reused when the current GC was triggered. I think the interesting thing is the current object being scanned (i.e. the object parameter of the scan_object_and_trace_edges function). What type is object? Why does it point to the dead object?

From the stack trace, it looks like object is an array, and it is embedded.

    frame #19: 0x0000561e5141c089 ruby`gc_ref_update_array(objspace=0x0000561e52df5cb0, v=0x0000020106a6ece0) at gc.c:10497:17

My guess is that the culprit is the following lines in gc_ref_update_array:

        if (rb_gc_obj_slot_size(v) >= rb_ary_size_as_embedded(v)) {
            if (rb_ary_embeddable_p(v)) {
                rb_ary_make_embedded(v);
            }
        }

These lines are not guarded by the #if USE_MMTK macro, so they will still be executed when using MMTk. These lines will copy the payload from the out-of-object buffer (a xmalloc-ed buffer in Ruby's default GC, but an imemo:mmtk-objbuf in Ruby-MMTk), but does not forward any of its elements.

If you can use the rr (https://rr-project.org/) tool, you can record the execution and replay it, and pinpoint the moment the field was last assigned to.

wks commented 1 month ago

I looked at the code and found it is impossible for the condition rb_gc_obj_slot_size(v) >= rb_ary_size_as_embedded(v) to be satisfied when using MMTk. When using MMTk, the allocated size (slot size) of an object never changes. (In theory, we can re-allocate the object with a different size in ObjectModel::copy, but we are not doing it now.) For rb_ary_size_as_embedded(v) to be true, if v is heap (not embedded), its heap capacity needs to be smaller than the "embed capacity" of v. (See: https://github.com/Shopify/ruby/blob/d833fa5abaa8602333dbb25ddd092e90eac293c1/array.c#L216-L231) But all code paths that reduces the capacity of an array are guarded so that if it gets smaller than the "embed capacity", the array will become embedded again. The only possibility is that the "embed capacity" suddenly increases. That can only happen if the object can be re-allocated into a larger object during GC.

eileencodes commented 1 month ago

I can't run that exact command on my machine because it will exhaust the memory and the OS will start killing processes. We don't normally run with such a large heap size (16GiB) except when using NoGC

Our work machines have 32GB of memory so we were setting it to 16GB to match the default on CI. Previously we weren't able to reproduce so I theorized that less memory would get us crashes locally more often. Decreasing memory even more than 16GB to 4GB also let's me reproduce crashes I wasn't seeing before (or on CI).

mmtk / mmtk-ruby

Crash when scanning references #91