**Open** — tenderlove opened this issue 1 month ago
cc @eileencodes
I can't run that exact command on my machine because it will exhaust the memory and the OS will start killing processes. We don't normally run with such a large heap size (16GiB) except when using NoGC. But when I removed `--mmtk-max-heap=16GiB`, other crashes may occur.
From the stack trace, one object was pointing to another object, but the `target_object` did not have the "valid object bit" (VO bit) set. That means the `target_object` had already been dead since the last GC, but its space had not yet been reused when the current GC was triggered. I think the interesting thing is the current object being scanned (i.e. the `object` parameter of the `scan_object_and_trace_edges` function). What type is `object`? Why does it point to the dead object?
From the stack trace, it looks like `object` is an array, and it is embedded.
```
frame #19: 0x0000561e5141c089 ruby`gc_ref_update_array(objspace=0x0000561e52df5cb0, v=0x0000020106a6ece0) at gc.c:10497:17
```
My guess is that the culprit is the following lines in `gc_ref_update_array`:

```c
if (rb_gc_obj_slot_size(v) >= rb_ary_size_as_embedded(v)) {
    if (rb_ary_embeddable_p(v)) {
        rb_ary_make_embedded(v);
    }
}
```
These lines are not guarded by the `#if USE_MMTK` macro, so they are still executed when using MMTk. They copy the payload from the out-of-object buffer (an xmalloc-ed buffer in Ruby's default GC, but an `imemo:mmtk-objbuf` in Ruby-MMTk) into the object, but do not forward any of its elements.
If you can use the rr tool (https://rr-project.org/), you can record the execution, replay it, and pinpoint the moment the field was last assigned.
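A typical rr workflow for this kind of bug looks like the following sketch (the script name and the watched address are placeholders; substitute the failing command and the address of the corrupted field):

```shell
rr record ./ruby --mmtk test_script.rb   # record one failing run
rr replay                                # replay it deterministically under gdb
# inside the replayed gdb session:
#   (gdb) watch -l *(VALUE *)0xADDRESS   # hardware watchpoint on the bad slot
#   (gdb) reverse-continue               # run backwards to the last write
```

Because the replay is deterministic, `reverse-continue` stops at the exact instruction that last stored the dangling reference.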
I looked at the code and found it is impossible for the condition `rb_gc_obj_slot_size(v) >= rb_ary_size_as_embedded(v)` to be satisfied when using MMTk. When using MMTk, the allocated size (slot size) of an object never changes. (In theory, we could re-allocate the object with a different size in `ObjectModel::copy`, but we are not doing that now.) For the condition to hold when `v` is a heap array (not embedded), its heap capacity needs to be smaller than the "embed capacity" of `v`. (See: https://github.com/Shopify/ruby/blob/d833fa5abaa8602333dbb25ddd092e90eac293c1/array.c#L216-L231) But all code paths that reduce the capacity of an array are guarded so that if the capacity gets smaller than the "embed capacity", the array becomes embedded again. The only remaining possibility is that the "embed capacity" suddenly increases, and that can only happen if the object is re-allocated into a larger object during GC.
> I can't run that exact command on my machine because it will exhaust the memory and the OS will start killing processes. We don't normally run with such a large heap size (16GiB) except when using NoGC
Our work machines have 32GB of memory, so we were setting it to 16GB to match the default on CI. Previously we weren't able to reproduce, so I theorized that less memory would get us crashes locally more often. Decreasing memory even further, from 16GB to 4GB, also lets me reproduce crashes I wasn't seeing before (or on CI).
We're seeing a crash when scanning references with MMTk. I'm able to reproduce it like this:
The error is like this:
Here is the backtrace:
It seems like there might be an issue with the bookkeeping in MMTk. The Ruby objects look correct, but it seems to have a problem with this particular `ObjectReference`.