Closed wks closed 3 weeks ago
It can also be observed while running test_autoload_fork
. Similar to TestProcess#test_wait_and_sigchild
, the stack trace is similar above rb_f_fork()
. It's likely to be caused by the same bug, and is related to forking.
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_print_backtrace+0x14) [0x5608c27e5243] ../vm_dump.c:824
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_vm_bugreport) ../vm_dump.c:1155
/home/wks/projects/mmtk-github/ruby/build-release/ruby(bug_report_end+0x0) [0x5608c279d210] ../error.c:1095
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_bug_without_die) ../error.c:1095
/home/wks/projects/mmtk-github/ruby/build-release/ruby(die+0x0) [0x5608c23d6a18] ../error.c:1103
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_bug) ../error.c:1105
/home/wks/projects/mmtk-github/ruby/build-release/ruby(obj_free_object_id+0x0) [0x5608c23c782f] ../gc/default.c:3372
/home/wks/projects/mmtk-github/ruby/build-release/ruby(finalize_list) ../gc/default.c:3343
/home/wks/projects/mmtk-github/ruby/build-release/ruby(finalize_deferred_heap_pages) ../gc/default.c:3431
/home/wks/projects/mmtk-github/ruby/build-release/ruby(finalize_deferred_heap_pages) ../gc/default.c:3420
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_gc_impl_shutdown_call_finalizer+0x4c2) [0x5608c2401532] ../gc/default.c:3606
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_ec_finalize+0x16) [0x5608c23dbfa6] ../eval.c:164
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_ec_cleanup) ../eval.c:255
/home/wks/projects/mmtk-github/ruby/build-release/ruby(ruby_cleanup+0x9) [0x5608c23dc5f6] ../eval.c:178
/home/wks/projects/mmtk-github/ruby/build-release/ruby(ruby_stop) ../eval.c:290
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_f_fork+0x1f) [0x5608c24aaf34] ../process.c:4401
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_f_fork) ../process.c:4391
/home/wks/projects/mmtk-github/ruby/build-release/ruby(vm_call_cfunc_with_frame_+0xb3) [0x5608c259e6fe] ../vm_insnhelper.c:3786
/home/wks/projects/mmtk-github/ruby/build-release/ruby(vm_call_cfunc_with_frame) ../vm_insnhelper.c:3832
/home/wks/projects/mmtk-github/ruby/build-release/ruby(vm_call_cfunc_other) ../vm_insnhelper.c:3858
/home/wks/projects/mmtk-github/ruby/build-release/ruby(vm_sendish+0xcf) [0x5608c25a525f] ../vm_insnhelper.c:5954
/home/wks/projects/mmtk-github/ruby/build-release/ruby(vm_exec_core+0x8a5) [0x5608c25bddf5] ../insns.def:871
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_vm_exec+0x255) [0x5608c25ae455] ../vm.c:2575
/home/wks/projects/mmtk-github/ruby/build-release/ruby(rb_vm_invoke_proc+0x5e) [0x5608c25b295e] ../vm.c:1743
/home/wks/projects/mmtk-github/ruby/build-release/ruby(thread_do_start_proc+0x13d) [0x5608c25599fd] ../thread.c:628
/home/wks/projects/mmtk-github/ruby/build-release/ruby(thread_do_start+0x12) [0x5608c255be0a] ../thread.c:645
/home/wks/projects/mmtk-github/ruby/build-release/ruby(thread_start_func_2) ../thread.c:708
/home/wks/projects/mmtk-github/ruby/build-release/ruby(call_thread_start_func_2+0x0) [0x5608c255cc05] ../thread_pthread.c:2236
/home/wks/projects/mmtk-github/ruby/build-release/ruby(nt_start) ../thread_pthread.c:2281
When I ran the tests, I was still using the whitelist, so test_gc.rb
and test_objectspace.rb
were not enabled. Therefore, all the final jobs added during tests are instances of Tempfile
. Currently, test_argf.rb
, test_syntax.rb
and test_io_buffer.rb
use Tempfile
during test.
The crash happens in the child process after forking. It needs to be reproduced by
Tempfile
instances,The following command is able to reproduce the bug with a high probability:
make test-all TESTS='../test/ruby/test_io_buffer.rb ../test/ruby/test_syntax.rb ../test/ruby/test_autoload.rb' RUN_OPTS="--mmtk-plan=StickyImmix" TESTOPTS="-v -n /ascii\\|line\\|read\\|write\\|fork/"
Here is a stacktrace collected in debug mode:
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_print_backtrace+0x20) [0x562c8ee67301] ../vm_dump.c:824
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_vm_bugreport+0x28c) [0x562c8ee67a18] ../vm_dump.c:1155
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_bug_without_die+0xb8) [0x562c8ec33d21] ../error.c:1095
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_bug+0xb1) [0x562c8ec33dfa] ../error.c:1103
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_mmtk_run_final_job+0x53) [0x562c8ec58db9] ../gc/default.c:3382
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_mmtk_run_final_jobs+0x1a) [0x562c8ec58e8f] ../gc/default.c:3424
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(finalize_deferred_heap_pages+0x21) [0x562c8ec58ecb] ../gc/default.c:3434
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_gc_impl_shutdown_call_finalizer+0xcb) [0x562c8ec59412] ../gc/default.c:3622
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_objspace_call_finalizer+0x11) [0x562c8ec69dd0] ../gc.c:1610
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_ec_finalize+0x22) [0x562c8ec3ed46] ../eval.c:164
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_ec_cleanup+0x45b) [0x562c8ec3f1fe] ../eval.c:258
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(ruby_cleanup+0x23) [0x562c8ec3ed9d] ../eval.c:178
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(ruby_stop+0x15) [0x562c8ec3f41a] ../eval.c:294
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_f_fork+0x57) [0x562c8ed1b432] ../process.c:4401
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(ractor_safe_call_cfunc_0+0x2c) [0x562c8ee2abc2] ../vm_insnhelper.c:3602
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_cfunc_with_frame_+0x262) [0x562c8ee2b7d0] ../vm_insnhelper.c:3786
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_cfunc_with_frame+0x72) [0x562c8ee2bade] ../vm_insnhelper.c:3832
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_cfunc_other+0x22f) [0x562c8ee2bd0f] ../vm_insnhelper.c:3858
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_cfunc+0x143) [0x562c8ee2c14a] ../vm_insnhelper.c:3940
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_method_each_type+0x1bd) [0x562c8ee2f1e4] ../vm_insnhelper.c:4765
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_method+0x17f) [0x562c8ee2fd66] ../vm_insnhelper.c:4902
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_call_general+0x2b) [0x562c8ee2ff11] ../vm_insnhelper.c:4935
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_sendish+0x2d3) [0x562c8ee328d3] ../vm_insnhelper.c:5954
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_exec_core+0x7585) [0x562c8ee3e19c] ../insns.def:871
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_vm_exec+0x12a) [0x562c8ee5a0ad] ../vm.c:2575
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(invoke_block+0xda) [0x562c8ee56958] ../vm.c:1524
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(invoke_iseq_block_from_c+0x20f) [0x562c8ee57473] ../vm.c:1594
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(invoke_block_from_c_proc) ../vm.c:1692
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(vm_invoke_proc) ../vm.c:1722
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(rb_vm_invoke_proc+0xa6) [0x562c8ee57c54] ../vm.c:1743
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(thread_do_start_proc+0x2ae) [0x562c8eddc48f] ../thread.c:628
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(thread_do_start+0x57) [0x562c8eddc500] ../thread.c:645
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(thread_start_func_2+0x3d4) [0x562c8eddc93e] ../thread.c:708
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(call_thread_start_func_2+0x64) [0x562c8edd359d] ../thread_pthread.c:2236
/home/wks/projects/mmtk-github/ruby/build-debug/ruby(nt_start+0x1d9) [0x562c8edd378d] ../thread_pthread.c:2281
/usr/lib/libc.so.6(0x714b8190e39d) [0x714b8190e39d]
/usr/lib/libc.so.6(0x714b8199349c) [0x714b8199349c]
Actually it is not related to forking, but the child process executes the shutting-down sequence which triggers the error.
The bug is triggered if both of the following are true on exit
finalizer_table
is emptyHere is a program that can trigger the bug (single-threaded, no forking)
puts "Hello world!"
def make_finalizer(msg)
proc do
puts msg
end
end
puts "Make a finalizable garbage"
ObjectSpace.define_finalizer(Object.new, make_finalizer("Finalizing..."))
puts "Trigger a GC by allocating a large object..."
"x"*1000000000
puts "Goodbye world!"
The so-called "final jobs" are the counterpart of "zombies" in the vanilla CRuby, but they are allocated using malloc. They are created during GC. Some "final jobs" run finalizers (i.e. "finalizer jobs") and others call dfree
for T_DATA
instances (i.e. "dfree jobs"). They are created from entries of the finalizer_table
that point to dead objects, and are enqueued onto heap_pages_deferred_final
.
Currently, if a GC is triggered by GC.start
, then the thread that calls GC.start
will "volunteer" to be the thread to run gc_finalize_deferred
. It will drain the linked list heap_pages_deferred_final
. However, if a GC is triggered by allocation, it will not run any "final jobs". This is a flaw, but since there is no guarantee how soon finalizers will be executed, those finalizers will eventually be executed when the VM quits. So far so good.
However, during shutting down, the current code asserts that after calling obj_free
for all objects (actually only some important types including files), there are no "final jobs" for finalizers, only those for "dfree". The problematic segment is (from rb_gc_impl_shutdown_call_finalizer
):
while (finalizer_table->num_entries) { // REPEAT UNTIL finalizer_table IS EMPTY
st_foreach(finalizer_table, rb_mmtk_evacuate_finalizer_table_on_exit_i, 0);
finalize_deferred_heap_pages(objspace);
}
rb_gc_set_obj_free_on_exit_started();
rb_mmtk_call_obj_free_on_exit();
finalize_deferred_heap_pages(objspace); // ASSERTION FAILURE HERE
It first drains the finalizer_table
so that all finalizers are executed before any obj_free
is executed. It drains the finalizer_table
repeatedly because one finalizer may define another finalizer.
The bug is, if finalizer_table
is empty from the beginning, it will not call finalize_deferred_heap_pages
at all, assuming there is no pending "final jobs". However, we don't drain the linked list heap_pages_deferred_final
if the last GC is triggered by allocation. That means there may be pending final jobs.
I decided not to run any final jobs after allocation failure because that would require a check after every allocation, which hurts the performance of allocation fast paths. It may be possible to run "final jobs" in the end of block_for_gc
, but block_for_gc
is called during allocation (mmtk::memory_manager::alloc
). We never expected alloc
to be reentrant in a single thread. Obviously, if we do so, the inner alloc
will overwrite the bump pointer of the outer alloc
. So for simplicity, I simply omitted draining heap_pages_deferred_final
after allocation.
It sets an interrupt trigger. The call chain is: newobj_slowpath
-> newobj_alloc
-> newobj_cache_miss
-> heap_next_free_page
-> heap_prepare
-> gc_continue
-> gc_sweep_continue
-> gc_sweep_rest
-> gc_sweep_step
-> gc_sweep_page
-> gc_finalize_deferred_register
-> rb_postponed_job_trigger(objspace->finalize_deferred_pjob)
The interrupt is checked (RUBY_VM_CHECK_INTS
) in many places in the VM, and will execute finalize_deferred_pjob
which is bound to gc_finalize_deferred
. One such place is vm_pop_frame
which is called when a function returns.
When using MMTk, we can also call rb_postponed_job_trigger(objspace->finalize_deferred_pjob)
to schedule the execution of "final jobs".
But I think this still doesn't guarantee "final jobs" will be executed before exiting. So in rb_gc_impl_shutdown_call_finalizer
, we should call finalize_deferred_heap_pages
unconditionally at least once to ensure pending final jobs on heap_pages_deferred_final
are drained regardless whether the finalizer_table
is empty or not.
Fixed in https://github.com/mmtk/ruby/commit/1c4102e7bcee03f66ca72f916cde0d2bd7a8638b (force pushed to fix a typo)
I think things will still go wrong if one thread forks while another thread is running finalizers. The child process will see that finalization is still in progress (objspace->atomic_flags.finalizing == 1
) and skip the finalization on exit, and may result in files not flushed on exit, and, more seriously, seeing some finalizers halfway executed. As long as tests are passing, I will not prioritize fixing that because it is an error to fork while having multiple threads.
There is a random failure when running
make test-all
in thedev/mmtk-overrides-default
branch. It prints the message "[BUG] Finalize job still exists after obj_free on exit has started.", but the stack trace shows it is inobj_free_object_id
. I think the stack trace is inaccurate because it is the release build. It is random. I observed this error when runningmake test-all
(using whitelist as before) on my laptop, but I am not able to reproduce this crash.