tmm1 / stackprof

a sampling call-stack profiler for ruby 2.2+
MIT License
2.08k stars 128 forks source link

Ensure VM is running in signal handler #200

Closed ianks closed 1 year ago

ianks commented 1 year ago

As of a few weeks ago, there's been an increase of sigabrt in prod for us. After hunting the issue down a bit with @peterzhu2118, we honed in on a case where stackprof is signaled and ruby_current_vm_ptr is null. This seems to be happening after SIGQUIT'ing a unicorn worker during rb_during_gc.

* thread #1, name = 'ruby', stop reason = signal SIGABRT
  * frame #0: 0x00007f10787f100b libc.so.6`raise + 203
    frame #1: 0x00007f10787d0859 libc.so.6`abort + 299
    frame #2: 0x0000562e2c372d4f ruby`rb_bug_for_fatal_signal.cold at error.c:776:5
    frame #3: 0x0000562e2c4d56cd ruby`sigsegv(sig=<unavailable>, info=<unavailable>, ctx=<unavailable>) at signal.c:964:5
    frame #4: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #5: 0x0000562e2c396b0b ruby`rb_during_gc at gc.c:11006:12
    frame #6: 0x00007f1070f45391 stackprof.so`stackprof_signal_handler at stackprof.c:745:34
    frame #7: 0x00007f1070f45382 stackprof.so`stackprof_signal_handler(sig=<unavailable>, sinfo=<unavailable>, ucontext=<unavailable>) at stackprof.c:720
    frame #8: 0x00007f10789cf420 libpthread.so.0`__restore_rt
    frame #9: 0x00007f1066be70f4 libselinux.so.1`___lldb_unnamed_symbol8$$libselinux.so.1 + 36
    frame #10: 0x00007f10790daf6b ld-linux-x86-64.so.2`___lldb_unnamed_symbol59$$ld-linux-x86-64.so.2 + 523
    frame #11: 0x00007f10787f48a7 libc.so.6`___lldb_unnamed_symbol191$$libc.so.6 + 247
    frame #12: 0x00007f10787f4a60 libc.so.6`exit + 32
    frame #13: 0x00007f10787d208a libc.so.6`__libc_start_main + 250
    frame #14: 0x0000562e2c373fde ruby`_start + 46

Since signal safety is hard to reason about, let's add a sanity check and not assume the Ruby VM is active just because the handler is called.

ianks commented 1 year ago

Since ruby_current_vm_ptr is not a public symbol, I had to declare it with as an extern. I think the symbol name has changed in various ruby versions, so it would be best to use stable API to do this check.

Is there any other public API that I can use instead to ensure the VM is running? Otherwise, I’ll plunge forward with detecting the various names of this symbol across Ruby versions.

ivoanjo commented 1 year ago

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

peterzhu2118 commented 1 year ago

On Ruby 2.4 and before, it should be called ruby_current_vm rather than ruby_current_vm_ptr. But at any rate, we shouldn't rely on ruby_current_vm_ptr being available for extensions because it isn't marked as public (it isn't wrapped within a RUBY_SYMBOL_EXPORT_BEGIN/RUBY_SYMBOL_EXPORT_END region). AFAIK there isn't a public API to determine if the Ruby VM is available.

ianks commented 1 year ago

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

That may help, and is probably a good change in its own right. I would still like to add this guard though since it improves the async signal safety of the handler so it's tricky to reason about.

tenderlove commented 1 year ago

Perhaps adding an at_exit { StackProf.stop } would be a possible solution? I've ran into this in #157 as well.

I think stackprof should probably do this by default. Either that or its callees should check for existence of the VM.

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

ianks commented 1 year ago

We could check for the VM in stackprof, but that seems kind of fraught as the global is private which means it can be renamed at any time and the compiler is free to do whatever it wants with the symbol name.

Yeah I have similar feelings, and very open to other ways of solving this. Maybe a null check in rb_during_gc would be better?

peterzhu2118 commented 1 year ago

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

https://github.com/tmm1/stackprof/blob/52d1df6502eaede5061a2b5d28026e982bcf262a/ext/stackprof/stackprof.c#L745-L764

ianks commented 1 year ago

Maybe a null check in rb_during_gc would be better?

No I don't think that will solve the underlying problem since the code in stackprof that is crashing will then instead crash a few lines down in places like rb_postponed_job_register_one if the VM doesn't exist.

https://github.com/tmm1/stackprof/blob/52d1df6502eaede5061a2b5d28026e982bcf262a/ext/stackprof/stackprof.c#L745-L764

I think you are right.

A good stop gap may be to register a new at_exit proc with our own variable (bool ruby_vm_exited) that we can test against, rather than using the VM*

ianks commented 1 year ago

@tenderlove @peterzhu2118 Thoughts on my updates?

tenderlove commented 1 year ago

@ianks lgtm! Thanks for the patch!

ianks commented 1 year ago

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

tenderlove commented 1 year ago

FWIW, I think the failing test may be flaky? I noticed it failing on master as well locally (on 3.2).

Ya, it's flaky. I'll merge this and ship it. I think we're trying to measure some GC stuff in the tests and it's not predictable.