jhawthorn / vernier

📏 next generation CRuby profiler
https://vernier.prof/
MIT License
718 stars 15 forks source link

[BUG] pthread_kill failed when running vernier through Sidekiq #51

Closed DimitriosLisenko closed 4 months ago

DimitriosLisenko commented 5 months ago

Hello! Really excited to use vernier! I gave it a go yesterday by integrating it into a Sidekiq middleware and unfortunately the Sidekiq process is crashing with the following error in the middle of (some) jobs:

[BUG] pthread_kill failed
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin23]

-- C level backtrace information -------------------------------------------
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_vm_bugreport+0x9a0) [0x10332f110]
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_bug_without_die+0x15c) [0x1031699e8]
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_bug+0x1c) [0x103446880]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/bundler/gems/vernier-0c3f6dd66236/lib/vernier/vernier.bundle(_ZN19GlobalSignalHandler13record_sampleER10LiveSampleP17_opaque_pthread_t+0x64) [0x129e1ef4c]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/bundler/gems/vernier-0c3f6dd66236/lib/vernier/vernier.bundle(_ZN13TimeCollector17sample_thread_runEv) [0x129e1ec60]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/bundler/gems/vernier-0c3f6dd66236/lib/vernier/vernier.bundle(_ZN13TimeCollector19sample_thread_entryEPv) [0x129e1e578]
/usr/lib/system/libsystem_pthread.dylib(_pthread_start+0x88) [0x189b92034]

I tried generating a backtrace from a core dump, though I'm not sure it's more helpful than the above:

* thread #71
    frame #0: 0x0000000189b5a0dc libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x0000000189b91cc0 libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x0000000189a9da40 libsystem_c.dylib`abort + 180
    frame #3: 0x0000000103169ae8 libruby.3.2.dylib`die at error.c:776:5
    frame #4: 0x0000000103446884 libruby.3.2.dylib`rb_bug(fmt=<unavailable>) at error.c:800:5
    frame #5: 0x0000000129e1ef4c vernier.bundle`GlobalSignalHandler::record_sample(this=0x0000000129e2c368, sample=0x00000002cb080f78, pthread_id=0x00000002b1187000) at vernier.cc:1156:17
  * frame #6: 0x0000000129e1ec60 vernier.bundle`TimeCollector::sample_thread_run(this=0x000000029b6ac8d0) at vernier.cc:1239:58
    frame #7: 0x0000000129e1e578 vernier.bundle`TimeCollector::sample_thread_entry(arg=0x000000029b6ac8d0) at vernier.cc:1275:20
    frame #8: 0x0000000189b92034 libsystem_pthread.dylib`_pthread_start + 136

As an aside, with YJIT enabled it crashes even faster, but that's a separate issue probably.

Let me know if I can provide any additional information, will be happy to.

jhawthorn commented 5 months ago

Thanks for the report! pthread_kill failing likely means the original thread exited, which shouldn't happen (though it is something we could probably more safely recover from).

There's been some significant changes, which haven't yet had a release and it's possible that solves this (though I'm not sure as I didn't expect an existing issue). YJIT being enabled also shouldn't be an issue.

DimitriosLisenko commented 5 months ago

Got it, thank you! Looking forward to the next release in that case :)

UPDATE: FYI gave 0.4.0 a go and still happening:

[BUG] pthread_kill failed
ruby 3.2.2 (2023-03-30 revision e51014f9c0) [arm64-darwin23]

-- Crash Report log information --------------------------------------------
   See Crash Report log file in one of the following locations:
     * ~/Library/Logs/DiagnosticReports
     * /Library/Logs/DiagnosticReports
   for more details.
Don't forget to include the above Crash Report log file in bug reports.

-- C level backtrace information -------------------------------------------
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_vm_bugreport+0x9a0) [0x1053ab110]
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_bug_without_die+0x15c) [0x1051e59e8]
/Users/dima/.rvm/rubies/ruby-3.2.2/lib/libruby.3.2.dylib(rb_bug+0x1c) [0x1054c2880]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/gems/vernier-0.4.0/lib/vernier/vernier.bundle(_ZN19GlobalSignalHandler13record_sampleER10LiveSampleP17_opaque_pthread_t+0x90) [0x137f257d4]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/gems/vernier-0.4.0/lib/vernier/vernier.bundle(_ZN13TimeCollector17sample_thread_runEv+0x170) [0x137f2563c]
/Users/dima/.rvm/gems/ruby-3.2.2@appstore-rails-7.1/gems/vernier-0.4.0/lib/vernier/vernier.bundle(_ZN13TimeCollector19sample_thread_entryEPv+0x24) [0x137f25258]
/usr/lib/system/libsystem_pthread.dylib(_pthread_start+0x88) [0x180892034]
jhawthorn commented 5 months ago

Thanks for testing on 0.4.0. I'm not sure why this is happening, my only guess is that this is relate to a thread going away, but we have tests for that and Vernier should be able to see the thread going away and not attempt to profile it anymore. I could (and likely will) make this fail more safely, we could recover from this error, but first I'd like to figure out why it is occurring.

Can you provide any more details about which jobs cause this failure or what might be interesting or particular about your sidekiq workload?

jhawthorn commented 4 months ago

I still haven't diagnosed what causes this, but we added a failsafe in https://github.com/jhawthorn/vernier/commit/971f56e144ce3fc6a44032b1923eb57cba09408a and that is released in 0.5.1

That will almost certainly work around this particular error, but if it just turns into a different one please let me know 🙇‍♂

jhawthorn commented 4 months ago

Oh! I think I am seeing evidence of this when thread.kill is used but ONLY on Ruby 3.2 and on macos

DimitriosLisenko commented 4 months ago

Hey John! Sorry for the delay, been trying (and failing) to find a moment to dig a bit deeper into this :)

That does sound promising! We are indeed using Ruby 3.2, I was personally testing on MacOS, and our application does run a bunch of Sidekiq jobs that might use an internal thread pool or other gems that might spawn/kill threads (which I suppose is relatively rare for Sidekiq workloads).

I can try looking out for the specific place it crashes if you'd like - unless you think you have a solution already for the thread.kill issue? As an aside, does it look like both Ruby 3.1 and 3.3 are not affected?

jrochkind commented 2 months ago

I do seem to be running into this too -- Ruby 3.2 and MacOS indeed. I do not use sidekiq, but was trying to use vernier for the first time via test_prof and it's new vernier mode.

My console output is littered with VERNIER BUG: pthread_kill of [differnet numbers] failed.

Indeed, trying to run under Ruby 3.3 instead this problem does not seem to be there? (My app is not running on 3.3 in production yet, although does pass it's tests under 3.3).

I don't know enough about what's going on to know if this is helpful info? Not sure if my vernier output is still good despite those log lines?