jvm-profiling-tools / honest-profiler

A sampling JVM profiler without the safepoint sample bias
https://github.com/RichardWarburton/honest-profiler/wiki
MIT License
1.25k stars 146 forks source link

Intermittent crashes with OpenJDK 9 #137

Open cinquin opened 8 years ago

cinquin commented 8 years ago

I have an issue that is possibly not addressable within Honest Profiler itself and that might be more suitable for the HotSpot development mailing list, but I thought this might be a good place to start and see if anyone can provide insights. I've been getting intermittent SIGSEGV crashes running (under FreeBSD) the BSD port of OpenJDK 9, roughly equivalent to early access build b117, and a version of Honest Profiler with the commit in my pull request #108 (which fixes a different sort of crash). The new crashes do not occur with OpenJDK 8.

An example trace is as follows:

#17 0x000000173a8c63c1 in VMError::report_and_die (thread=0x0, sig=<error reading variable: Value out of range.>, pc=0x7d0 <error: Cannot access memory at address 0x7d0>, siginfo=0x1738b9593a <_write+10>, context=0x0) at /home/olivier/openjdk9/hotspot/src/share/vm/utilities/vmError.cpp:1071
#18 0x000000173a68889b in JVM_handle_bsd_signal (sig=11, info=0x7fffd605d0b0, ucVoid=0x7fffd605cd40, abort_if_unrecognized=<optimized out>) at /home/olivier/openjdk9/hotspot/src/os_cpu/bsd_x86/vm/os_bsd_x86.cpp:785
#19 0x000000173a6845a9 in signalHandler (sig=11, info=0x7fffd605d0b0, uc=0x7fffd605cd40) at /home/olivier/openjdk9/hotspot/src/os/bsd/vm/os_bsd.cpp:2972
#20 0x0000001738838977 in handle_signal (actp=<optimized out>, sig=11, info=0x7fffd605d0b0, ucp=0x7fffd605cd40) at /usr/src/lib/libthr/thread/thr_sig.c:243
#21 0x000000173883806c in thr_sighandler (sig=11, info=0x7fffd605d0b0, _ucp=0x7fffd605cd40) at /usr/src/lib/libthr/thread/thr_sig.c:188
#22 <signal handler called>
#23 frame::safe_for_sender (this=<optimized out>, thread=0x17b8cc1800) at /home/olivier/openjdk9/hotspot/src/cpu/x86/vm/frame_x86.cpp:135
#24 0x000000173a8674bc in JavaThread::pd_get_top_frame (this=0x17b8cc1800, fr_addr=0x7fffd605d510, ucontext=<optimized out>, isInJava=<optimized out>) at /home/olivier/openjdk9/hotspot/src/os_cpu/bsd_x86/vm/thread_bsd_x86.cpp:77
#25 0x000000173a171412 in AsyncGetCallTrace (trace=0x7fffd605d5b8, depth=128, ucontext=0x7fffd605e240) at /home/olivier/openjdk9/hotspot/src/share/vm/prims/forte.cpp:598
#26 0x000000173c22ca6e in Profiler::handle (this=0x173bc31330, signum=<optimized out>, info=<optimized out>, context=0x7fffd605e240) at /home/olivier/honest-profiler-3/src/main/cpp/profiler.cpp:79
#27 0x0000001738838977 in handle_signal (actp=<optimized out>, sig=27, info=0x7fffd605e5b0, ucp=0x7fffd605e240) at /usr/src/lib/libthr/thread/thr_sig.c:243
#28 0x000000173883806c in thr_sighandler (sig=27, info=0x7fffd605e5b0, _ucp=0x7fffd605e240) at /usr/src/lib/libthr/thread/thr_sig.c:188
#29 <signal handler called>
#30 0x00000017408a9860 in ?? ()
#31 0x0000001911e62cf8 in ?? ()
#32 0x00007fffd605e858 in ?? ()
#33 0x0000001766d85d8b in ?? ()
#34 0x00007fffd605e8b0 in ?? ()
#35 0x000000179fa3d308 in ?? ()
#36 0x0000000000000000 in ?? ()

The crash occurs in frame 23, which maps to the following line in openjdk9/hotspot/src/cpu/x86/vm/frame_x86.cpp:

sender_pc = (address) this->fp()[return_addr_offset];

There seems to be a check just above that fp is "safe"; perhaps that test is deficient.

This code seems to have changed between OpenJDK 8 and OpenJDK 9. I don't know enough about the way the JVM deals with frames to understand what the code is doing and how it ends with a return value for fp() that is invalid (if I remember correctly the value for return_addr_offset was 1 according to GDB, and was thus unlikely to be the problem).

Does anyone have insights as to what the problem might be, and how to fix it?

nitsanw commented 8 years ago

As you state, this looks like a JVM bug rather than an Honest-Profiler bug. It would be valuable to report it on the relevant OpenJDK mailing list and link here maybe? I'm vaguely familiar with the code in this area, but there are far more qualified people in the OpenJDK team.

cinquin commented 8 years ago

@nitsanw Yes I think I'll report it to the HotSpot folks. It would be useful to know if the problem is specific to my OpenJDK build or platform (especially since I don't have the problem reduced to a simple test case, and don't have a very convenient way of trying to reproduce it on an officially-supported OS). Has anyone been using Honest Profiler with OpenJDK 9? How is it working for you?

RichardWarburton commented 8 years ago

I have not, I think its worth trying to reproduce the issue without honest profiler running to see if its just a pure JDK9 problem.

cinquin commented 8 years ago

reproduce the issue without honest profiler running

The crashes only occur with Honest Profiler. (And the stack trace in my original post shows that the problem is in a method called by AsyncGetCallTrace, which is itself of course called by Honest Profiler.)

RichardWarburton commented 8 years ago

My apologies - I was reading this thread too fast ;) +1 to @nitsanw 's suggestion of reporting it on the openjdk mailing list.

nitsanw commented 8 years ago

@cinquin if you can make Solaris Studio work on that platform you should see same issue, also with other AGCT profilers:

JMC might have same issue.

cinquin commented 8 years ago

@nitsanw Yes I imagine I would see the same issue. But to get this fixed by the OpenJDK people it would probably help if the problem could be reproduced on an officially-supported OS; I'm not set up to do that right now.

Incidentally, it might be a good idea to add a set of "stress tests" to the tests that are already part of the project. The other types of crashes that I've fixed so far only occur when the timing is just right (or rather, just wrong) and I don't remember them being triggered by the test suite.