jvm-profiling-tools / honest-profiler

A sampling JVM profiler without the safepoint sample bias
https://github.com/RichardWarburton/honest-profiler/wiki
MIT License
1.25k stars 146 forks source link

JVM hangs on Mac OS X while profiling #80

Open RajivSignal opened 9 years ago

RajivSignal commented 9 years ago

I built the profiler from source after installing unittest-cpp from source and specifying the library to link as well as the include directories. I don't think I got any errors from the maven or the underlying make invocation. I got the liblagent dynamic library as expected in my builds directory. I had jdk 1.8 from Oracle installed for building honest-profiler as well as my own jar. I t started my jvm like this: java -agentpath:/Users/rajiv/honest-profiler/build/liblagent.dylib -jar ~/Downloads/perftest.jar

It says start and then just keeps spinning there forever consuming a ton of cpu till I kill it. So something like this:

Rajivs-MacBook-Pro:build rajiv$ java -agentpath:/Users/rajiv/honest-profiler/build/liblagent.dylib -jar ~/Downloads/perftest.jar Start My application prints something to the standard line as soon as it starts and I don't see any of that being printed either, so my guess is that my application did not even end up being started.

toddwest commented 9 years ago

I'm running into this exact same issue on OS X (Mavericks).

I can see that it's starting to execute some of my code but it appears to be doing it extremely slowly. The profiler is also generating about 18MB of logs per minute and is completely saturating my CPU.

cinquin commented 8 years ago

I have had a problem under Mac OS X (10.10.5 or 10.11.1) that is probably the same as described above. This problem does not seem to occur when sampling at large intervals. I got stack traces using lldb (example below) that show a thread performing the profiling that got a SIGSEGV signal, which is handled by the JVM's signal handler. As far as I can tell the thread just stays stuck in that state (the process itself does not crash, and the PROF signal keeps being delivered to other threads).

The problem seems to occur in a call to pthread_getspecific made by jni_GetEnv. Some Googling suggested that pthread_getspecific is not guaranteed to be async-safe, but that in practice it likely is.

* thread #37: tid = 0x597b4, 0x00007fff84d71bc5 libsystem_kernel.dylib`__error, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x00007fff84d71bc5 libsystem_kernel.dylib`__error
    frame #1: 0x0000000106915a59 libjvm.dylib`signalHandler(int, __siginfo*, void*) + 25
    frame #2: 0x00007fff92116eaa libsystem_platform.dylib`_sigtramp + 26
    frame #3: 0x00007fff83f8d762 libsystem_pthread.dylib`pthread_getspecific + 1
    frame #4: 0x00000001067b5a3d libjvm.dylib`jni_GetEnv + 118
    frame #5: 0x000000010746ce52 liblagent_old_debug.dylib`Profiler::handle(int, __siginfo*, void*) [inlined] JavaVM_::GetEnv(this=<unavailable>, penv=<unavailable>, version=65542) + 98 at jni.h:1921
    frame #6: 0x000000010746ce40 liblagent_old_debug.dylib`Profiler::handle(int, __siginfo*, void*) [inlined] Profiler::getJNIEnv(this=0x00007f94aa403d20) + 14 at profiler.cpp:92
    frame #7: 0x000000010746ce32 liblagent_old_debug.dylib`Profiler::handle(this=0x00007f94aa403d20, signum=<unavailable>, info=<unavailable>, context=0x00007000023ffe20) + 66 at profiler.cpp:76
    frame #8: 0x00007fff92116eaa libsystem_platform.dylib`_sigtramp + 26
    frame #9: 0x00007fff84d775af libsystem_kernel.dylib`__thread_selfid + 11
    frame #10: 0x00007fff83f8d2f6 libsystem_pthread.dylib`_pthread_set_self + 25
    frame #11: 0x00007fff83f8fb9e libsystem_pthread.dylib`_pthread_body + 14
    frame #12: 0x00007fff83f8fb90 libsystem_pthread.dylib`_pthread_start + 168
    frame #13: 0x00007fff83f8d375 libsystem_pthread.dylib`thread_start + 13
cinquin commented 8 years ago

So I've looked into this a bit more. The problem seems to boil down to the fact that GetEnv is not async-safe before a thread can safely read from its thread-local storage (GetEnv relies on thread-local storage). But the PROF signal starts being deliverable to a thread very early (and probably immediately) after the thread is created, even before the native thread is fully initialized. The shorter the sampling interval, the more likely it is that a new thread will receive the PROF signal before it can handle it without crashing in GetEnv.

Unfortunately, I do not see a practical way of determining from the signal handler whether the current thread has been fully initialized (if there was a way to do that, a thread could just skip profiling until it's ready). I see the following potential solutions to the problem: 1) Globally suspend profiling while a new thread is being created 2) Temporarily block the PROF signal in threads that spawn a new thread, and unblock the signal in the parent thread straight after child thread creation, and in the child thread once it is fully initialized

Option 2 seems better. In practical terms, both options could be implemented by patching the JVM or interposing calls to the JVM's os::create_thread function (which would be rather heavy-handed), or perhaps by instrumenting the Thread::start method. Either way, this would still miss native threads created directly by other JVM agents (not sure if they would also crash). Anybody has a better idea?

RichardWarburton commented 8 years ago

I don't think patching the JVM is a desirable option. Aside from the time taken to upstream patches into Openjdk it would mean that the problem still exists on older JVMs without the patch.

RichardWarburton commented 8 years ago

Thinking about it, 1. introduces too much sampling bias for my liking as well. If there's a way of identifying when the native thread is ready to be profiled then we can just set a flag that indicates whether profiling is "on" or "off" based upon whether the thread is ready to be profiled. The other question would be how to identify the thread without having to call a non-async-safe function.

nitsanw commented 8 years ago

If GenEnv is not async-safe we would have to find a different solution. Here's a sketch:

cinquin commented 8 years ago

@nitsanw The thought of using some sort of async-safe collection of thread ids had occurred to me as well. The problem is that this requires a platform-independent, async-safe way of retrieving a unique identifier for the current thread. I don't think pthread_self fits the bill, given the following stack trace (generated under OS X) that shows the signal handler being called even before pthread_set_self has completed as part of thread initialization.

* thread #58: tid = 0x2f8aa2, 0x00007fff95906bc5 libsystem_kernel.dylib`__error
  * frame #0: 0x00007fff95906bc5 libsystem_kernel.dylib`__error
    frame #1: 0x00000001073a00e8 libjvm.dylib`signalHandler(int, __siginfo*, void*) + 24
    frame #2: 0x00007fff8f5a5eaa libsystem_platform.dylib`_sigtramp + 26
    frame #3: 0x00007fff8da28409 libdyld.dylib`tlv_get_addr + 5
    frame #4: 0x00000001071ea760 libjvm.dylib`jni_GetEnv + 81
    frame #5: 0x0000000109a39e52 liblagent.dylib`Profiler::handle(int, __siginfo*, void*) [inlined] JavaVM_::GetEnv(this=<unavailable>, penv=<unavailable>, version=65542) + 82 at jni.h:1921
    frame #6: 0x0000000109a39e40 liblagent.dylib`Profiler::handle(int, __siginfo*, void*) [inlined] Profiler::getJNIEnv(this=0x00007f97bb604480) + 14 at profiler.cpp:88
    frame #7: 0x0000000109a39e32 liblagent.dylib`Profiler::handle(this=0x00007f97bb604480, signum=<unavailable>, info=<unavailable>, context=0x000070000393ee20) + 50 at profiler.cpp:74
    frame #8: 0x00007fff8f5a5eaa libsystem_platform.dylib`_sigtramp + 26
    frame #9: 0x00007fff9590c5af libsystem_kernel.dylib`__thread_selfid + 11
    frame #10: 0x00007fff86df32f6 libsystem_pthread.dylib`_pthread_set_self + 25
    frame #11: 0x00007fff86df5b9e libsystem_pthread.dylib`_pthread_body + 14
    frame #12: 0x00007fff86df5b90 libsystem_pthread.dylib`_pthread_start + 168
    frame #13: 0x00007fff86df3375 libsystem_pthread.dylib`thread_start + 13

Unless I'm missing something, this just leaves instrumenting the Thread::start method to temporarily block the PROF signal as a practical option.

I find it curious that not so many people seem to have run into this problem. Perhaps the profiler is mostly run by people using Windows or Linux, and the threading implementation used by those platforms is not susceptible to these problems...

nitsanw commented 8 years ago

@cinquin we had a thread local storage for Env before which resulted in allocation in the signal handler, leading to crashes. We then moved to calling GetEnv which 'resolved' the issue on Windows/Linux. This is dependant on GetEnv being async-safe, which you point out is not the case. GetEnv relies on ThreadLocalStorage::thread() which is not async safe (as in return NULL or valid value, and not lock up), see discussion here: https://sourceware.org/glibc/wiki/TLSandSignals So now we are back to square one on that. Digging around on Google leads to these JDK tickets: There's this bug "JNI GetEnv() race with ThreadLocalStorage::set_thread() guarantee": https://bugs.openjdk.java.net/browse/JDK-4831461 Also related is this bug: https://bugs.openjdk.java.net/browse/JDK-4850081 In which it's suggested that this has actually been fixed and that the JVM is actually using it's own TLS impl in which a call to 'thread' is safe. Digging around the OpenJDK source lead to the impl for TLS::thread() varying by CPU/OS. So this may still be an OS specific issue... I've done as much digging as I can allow this morning, will pick this up later. It may be that GetEnv blowing up is a JDK bug on MacOS...

cinquin commented 8 years ago

Given that async safety of thread local storage seems to be a bit of a mess, I side-stepped the issue by blocking the PROF signal when a new thread is being spawned. This prevents crashes in the signal handler both under Mac OS X and FreeBSD. Please see pull request #108.

cinquin commented 8 years ago

On the topic of async safety, it would also be great to get input on issue #109