pelson opened this issue 3 years ago
A lot happens during the shutdown sequence, and a lot can go wrong.
1) A thread using a proxy open to Python is interrupted and still has resources open, which then crash due to a race between the access and the shutdown (a sketch of this scenario follows below).
2) A Python resource was released improperly during the cleanup routine, and Python then tries to reuse memory that should not have been freed (freed twice, or the wrong memory freed).
3) A released resource gets accessed by something during shutdown, which results in recreating resources while the system is shutting down; this ends up touching a dead C++ resource.
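To make the first scenario concrete, here is a minimal, hypothetical sketch (not taken from the issue) of the kind of race described: a Java daemon thread keeps calling back into Python through a proxy while the interpreter is shutting down.

```python
# Hypothetical sketch of failure mode 1: a Java daemon thread still holding a
# Python proxy while the interpreter shuts down. Not a guaranteed reproducer.
import time
import jpype

jpype.startJVM()

@jpype.JImplements("java.lang.Runnable")
class Worker:
    @jpype.JOverride
    def run(self):
        # Touches Python objects from a Java thread; if Python begins exiting
        # mid-call, this races against the shutdown sequence.
        while True:
            _ = str(time.time())

thread = jpype.JClass("java.lang.Thread")(Worker())
thread.setDaemon(True)
thread.start()
# The main thread falls off the end here, so Python starts shutting down
# while the daemon thread is still live inside the proxy.
```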
So my usual approach is to first add a println to each stage of the shutdown in org.jpype.JPypeContext, then disable each stage to see if I can isolate which part of the process is having an issue. Then I try to catch the failure in gdb, though depending on the failure this can be hard to interpret. (In some places the car leaves the cliff but doesn't hit the sea until much later.) I also use the Python trace to make sure there are no Python calls interacting with the shutdown process. Then the single worst procedure is adding a print to trace each ref/unref and then using a Python script to try to locate bad pairs, if it appears to be a ref-counting error.
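For the ref/unref pairing step, a small script along these lines is usually enough; the log format assumed here (lines containing "ref 0x..." / "unref 0x...") is hypothetical, so the regex would need to match whatever the added prints actually emit.

```python
# Sketch of a ref/unref pairing script; the trace line format is an assumption.
from collections import Counter
import re
import sys

pattern = re.compile(r"\b(ref|unref)\s+(0x[0-9a-fA-F]+)")
balance = Counter()

with open(sys.argv[1]) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            kind, addr = match.groups()
            balance[addr] += 1 if kind == "ref" else -1

# Anything that does not net out to zero is a candidate bad pair.
for addr, net in sorted(balance.items()):
    if net != 0:
        print(f"{addr}: net reference count {net:+d}")
```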
It is a very hard process to isolate shutdown errors, though I would appreciate it if you could give it a shot.
Oh, and I missed another common one: a race condition during the creation of a resource registers the resource twice. This will result in a failure during the shutdown process when the memory is cleaned up.
Possible dupe of #842.
Can you look over #937 to see which options are needed to prevent the issue?
Setting either jpype.config.free_resources = True or jpype.config.free_jvm = True stops the core dump from occurring.
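For anyone following along, this is roughly where those options get set; the flag names and values below are taken as reported in this thread, and whether they exist depends on the installed JPype version.

```python
# Sketch only: flag names and values are taken from the comment above and
# assumed to be available in the installed JPype version.
import jpype
import jpype.config

jpype.config.free_resources = True
jpype.config.free_jvm = True

jpype.startJVM()
# ... run the workload that normally core dumps at interpreter exit ...
```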
Now that I have an editable build of the above docker image, I went ahead and built with trace enabled. Here is one successful run (of many) and one failing run:
In looking through this output, I wanted to make sure that I wasn't being tricked by any kind of stdout buffering, so I added stdbuf -i0 -o0 -e0 to the start of the process... and wasn't able to reproduce the error. Is this another clue perhaps? I have subsequently found that piping stdout to stderr also seems to prevent the core dump from occurring.
Given the problem is related to logging in Java, I wonder if this is something to do with references to the stdout/stderr streams somehow.
The log indicates the failure is currently in a portion of the process that is not instrumented by the trace command. There is a compiler switch which turns on tracing of the memory accesses, but before going down that road I would first try to catch it in a gdb log. There is a pattern in the Azure scripts for catching a backtrace automatically, which I use when a segfault happens only in the CI and not locally; it may be useful here so you don't need to run it manually just to catch the occasional segfault.
The command line for automatic stack traces is in .azure/scripts/debug.yml
Thanks, I tried it, but since the reproducer is sensitive to stream redirection, and gdb is going to capture that stream in order to present it inside gdb, I couldn't reproduce the problem in gdb.
FWIW, I tried turning on SIGSEGV printing and I can regularly see them being fired on the non-main thread. I've no idea if that is normal, but these ones don't seem to crash Python itself.
I therefore went rummaging through the log, and went to read about the DestroyJavaVM call that JPype makes. That documentation clearly states that "Unloading of the VM is not supported." When I disable the unloading (the step which clears the VM pointer and calls dlclose), the error no longer occurs. Are we basically doing something we shouldn't be doing here?
Could you let me know what the memory-access compiler switch is, and I will see if that can point us any further toward what is happening.
Well, if the crash is a result of the unload then it is certainly something we can disable. The unload is definitely before my time and may in fact be incorrect; I am guessing that it was just there to free up memory. The original code just unloaded the JVM and could in fact reload it a second time. This caused the zombie objects from the first JVM load to try to resurrect themselves, which was a sure-fire crash. I added logic to prevent such silliness.
In general, unloading a library is suspect. There is always memory someplace that may still be referenced, or sitting on the stack of some inactive thread. Since Java doesn't have any way of stopping daemon threads, there is certainly the possibility that we hit a segfault as the rug gets pulled out from under us.
Though this goes more towards my belief that the user-issued shutdown command is pretty pointless in JPype. The idea of completely stopping the JVM is impossible; the best we can do is tell Java we plan to exit so it can perform cleanup. If that is the case, then the code that kills all the C++ wrappers is likely pointless: we are just going to exit anyway, and collecting all those pointers to be deleted is just wasted effort.
As far as the non-main thread goes, the shutdown command should only be respected from within the Python main thread (or actually just the thread that started JPype, as we assume it is the main one). But I assume that is the case, and the crash is likely coming from some daemon thread which suddenly finds the floor has fallen out from under it thanks to the unload. If so, there is no real reason for further debugging; the correct solution is to make free_jvm false by default, and maybe even just delete that option.
This would be a reasonable explanation of the fault. The logger is a daemon thread in Java; if you have lots of modules logging, the logger will end up backed up with messages to write out. When the library gets unloaded, the logging thread may still be in the runnable state rather than waiting. The unload deletes the memory that is being accessed and the method crashes, which should hit the Java page-fault routine, but as we are unloading, the page-fault routine is likely being unloaded as well. And assuming DestroyJavaVM didn't uninstall those hooks, the fault handlers may also be left in a bad state. Thus nothing is there to intercept the fault, so we crash.
For reference, should you want to do further debugging, memory tracing is enabled by bit 2 of the _jpype.trace() command:
```cpp
void JPypeTracer::tracePythonObject(const char* msg, PyObject* ref)
{
    if ((_PyJPModule_trace & 2) == 0)
        return;
    ...
```
These are internal debugging flags, so they tend to change whenever more information is required. As I got tired of recompiling or reading massive logs, I made it so that you can enable a trace for only a relevant section of code.
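As a usage sketch only: the call signature below is an assumption based on the C++ snippet above (a bitmask argument), and it only does anything in a build compiled with tracing enabled.

```python
# Assumed usage of the internal tracing hook; the signature is inferred from
# the C++ snippet above, not documented API.
import jpype
import _jpype

jpype.startJVM()
_jpype.trace(2)   # bit 2: trace Python object ref/unref
# ... exercise only the section of code under suspicion ...
_jpype.trace(0)   # disable tracing again
```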
Looking through the source I see the following flags available
I've been investigating a spurious segmentation fault which happens after pytest completes and the Python shutdown process begins. The core dump is always happening inside jpype._core._JTerminate. I'm seeing about 1 in 20 cases failing, so I assume there is a race condition on Python+Java exit. I don't have a full debug build, so I can't provide a full stack trace presently, but I wanted to document the existence of the problem. It has been a challenging one to strip down to a simple reproducer case, but the following will do it... it is all encapsulated in a gist at https://gist.github.com/pelson/2bf4c0d93e4492ddeaf7ee5596f14ba3. The code itself:
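(The exact reproducer is the code in the gist above; the following is only an illustrative sketch of its shape, with a hypothetical JAR directory and logger name.)

```python
# Illustrative sketch of the reproducer described above (the real code is in
# the gist); paths and names are hypothetical.
import glob
import jpype

# Start the JVM with the handful of log4j JARs on the classpath.
jpype.startJVM(classpath=glob.glob("jars/*.jar"))

LogManager = jpype.JClass("org.apache.logging.log4j.LogManager")
log = LogManager.getLogger("repro")
for i in range(1000):
    log.info("message {}".format(i))

# No explicit jpype.shutdownJVM(): the crash happens intermittently in the
# atexit hook (jpype._core._JTerminate) as the interpreter exits.
```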
The JARs are limited to 4 log4j packages. I didn't try to whittle it down further, because I found that the fewer JARs there are, the less likely it is to segfault at exit (a clue?).
The gist could also be cloned and run in docker:
This will run the test case around 100 times, and fail as soon as a non-zero exit code is encountered (core dump).
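The loop itself is presumably a shell one-liner in the gist; an equivalent sketch in Python (the script name repro.py is hypothetical) would be:

```python
# Sketch of the "repeat until non-zero exit" loop described above.
import subprocess
import sys

for attempt in range(100):
    result = subprocess.run([sys.executable, "repro.py"])
    if result.returncode != 0:
        print(f"attempt {attempt}: exit code {result.returncode} (core dump?)")
        sys.exit(1)
print("no failure in 100 runs")
```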
I'd be really happy to track this down and solve it, as in a real-life situation (300+ JARs) I was able to trigger the core dump more often than not. I don't yet have the tools to track this down further - I could do with setting up debug builds of JPype, OpenJDK, and CPython. Indeed, perhaps such a tool would be something good to have in a docker image... :thinking:
Any input on this would be very gratefully received.