pelson opened this issue 3 years ago
A lot happens during the shutdown sequence, and a lot can go wrong.
1) A thread using a proxy open to Python is interrupted and still has resources open, which then crash due to a race between the access and the shutdown (a sketch of this scenario follows below).
2) A Python resource was released improperly during the cleanup routine, and Python then tries to reuse memory that should not have been freed (freed twice, or the wrong memory freed).
3) A released resource gets accessed by something during shutdown, which results in recreating resources while the system is shutting down; this ends up touching a dead C++ resource.
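To make the first scenario concrete, here is a minimal, hypothetical sketch (not taken from the issue) of the kind of race described: a Java daemon thread keeps calling back into Python through a proxy while the interpreter is shutting down.

```python
# Hypothetical sketch of failure mode 1: a Java daemon thread still holding a
# Python proxy while the interpreter shuts down. Not a guaranteed reproducer.
import time
import jpype

jpype.startJVM()

@jpype.JImplements("java.lang.Runnable")
class Worker:
    @jpype.JOverride
    def run(self):
        # Touches Python objects from a Java thread; if Python begins exiting
        # mid-call, this races against the shutdown sequence.
        while True:
            _ = str(time.time())

thread = jpype.JClass("java.lang.Thread")(Worker())
thread.setDaemon(True)
thread.start()
# The main thread falls off the end here, so Python starts shutting down
# while the daemon thread is still live inside the proxy.
```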
So my usual approach is to first add a println to each stage of the shutdown in org.jpype.JPypeContext, then disable each stage to see if I can isolate which part of the process is having an issue. Then I try to catch the failure in gdb, though depending on the failure this can be hard to interpret. (In some places the car leaves the cliff but doesn't hit the sea until much later.) I also use the Python trace to make sure there are no Python calls interacting with the shutdown process. Then the single worst procedure is adding a print to trace each ref/unref and then using a Python script to try to locate bad pairs, if it appears to be a ref-counting error.
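For the ref/unref pairing step, a small script along these lines is usually enough; the log format assumed here (lines containing "ref 0x..." / "unref 0x...") is hypothetical, so the regex would need to match whatever the added prints actually emit.

```python
# Sketch of a ref/unref pairing script; the trace line format is an assumption.
from collections import Counter
import re
import sys

pattern = re.compile(r"\b(ref|unref)\s+(0x[0-9a-fA-F]+)")
balance = Counter()

with open(sys.argv[1]) as log:
    for line in log:
        match = pattern.search(line)
        if match:
            kind, addr = match.groups()
            balance[addr] += 1 if kind == "ref" else -1

# Anything that does not net out to zero is a candidate bad pair.
for addr, net in sorted(balance.items()):
    if net != 0:
        print(f"{addr}: net reference count {net:+d}")
```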
It is a very hard process to isolate shutdown errors, though I would appreciate it if you could give it a shot.
Oh, and I missed another common one: a race condition during the creation of a resource registers the resource twice. This will result in a failure during the shutdown process when the memory is cleaned up.
Possible dupe of #842.
Can you look over #937 to see which options are needed to prevent the issue?
Setting either jpype.config.free_resources = True or jpype.config.free_jvm = True stops the core dump from occurring.
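For anyone following along, this is roughly where those options get set; the flag names and values below are taken as reported in this thread, and whether they exist depends on the installed JPype version.

```python
# Sketch only: flag names and values are taken from the comment above and
# assumed to be available in the installed JPype version.
import jpype
import jpype.config

jpype.config.free_resources = True
jpype.config.free_jvm = True

jpype.startJVM()
# ... run the workload that normally core dumps at interpreter exit ...
```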
Now that I have an editable build of the above docker image, I went ahead and built with trace enabled. Here is one successful run (of many) and one failing run:
In looking through this output, I wanted to make sure that I wasn't being tricked by any kind of stdout buffering, so I added stdbuf -i0 -o0 -e0 to the start of the process... and wasn't able to reproduce the error. Is this another clue perhaps? I have subsequently found that piping stdout to stderr also seems to prevent the core dump from occurring.
Given the problem is related to logging in Java, I wonder if this is something to do with references to the stdout/stderr streams somehow.
The log indicates the failure is currently in a portion of the process that is not instrumented by the trace command. There is a compiler switch which turns on tracing of the memory accesses, but before going down that road I would first try to catch it in a gdb log. There is a pattern in the Azure scripts for catching a backtrace automatically, which I use when a segfault happens only in the CI and not locally; it may be useful here so you don't need to run it manually just to catch the occasional segfault.
The command line for automatic stack traces is in .azure/scripts/debug.yml
Thanks, I tried it, but since the reproducer is sensitive to stream redirection, and gdb is going to capture that stream in order to present it inside gdb, I couldn't reproduce the problem in gdb.
FWIW, I tried turning on SIGSEGV printing and I can regularly see them being fired on the non-main thread. I've no idea if that is normal, but these ones don't seem to crash Python itself.
I therefore went rummaging through the log, and went to read about the DestroyJavaVM call that JPype makes. That documentation clearly states that "Unloading of the VM is not supported." When I disable the unloading (the step which clears the VM pointer and calls dlclose), the error no longer occurs. Are we basically doing something we shouldn't be doing here?
Could you let me know what the memory-access compiler switch is, and I will see if that can point us any further toward what is happening.
Well, if the crash is a result of the unload then it is certainly something we can disable. The unload is definitely before my time and may in fact be incorrect; I am guessing that it was just there to free up memory. The original code just unloaded the JVM and could in fact reload it a second time. This caused the zombie objects from the first JVM load to try to resurrect themselves, which was a sure-fire crash. I added logic to prevent such silliness.
In general, unloading a library is suspect. There is always memory someplace that may still be referenced, or sitting on the stack of some inactive thread. Since Java doesn't have any way of stopping daemon threads, there is certainly the possibility that we hit a segfault as the rug gets pulled out from under us.
Though this goes more towards my belief that the user-issued shutdown command is pretty pointless in JPype. The idea of completely stopping the JVM is impossible; the best we can do is tell Java we plan to exit so it can perform cleanup. If that is the case, then the code that kills all the C++ wrappers is likely pointless: we are just going to exit anyway, and collecting all those pointers to be deleted is just wasted effort.
As far as the non-main thread goes, the shutdown command should only be respected from within the Python main thread (or actually just the thread that started JPype, as we assume it is the main one). But I assume that is the case, and the crash is likely coming from some daemon thread which suddenly finds the floor has fallen out from under it thanks to the unload. If so, there is no real reason for further debugging; the correct solution is to make free_jvm false by default, and maybe even just delete that option.
This would be a reasonable explanation of the fault. The logger is a daemon thread in Java; if you have lots of modules logging, the logger will end up backed up with messages to write out. When the library gets unloaded, the logging thread may still be in the runnable state rather than waiting. The unload deletes the memory that is being accessed and the method crashes, which should hit the Java page-fault routine, but as we are unloading, the page-fault routine is likely being unloaded as well. And assuming DestroyJavaVM didn't uninstall those hooks, the fault handlers may also be left in a bad state. Thus nothing is there to intercept the fault, so we crash.
For reference, should you want to do further debugging, memory tracing is enabled by bit 2 of the _jpype.trace() command:
```cpp
void JPypeTracer::tracePythonObject(const char* msg, PyObject* ref)
{
    if ((_PyJPModule_trace & 2) == 0)
        return;
    ...
```
These are internal debugging flags, so they tend to change whenever more information is required. As I got tired of recompiling or reading massive logs, I made it so that you can enable a trace for only a relevant section of code.
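As a usage sketch only: the call signature below is an assumption based on the C++ snippet above (a bitmask argument), and it only does anything in a build compiled with tracing enabled.

```python
# Assumed usage of the internal tracing hook; the signature is inferred from
# the C++ snippet above, not documented API.
import jpype
import _jpype

jpype.startJVM()
_jpype.trace(2)   # bit 2: trace Python object ref/unref
# ... exercise only the section of code under suspicion ...
_jpype.trace(0)   # disable tracing again
```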
Looking through the source I see the following flags available
I've been investigating a spurious segmentation fault which happens after pytest completes and the Python shutdown process begins. The core dump is always happening inside jpype._core._JTerminate. I'm seeing about 1 in 20 cases failing, so I assume there is a race condition on Python+Java exit. I don't have a full debug build, so I can't provide a full stack trace presently, but I wanted to document the existence of the problem. It has been a challenging one to strip down to a simple reproducer case, but the following will do it... it is all encapsulated in a gist at https://gist.github.com/pelson/2bf4c0d93e4492ddeaf7ee5596f14ba3. The code itself:
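(The exact reproducer is the code in the gist above; the following is only an illustrative sketch of its shape, with a hypothetical JAR directory and logger name.)

```python
# Illustrative sketch of the reproducer described above (the real code is in
# the gist); paths and names are hypothetical.
import glob
import jpype

# Start the JVM with the handful of log4j JARs on the classpath.
jpype.startJVM(classpath=glob.glob("jars/*.jar"))

LogManager = jpype.JClass("org.apache.logging.log4j.LogManager")
log = LogManager.getLogger("repro")
for i in range(1000):
    log.info("message {}".format(i))

# No explicit jpype.shutdownJVM(): the crash happens intermittently in the
# atexit hook (jpype._core._JTerminate) as the interpreter exits.
```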
The JARs are limited to 4 log4j packages. I didn't try to whittle it down further, because I found that the fewer JARs there are, the less likely it is to segfault at exit (a clue?).
The gist could also be cloned and run in docker:
This will run the test case around 100 times, and fail as soon as a non-zero exit code is encountered (core dump).
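The loop itself is presumably a shell one-liner in the gist; an equivalent sketch in Python (the script name repro.py is hypothetical) would be:

```python
# Sketch of the "repeat until non-zero exit" loop described above.
import subprocess
import sys

for attempt in range(100):
    result = subprocess.run([sys.executable, "repro.py"])
    if result.returncode != 0:
        print(f"attempt {attempt}: exit code {result.returncode} (core dump?)")
        sys.exit(1)
print("no failure in 100 runs")
```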
I'd be really happy to track this down and solve it, as in a real-life situation (300+ JARs) I was able to trigger the core dump more often than not. I don't yet have the tools to track this down further - I could do with setting up debug builds of JPype, OpenJDK, and CPython. Indeed, perhaps such a tool would be something good to have in a docker image... :thinking:
Any input on this would be very gratefully received.