jpype-project / jpype

JPype is a cross-language bridge that gives Python programs full access to Java class libraries.
http://www.jpype.org
Apache License 2.0
1.12k stars, 183 forks

Handle assertions during JVM startup #1170

Open FabienSe opened 10 months ago

FabienSe commented 10 months ago

Hello all.

I have a Python application that was working fine, but as of today the startJVM call crashes :/ I did not change anything in my application, and I do not see any logs.

Looking at the Windows logs, I can see that the python.exe process is crashing in ntdll.dll (version 10.0.19041.3636), located at C:\WINDOWS\SYSTEM32\ntdll.dll. After debugging, I can see that the crash happens in my startJVM call.

Here is the call I perform:

startJVM(str(jvm_path), "-ea", f"-Djava.library.path={resources_path}", classpath=classpaths, interrupt=False)

I need to start the JVM with a custom jar loaded.

I tried calling startJVM with no parameters, and that seems to work.

I have no idea how to fix this problem, and it is crazy that it crashes now without any change to my code.

Hopefully you can help me with my problem.

Thanks a lot for your help.

Thrameos commented 10 months ago

At best I can make some guesses. There must be some change in the jvm_path or in the DLL located at that path. Perhaps something in the script is now mixing architectures (32-bit vs 64-bit). I would try to break it down step by step, from the call that works to the one that doesn’t.

Alternatively it may be something that the JVM is loading indirectly, like something in the resources path. Look for anything in those paths that changed in that time. Were it my site, where the system admin runs nightly “security patches”, I would assume that the admin pulled the rug out from under me by changing the JDK or libraries.

Beyond these general tips I don’t think the JPype developers can be of much help, as it is clearly a site install issue.
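The step-by-step breakdown suggested above can be sketched as a small script. Because the failure kills the Python process outright, each attempt runs in a fresh interpreter and only the exit code is inspected. This is a minimal sketch, not part of JPype: the helper names `startjvm_snippet`, `probe`, and `bisect_options` are hypothetical, and it assumes every option passed is a JVM flag beginning with `-`.

```python
import subprocess
import sys


def startjvm_snippet(jvm_args):
    """Return Python source that starts a JVM with the given option strings."""
    # repr() of the list keeps quoting and Windows backslashes intact.
    return f"import jpype; jpype.startJVM(*{list(jvm_args)!r})"


def probe(snippet, timeout=120):
    """Run a snippet in a fresh interpreter and return its exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode


def bisect_options(options):
    """Start the JVM with one more option each time and collect exit codes."""
    results = []
    for count in range(len(options) + 1):
        subset = list(options[:count])
        results.append((subset, probe(startjvm_snippet(subset))))
    return results
```

Calling `bisect_options(["-ea"])` would first try a bare `startJVM()` and then `startJVM("-ea")`. The first subset with a nonzero exit code points at the option that triggers the crash.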

FabienSe commented 10 months ago

Thanks a lot for your help! Following your directions, I tried a lot of things, but the behavior is still very strange to me.

startJVM() works but startJVM("-ea") does not. So I assume the crash depends on whether or not I pass arguments to the JVM. It seems strange to me that site rules could affect this behavior.

Have you seen something similar before?

Thrameos commented 10 months ago

No. Though it is not very surprising that “-ea” would cause a problem. That enables assertions so that means something either in JPype, Java, or the jars loaded has an assert statement which is failing. Not sure how to find the failed assertion unless we can get some logs.

Can you try using the gdb instructions to capture a log of the failure? Either the traceback log or the crash dump from the JVM should show you the code that made the assert.
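One way to narrow down where the failing assert lives, short of a debugger log, is to enable assertions selectively. The JVM accepts `-esa` for system classes and `-ea:<package>...` for a single package and its subpackages. The sketch below only builds the option lists to try one at a time. The helper name is hypothetical, and the package names in the usage note are assumptions to be replaced with those of the jars actually loaded.

```python
def assertion_flag_sets(packages):
    """Build JVM option lists that enable assertions in one place at a time."""
    # "-esa" enables assertions in JDK system classes only.
    flag_sets = [["-esa"]]
    for package in packages:
        # The trailing "..." is literal JVM syntax: it selects the package
        # and all of its subpackages.
        flag_sets.append([f"-ea:{package}..."])
    return flag_sets
```

For example, `assertion_flag_sets(["org.jpype", "com.example"])` yields three option lists. Starting the JVM with each list in turn should show whether the failing assertion sits in the JDK, in JPype's own support classes (assumed here to live under `org.jpype`), or in an application jar.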

Thrameos commented 10 months ago

Looking at the documentation, it appears that a failed assertion throws an exception. If that happens while the JVM is still getting going, it most likely sends us to a flaming death. I think we should just open an issue to handle “java.lang.Error” during startup.

FabienSe commented 10 months ago

Hello.

I used WinDbg to analyze a dump file of the crash and got the following information.

ntdll!RtlReportFatalFailure+0x9:
00007ffa`f12cf349 eb00            jmp     ntdll!RtlReportFatalFailure+0xb (00007ffa`f12cf34b)
Resetting default scope

EXCEPTION_RECORD:  (.exr -1)
ExceptionAddress: 00007ffaf12cf349 (ntdll!RtlReportFatalFailure+0x0000000000000009)
   ExceptionCode: c0000374
  ExceptionFlags: 00000001
NumberParameters: 1
   Parameter[0]: 00007ffaf13397f0

PROCESS_NAME:  python.exe
ERROR_CODE: (NTSTATUS) 0xc0000374 - A heap has been corrupted.
EXCEPTION_CODE_STR:  c0000374
EXCEPTION_PARAMETER1:  00007ffaf13397f0
ADDITIONAL_DEBUG_TEXT:  Followup set based on attribute [Heap_Error_Type] from Frame:[0] on thread:[PSEUDO_THREAD] ; Followup set based on attribute [Is_ChosenCrashFollowupThread] from Frame:[0] on thread:[PSEUDO_THREAD]
FAULTING_THREAD:  ffffffff
STACK_TEXT:  
00000000`00000000 00000000`00000000 ntdll!RtlpFreeHeapInternal+0x0
STACK_COMMAND:  !heap ; ** Pseudo Context ** ManagedPseudo ** Value: ffffffff ** ; kb
SYMBOL_NAME:  ntdll!RtlpFreeHeapInternal+0
MODULE_NAME: ntdll
IMAGE_NAME:  ntdll.dll
FAILURE_BUCKET_ID:  HEAP_CORRUPTION_ACTIONABLE_BlockNotBusy_DOUBLE_FREE_c0000374_ntdll.dll!RtlpFreeHeapInternal
OS_VERSION:  10.0.19041.1
BUILDLAB_STR:  vb_release
OSPLATFORM_TYPE:  x64
OSNAME:  Windows 10
IMAGE_VERSION:  10.0.19041.3636
FAILURE_ID_HASH:  {f9e860eb-b03f-7415-804c-7e671e26c730}
Followup:     MachineOwner

Seems like a memory corruption.
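The same status code can be recognized without a debugger when the crashing call is run in a child process, since on Windows the process exit code carries the NTSTATUS value. The following is a minimal sketch with hypothetical helper names. It assumes the exit code may arrive either as an unsigned or as a signed 32-bit integer, depending on the tool that reports it.

```python
STATUS_HEAP_CORRUPTION = 0xC0000374  # the ExceptionCode in the dump above


def as_ntstatus(returncode):
    """Normalize a process exit code to an unsigned 32-bit NTSTATUS value."""
    return returncode & 0xFFFFFFFF


def describe_exit(returncode):
    """Format an exit code as hex and flag heap corruption when it matches."""
    code = as_ntstatus(returncode)
    if code == STATUS_HEAP_CORRUPTION:
        return f"0x{code:08X} (heap corruption)"
    return f"0x{code:08X}"
```

Both `describe_exit(3221226356)` and `describe_exit(-1073740940)` map to the `c0000374` code reported by WinDbg.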

I also ran tests on different computers. It always crashes on the 6 corporate computers I tried, but it does not crash on computers that are not managed by my organisation. As you said in your previous message, it is probably linked to the system admins. Now the difficult part will be finding out what they changed and how to fix it 😢

FabienSe commented 10 months ago

> Looking at the documentation, it appears that a failed assertion throws an exception. If that happens while the JVM is still getting going, it most likely sends us to a flaming death. I think we should just open an issue to handle “java.lang.Error” during startup.

That could be a good improvement. Do you want me to create another issue?

Thrameos commented 10 months ago

We can use this one for now. It just needs to be renamed to the required task, something like “Handle assertions during JVM startup”.

The issue is that the assertion is failing somewhere, and it is likely bubbling up at a point when the system is vulnerable because we don’t yet have all the resources in place to handle it.

The solution will be to deliberately add a bad assertion at different points in the JPype support jar file. We then force the assertion to fail, which will reproduce the crash, and redirect the crash to a normal Python exception so the user can see it. That doesn’t mean JPype will run, as the best we can do is provide diagnostics. This will need to be repeated at different places in the startup code: we don’t know where it is currently happening, so we would have to guard all points.