jython / jython

Python for the Java Platform
https://www.jython.org
Other
1.14k stars 185 forks source link

Deadlocks in jython due to imports #321

Open Mo7mad-l3 opened 3 months ago

Mo7mad-l3 commented 3 months ago

Hello team.

We are encountering an issue in Jython 2.7.3 where it causes a deadlock due to static reference to the pySystemState.

We have a system that uses Jython in a multi-threaded manner, in some cases, the import process runs to import libraries such as codec, however, instead of using the pySystemState created for this thread, Jython is using another reference to the pySystemState to perform the import.

The issue is similar to https://bugs.jython.org/issue2536 and https://hg.python.org/jython/rev/2b06bc95594d, we've performed both suggested workaround of setting the XSS and updating the Jython Jar to 2.7.3 and the issue still occurs.

The method org.python.core.Py.getSystemState() is static and the method org.python.core.PySystemState.getCodecState() is synchronized.

The threads look like this, this thread is stuck in the getCodecState via pySystemState address 0x550e88b60, however, the pySystemState given to this thread is 0x54ed2e968

"https-openssl-nio-443-exec-245" - Thread t@1792157 java.lang.Thread.State: WAITING at java.base@11.0.7/jdk.internal.misc.Unsafe.park(Native Method)

The other thread is below, in this thread, Jython is attempting to synchronize the pySystemState that was locked by the first thread.

"Timer-46" - Thread t@1884806 java.lang.Thread.State: BLOCKED at org.python.core.JyAttribute.getAttr(JyAttribute.java:202)

Stewori commented 3 months ago

Do you have reason to assume that the JVM recovers from a StackOverflowError or an OutOfMemory error as explained in https://bugs.jython.org/issue2536 ? In that case, the initial issue must be avoided and analyzed. The JVM should not be recovered from errors, only from exceptions. If an error occurs one should only catch it for analysis and logging, not to keep the JVM running with further workload. Anyway, if no previous error recovery from StackOverflowError or OutOfMemory has occurred, the issue might be different from bjo2536. Before any further analysis is attempted, that case needs to be decided, or risk is high to waste one's time.

In any case, it would be good if code for reproducing this issue could be shared. Also, please try this with current master. There have been deadlock and race condition related fixes. An issue like this can be so complex and difficult to sort out that under all circumstances, any further investigation should take place using the master version only. One has to rule out the possibility that it might have already been fixed before investing work in this.

Mo7mad-l3 commented 3 months ago

Hello Stefan.

Thank you for your reply,

We do not have any code to recover from an OOM or a stackOverflow.

The issue we are encountering is with the py.SystemState stores a static reference as a default system state that is used for the import.

Unfortunately, this issue is intermittent and I’ve tried to create multiple versions of a tester class to reproduce this issue, but was not able to reproduce it on demand.

The issue is occurring in our production environment, and in some deployments, it’s causing multiple daily restarts, some of which have hundreds of users.

The deadlock occurs when one thread uses the codec class and obtains the synchronization lock on the py.SystemState and another thread tries to use the py.System state to import another library, the deadlocked threads are listed in my original comment.

Kind regards.

Stewori commented 3 months ago

We do not have any code to recover from an OOM or a stackOverflow.

Then this is not related to bjo2536. A codec-related deadlock issue has been fixed in https://github.com/jython/jython/pull/243 but not yet released. As I already pointed out, you need to check if the issue can be reproduced on master. Next release is upcoming in Q2, but no date set.

jeff5 commented 3 months ago

A codec-related deadlock issue has been fixed in #243 but not yet released. As I already pointed out, you need to check if the issue can be reproduced on master. Next release is upcoming in Q2, but no date set.

Thanks for correlating this with that PR @Stewori . So we have reason to hope that 2.7.4 (of which I'm trying to build a beta right now) will address this.

jeff5 commented 2 months ago

@Mo7mad-l3 : The beta is up on Sonatype now, so if you have the freedom to try that in your environment, we could learn whether this is still a problem or likely to be fixed already in 2.7.4.

Mo7mad-l3 commented 2 months ago

Hello everyone, thank you for all the details, I checked the other bug, and the stack traces are still different, still working on a tester class that could reproduce the issue.

In the meantime, I'll see about using the beta in our test environment and see if we encounter any side effects..

Thank you all for all your efforts.