ninia / jep

Embed Python in Java

(web server) Crash on getValue call on threads which are re-used #404

Open luxel opened 2 years ago

luxel commented 2 years ago

Describe the problem

We have built a web server with Spring Boot that serves an API over HTTP. Some requests use JEP and the rest do not. We use a ThreadLocal variable to hold a SharedInterpreter for each thread, and we never close those instances.

private static final ThreadLocal<Interpreter> interpreterThreadLocal = ThreadLocal.withInitial(()->{
        Interpreter interp = new SharedInterpreter();
        interp.exec("import numpy");
        interp.exec("import pwv");  // this module is importing other modules like pandas/astropy/neurokit2...
        return interp;
    });

We have found a strange crash behavior:

If a thread (say "XNIO-1 task-5") was first created to serve an HTTP request that does not involve JEP (so the SharedInterpreter in the ThreadLocal is not initialized), and the same thread is later re-used for a request that triggers initialization of the SharedInterpreter, it crashes when we invoke certain Python methods (not all of them) and fetch the return value.

If a thread (say "XNIO-1 task-6") was first created to serve a request that involves JEP (so the SharedInterpreter is initialized immediately), everything works. When the same thread is used a second or third time, it still works as expected.

Our temporary workaround: we created a filter that intercepts every request and ensures the SharedInterpreter ThreadLocal is initialized on each thread from its very first request.

public class ThreadInitFilter extends OncePerRequestFilter {
    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response, FilterChain filterChain) throws ServletException, IOException {
        AlgorithmPy.getAlgorithmPyInstance().initOnThread();
        filterChain.doFilter(request, response);
    }
}

public class AlgorithmPy {
    private static final ThreadLocal<Interpreter> interpreterThreadLocal = ThreadLocal.withInitial(() -> {
        Interpreter interp = new SharedInterpreter();
        interp.exec("import numpy");
        interp.exec("import pwv");  // this module is importing other modules like pandas/astropy/neurokit2...
        return interp;
    });
    private static final AlgorithmPy instance = new AlgorithmPy();

    // Accessor called by ThreadInitFilter
    public static AlgorithmPy getAlgorithmPyInstance() {
        return instance;
    }

    // Forces the ThreadLocal initializer to run on the calling thread
    public synchronized void initOnThread() {
        interpreterThreadLocal.get();
    }
}
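
For readers following the workaround, the behavior it relies on is that ThreadLocal.withInitial is lazy: the initializer runs on a thread's first get(), not when the thread is created. The sketch below shows only that mechanism in plain Java. The class name and the String resource are hypothetical stand-ins for the interpreter; it does not use Jep and does not explain the crash itself.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

class LazyThreadLocalDemo {
    // Returns {initializer runs after an unrelated task, initializer runs after two uses}.
    static int[] demo() {
        AtomicInteger initCount = new AtomicInteger();
        // Hypothetical stand-in for the interpreter ThreadLocal in the post above.
        ThreadLocal<String> resource = ThreadLocal.withInitial(() -> {
            initCount.incrementAndGet();
            return "created-on-" + Thread.currentThread().getName();
        });
        // A single worker thread, so every task re-uses the same thread.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // A "request" that never touches the ThreadLocal: nothing is initialized.
            pool.submit(() -> { }).get();
            int afterUnrelated = initCount.get();
            // First use on the re-used thread triggers the initializer; the second re-uses the value.
            Runnable use = resource::get;
            pool.submit(use).get();
            pool.submit(use).get();
            return new int[] { afterUnrelated, initCount.get() };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println(counts[0] + " then " + counts[1]);
    }
}
```

Calling get() from a filter, as initOnThread() does, simply moves that first initialization to the thread's first request.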

Could anyone help us figure out what is happening behind the scenes, and whether there is a better solution?

Example crash log:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x00007ffeb4d54771, pid=6800, tid=6880
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.13+8 (11.0.13+8) (build 11.0.13+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.13+8 (11.0.13+8, mixed mode, tiered, compressed oops, g1 gc, windows-amd64)
# Problematic frame:
# C  0x00007ffeb4d54771
#
# No core dump will be written. Minidumps are not enabled by default on client versions of Windows
#
# If you would like to submit a bug report, please visit:
#   https://github.com/adoptium/adoptium-support/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
---------------  T H R E A D  ---------------

Current thread (0x000001e9fe0cc800):  JavaThread "XNIO-1 task-5" [_thread_in_native, id=6880, stack(0x0000005b2be00000,0x0000005b2bf00000)]

Stack: [0x0000005b2be00000,0x0000005b2bf00000],  sp=0x0000005b2bef6d50,  free space=987k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  0x00007ffeb4d54771

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  jep.Jep.getValue(JLjava/lang/String;Ljava/lang/Class;)Ljava/lang/Object;+0
j  jep.Jep.getValue(Ljava/lang/String;)Ljava/lang/Object;+12

ndjensen commented 2 years ago

I don't know if we'll be able to solve this or not. A couple of questions:

  1. Is your call to getValue(String) literally only returning a value from Python to Java, or does it have some computation in it? For example, interp.getValue("x") vs interp.getValue("calculateX()").
  2. Do you know the Python type of the object you are getting? What is it?

luxel commented 2 years ago

Thank you for the reply!

  1. It's doing some math computations; in fact, we're getting the return value of Python functions.
  2. The Python code is returning a map object.

Unfortunately we don't have the source code for the Python part. P.S. The Python libraries we use are .pyd libraries built with Cython.

bsteffensmeier commented 2 years ago

Unfortunately your use case looks quite complicated, and I have not seen any similar reports, so I cannot offer much guidance.

Your fix is currently the most perplexing part for me. I cannot think of anything that would change after handling a few requests in Java that would make Jep more likely to crash. Do your other requests involve other native libraries, or are they mostly Java? There are a few places where Jep interacts with the thread's class loader. These have never caused crashes in the past, but I wonder whether the other requests are affecting the class loader.

It might be helpful to split up your calls to Jep: try moving the math computations into an exec and storing the result in a variable that you then access with getValue. Most crashes are caused by third-party libraries running into an unexpected environment, and those would crash in the exec portion. But when getValue converts a map to Java there is a lot of Jep code executing, so if you could show whether the crash comes from the computation or from the Jep conversion, that might provide some insight into the problem.
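
A minimal sketch of that split, for illustration only. PyCalls is a hypothetical stand-in that mirrors the two Jep calls discussed in this thread, exec(String) and getValue(String), so the example compiles without Jep; the variable name _result and the log messages are also assumptions. Because a native crash ends the JVM rather than throwing, the idea is to log before and after each phase and see which message was written last.

```java
import java.util.function.Consumer;

class SplitCallDemo {
    // Hypothetical stand-in for the interpreter; a real interpreter would be adapted to it.
    interface PyCalls {
        void exec(String code);
        Object getValue(String name);
    }

    // Runs the computation in exec, then converts the stored result with getValue,
    // logging around each phase so the last message shows how far execution got.
    static Object computeThenConvert(PyCalls interp, String expression, Consumer<String> log) {
        log.accept("exec:start");
        interp.exec("_result = " + expression);   // third-party Python code runs here
        log.accept("exec:done");
        Object value = interp.getValue("_result"); // Python-to-Java conversion runs here
        log.accept("getValue:done");
        return value;
    }
}
```

In a real run, the log consumer could be something like System.err::println so each message is written out before the next phase starts. If "exec:done" never appears, the crash is in the computation; if it appears but "getValue:done" does not, the crash is in the conversion.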

If I understand correctly, you are never closing any interpreter? Do your threads ever complete, so that an interpreter becomes inaccessible? I'm not aware of any problems this would cause, but it would be an interesting state to be in.

Would it be possible for you to open, use, and then close a new SharedInterpreter for every request? Since sys.modules is shared between interpreters, each import after the first should be a simple dict lookup and not take noticeable time.

I apologize that most of my ideas are just fishing for information, but I do not have anything else to offer.
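
The open, use, close pattern can be sketched with try-with-resources. The helper below is hypothetical and generic over any AutoCloseable, so it runs without Jep. Using it with a SharedInterpreter assumes the interpreter type is AutoCloseable and that the constructor and calls throw no checked exceptions in the Jep version in use; otherwise the factory would need to wrap them.

```java
import java.util.function.Function;
import java.util.function.Supplier;

class PerRequestDemo {
    // Opens a fresh resource, runs the body with it, and always closes it
    // on the same thread before returning.
    static <R extends AutoCloseable, T> T withFresh(Supplier<R> factory, Function<R, T> body) {
        try (R resource = factory.get()) {
            return body.apply(resource);
        } catch (RuntimeException e) {
            throw e;
        } catch (Exception e) {
            // close() on AutoCloseable may declare a checked exception
            throw new IllegalStateException("failed to close resource", e);
        }
    }
}
```

Under those assumptions, a request handler might call something like withFresh(SharedInterpreter::new, interp -> { interp.exec("import pwv"); return interp.getValue("x"); }), where "x" is a placeholder variable name. Opening and closing inside one call keeps creation, use, and close on the same thread.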