ninia / jep

Embed Python in Java
Other
1.3k stars 147 forks source link

Question about SharedInterpreters, threads and class loaders #533

Closed svensk-pojke closed 3 months ago

svensk-pojke commented 3 months ago

Hello,

We are trying to switch from SubInterpreters to SharedInterpreters to get away from having shared modules. I have run into some problems and wanted to ask if you have seen or experienced anything like this.

My application uses two components (java classes) which are instantiated in a container (another java class). The container is of our own design and provides services to the components that are instantiated in it. To protect against name clashing we use custom class loaders for each of the components as well as the container. Each component runs in its own thread and instantiates a PythonInterpreter object, which basically wraps a jep.SharedInterpreter. The PythonInterpreter class is loaded using the same class loader as the container and uses a static initialization block to set the configuration on the SharedInterpreter class.

The problem I am having is that the 2nd invocation of SharedInterpreter crashes with an NPE (see stack trace, line numbers are for Jep 4.2):

[2024-05-24T21:08:29.016Z] (warning:::thread) TEST.CONTAINER: Cannot create or initialize new Python interpreter
                           jep.JepException: <class 'RuntimeError'>: java.lang.NullPointerException
                            at jep.Jep.set(Native Method)
                            at jep.Jep.set(Jep.java:380)
                            at jep.Jep.setupJavaImportHook(Jep.java:216)
                            at jep.Jep.configureInterpreter(Jep.java:199)
                            at jep.SharedInterpreter.configureInterpreter(SharedInterpreter.java:68)
                            at jep.Jep.<init>(Jep.java:170)
                            at jep.SharedInterpreter.<init>(SharedInterpreter.java:60)
                            at atst.cs.util.scripting.python.PythonInterpreter.<init>(PythonInterpreter.java:90)
                            at atst.cs.TestPyExecClass.lambda$startThread$0(TestPyExecClass.java:50)
                            at java.base/java.lang.Thread.run(Thread.java:829)
                            at java.base/java.lang.Thread.run(Thread.java:829)
                           Caused by: java.lang.NullPointerException

I have traced this all the way down from the SharedInterpreter call to the set call, and none of the arguments are null.

Interestingly, if I force the two components to use the same class loader as the container, then everything works fine. Also, it runs fine if I use SubInterpreters in place of the SharedInterpreters and each component uses its own class loader (with minor tweaks so each interpreter calls its own setConfig and I restore the shared modules list).

I don't think this is a bug -- we may be using Jep in a way that was not anticipated or intended. In any case, I am curious if you have any insight on this problem.

Many thanks for anything you can tell me. Erik

bsteffensmeier commented 3 months ago

I can't point to anything specific that is broken but I can confirm that this is not an intended use case. Most of the jep code is written with the assumption that the jep classes will only be loaded once in the JVM and I am not surprised it isn't working

One example of something I would expect to have problems is the MainInterpreter. Even with SubInterpreters the low level python c-api requires a single main interpreter and I am thinking with your use of class loaders you would end up with two. This calls python c-api functions that should only be called once so I was thinking it should be crashing or failing there but I looked at the code and noticed we are returning early if you get in the native code twice. This protects you from any errors but I would still expect the MainInterpreter isn't working right. This would probably only affect shared module imports and probably isn't causing your problems with SharedInterpreters.

Another potential problem would be that the native code caches frequently used classes. For the JVM provided classes I suspect the cache is valid for both classloaders but for the jep classes the cache would only reflect the classes from one classloader and an interpreter with the other class loader would likely have problems when native code isn't detecting specific classes. The only reason I suspect this may not be your problem is because the class cache is used the same way by both SubInterpreters and SharedInterpreters.

These are just a few examples I can think of where things could be broken by having multiple copies of the jep classes loaded in the same JVM. Assuming we could get it working, I do not see any benefit to running jep from multiple class loaders like this. Most of the configuration options in the JepConfig are copied into the sys module of the interpreter. At the native level python only allows a single sys module unless SubInterpreters are used so even if SharedInterpreters could be created from separate class loaders they would share configuration and there is no way around this without SubInterpreters which introduce the associated compatibility problems.

svensk-pojke commented 3 months ago

Thanks very much for the reply, Ben. There should be only a single main interpreter, because our PythonInterpreter wrapper code uses the container class loader so that the static setup of setConfig only happens once. But, it could very well be that I got something wrong with the class loading and it is indeed trying to instantiate two main interpreters. I will try debugging that aspect further. The idea was to do the class loading right so there is only a single main interpreter and each of the shared interpreters both use that. I'll keep digging. Thanks again for your help and the links to look at.

bsteffensmeier commented 3 months ago

You're stack trace shows you are in SharedInterpreter.configureInterpreter(). Since you mentioned your stack trace is from the second interpreter you must have loaded that class from a second class loader because otherwise the checks on the initialized variable would make it impossible to get into that code a second time.

svensk-pojke commented 3 months ago

Wow, sorry I missed that. It is indeed calling super.configureInterpreter(config) at Jep.java line 60. The stack trace is from the second interpreter, so something is definitely not working properly. I will track this down and report back what I am able to figure out. Thanks again for your help!

svensk-pojke commented 3 months ago

OK, I have it figured out. Turns out there is a race condition. I found it by inserting a print statement in Jep.configure to see what class loader is being used there:

    @Override
    protected void configureInterpreter(JepConfig config) throws JepException {
        System.out.println("From configureInterpreter, Classloader is " + this.getClass().getClassLoader().toString());
        synchronized (SharedInterpreter.class) {
            if (!initialized) {
                super.configureInterpreter(config);
                initialized = true;
            }
        }
    }

With that print statement in place, the code works every time. I also inserted log statements in other critical places to check which classloader was in use and everything is correct (unfortunately the way the the components are deployed and started will not work with a debugger, so print or log statements are my only option). If I remove the print statement from configureInterpreter() then the code fails repeatably. That's when I realized this is a race condition.

The solution is to use a synchronized block inside the constructor of my PythonInterpreter wrapper:

            synchronized(PythonInterpreter.class) {
                shell = new SharedInterpreter();
            }

With this in place, the code works fine every time. So, the interpreters are instantiated using the same class loader as desired, and the static block that calls configureInterpreter() uses the same class loader as well. That means only a single MainInterpreter and two SharedInterpreters that are instantiated from that.

Now I am going to test the heck out of it.

Many thanks for your help.

svensk-pojke commented 3 months ago

I am closing this ticket, as everything has been resolved. Thanks for your help, Ben!