jpype-project / jpype

JPype is a cross-language bridge to allow Python programs full access to Java class libraries.
http://www.jpype.org
Apache License 2.0

Can JPype work well in multiprocess? #1024

Open ouerum opened 2 years ago

ouerum commented 2 years ago

Hi, I want to use JPype to call a Java library in my Python project. The project has three Python modules, as follows:

  1. utilities.py, which calls the function developed in Java:

    import jpype.imports
    from jpype.types import *

    class Utilities:
        def call_util(self, data):
            from java.lang import System
            System.out.println(data)

    utilities = Utilities()

  2. sub_process.py, which calls the singleton instance created in utilities.py:

    from utilities import utilities

    class subProc():
        def __init__(self, data):
            self.data = data

        def test(self):
            utilities.call_util(self.data)

  3. main.py, the entry point of the whole program, which runs multiple processes from sub_process.py:

    import jpype
    from multiprocessing import Process
    from sub_process import subProc

    if __name__ == "__main__":
        pros = []
        jvm_path = jpype.getDefaultJVMPath()
        jpype.startJVM(jvm_path, classpath=['lib/*'])
        for i in range(5):
            sub_proc = subProc(str(i))
            pros.append(sub_proc)
        for pro in pros:
            diff_proc = Process(target=pro.test)
            diff_proc.start()

However, the program does not run as I expected because an exception occurs. It seems that JPype does not load packages correctly in the subprocess. I have no idea how to fix this. Could it be caused by a mistake in how I use JPype?

/usr/lib/jvm/java-8-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Process Process-1:
Traceback (most recent call last):
  File "org.jpype.JPypeContext.java", line -1, in org.jpype.JPypeContext.getPackage
  File "org.jpype.pkg.JPypePackage.java", line -1, in org.jpype.pkg.JPypePackage.
  File "org.jpype.pkg.JPypePackageManager.java", line -1, in org.jpype.pkg.JPypePackageManager.getContentMap
  File "org.jpype.pkg.JPypePackageManager.java", line -1, in org.jpype.pkg.JPypePackageManager.getBaseContents
  File "org.jpype.pkg.JPypePackageManager.java", line -1, in org.jpype.pkg.JPypePackageManager.collectContents
  File "Files.java", line 2192, in java.nio.file.Files.isDirectory
  File "Files.java", line 1737, in java.nio.file.Files.readAttributes
  File "ZipFileSystemProvider.java", line 294, in com.sun.nio.zipfs.ZipFileSystemProvider.readAttributes
  File "ZipPath.java", line 723, in com.sun.nio.zipfs.ZipPath.getAttributes
  File "ZipFileSystem.java", line 325, in com.sun.nio.zipfs.ZipFileSystem.getFileAttributes
  File "ZipFileSystem.java", line 1375, in com.sun.nio.zipfs.ZipFileSystem.getEntry0
  File "ZipFileSystem.java", line 1927, in com.sun.nio.zipfs.ZipFileSystem$Entry.readCEN
  File "ZipFileSystem.java", line 1940, in com.sun.nio.zipfs.ZipFileSystem$Entry.cen
  File "ZipUtils.java", line 122, in com.sun.nio.zipfs.ZipUtils.dosToJavaTime
  File "Date.java", line 254, in java.util.Date.
  File "Gregorian.java", line 37, in sun.util.calendar.Gregorian.newCalendarDate
  File "Gregorian.java", line 85, in sun.util.calendar.Gregorian.newCalendarDate
Exception: Java Exception

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/page-monitor/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/page-monitor/lib/python3.9/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/diff-finder/sub_process.py", line 7, in test
    sub_utilities.call_util(self.data)
  File "/home/ubuntu/diff-finder/sub_utilities.py", line 10, in call_util
    from java.lang import System
  File "", line 1055, in _handle_fromlist
java.lang.java.lang.NoClassDefFoundError: java.lang.NoClassDefFoundError: sun/util/calendar/CalendarDate

ouerum commented 2 years ago

Please help, thanks a lot.

Thrameos commented 2 years ago

Generally speaking, Java does not work across multiple processes unless the JNI is started after the fork. If you start it before the fork, the JNI is often broken, resulting in random errors. I am not sure why this is the case. A fork should duplicate all resources, so it should make no difference, but that does not appear to be the case in practice. JPype, which calls Java through JNI, is subject to this limitation. I am not aware of any workaround.

ouerum commented 2 years ago

Got it, thanks a lot.

mara004 commented 2 years ago

I'm affected by this problem as well. Not being able to perform time-consuming tasks in parallel is a rather serious limitation. I understand the JVM may not be running before the subordinate processes are spawned, so I tried to call jpype.shutdownJVM() before the process pool is created and restart the JVM in the target function. However, this leads to an OSError stating that the JVM may not be re-initialised, and leaving this restart step out results in a "JVM not initialised" exception. Does this mean it is impossible to run jpype code in parallel?

Thrameos commented 2 years ago

Java may only be started once. So once it is started, you may not fork. However, you may start a new process by other methods. For example, see the code in pytest in which we create many Python instances running jpype to test for leaks. This limitation regarding forks comes from Java. I have no idea why a forked Java is somehow different, but it is a clear limitation of the JVM.

mara004 commented 2 years ago

Which test file(s) exactly are you referring to? Could you please outline the steps required to set up multiple processes that use one JVM?

Thrameos commented 2 years ago

The file subrun.py in test/jpypetest is an example in which many processes are running JVMs. However, this is not a single shared JVM, but rather just a spawn of many Python/JVM processes which can communicate through pipes, shared memory, and sockets.
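As a rough illustration of the "separate Python/JVM processes talking over pipes" idea, here is a minimal sketch that launches a brand-new interpreter and exchanges JSON over stdin/stdout. It is not the code in subrun.py. The names run_in_fresh_interpreter and JPYPE_WORKER are hypothetical, and the sketch assumes that nothing else in the worker writes to stdout.

```python
import json
import subprocess
import sys

# Hypothetical worker source. It runs in a brand-new interpreter, so it can
# start its own JVM even if the parent process already has one running.
# Assumption: nothing else in the worker writes to stdout.
JPYPE_WORKER = """
import json, sys
import jpype
jpype.startJVM()
request = json.load(sys.stdin)
result = str(jpype.java.lang.String(request["text"]).toUpperCase())
json.dump({"result": result}, sys.stdout)
"""


def run_in_fresh_interpreter(worker_source, request, timeout=60):
    """Run worker_source in a new Python process and return its JSON reply.

    The request is sent as JSON on the worker's stdin; the reply is read
    as JSON from its stdout. Only plain, JSON-serialisable data crosses
    the process boundary.
    """
    completed = subprocess.run(
        [sys.executable, "-c", worker_source],
        input=json.dumps(request),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the worker fails
    )
    return json.loads(completed.stdout)
```

With JPype and a JVM installed, run_in_fresh_interpreter(JPYPE_WORKER, {"text": "hello"}) would be the intended call. Each call pays the full interpreter and JVM start-up cost, so this suits coarse-grained tasks rather than many small calls.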

mara004 commented 2 years ago

I see... However, that code is quite complex and goes far beyond my usual multiprocessing.Pool usage.

Thrameos commented 2 years ago

Unfortunately, that was the only method I found that would work with multiprocessing. I am not sure whether it is possible to use the same method in a pool.

pelson commented 2 years ago

I see... However, that code is quite complex and goes far beyond my usual multiprocessing.Pool usage.

That is a completely reasonable comment, but I encourage you to provide the code that you tried so that somebody can help more easily.

I put together an example using concurrent.futures.ProcessPoolExecutor (which is probably preferable to multiprocessing.Pool, fwiw), and had no real problem with it:


import concurrent.futures
import os

import jpype as jp

n_procs = 5
executor = concurrent.futures.ProcessPoolExecutor(max_workers=n_procs)

def use_jvm():
    # We can only do this once per process (so we could add a guard such
    # as "jp.isJVMStarted()").
    jp.startJVM()

    s = jp.java.lang.String('Hello World!')
    return f'{s.toUpperCase()} (pid {os.getpid()})'

futures = []
for _ in range(n_procs):
    futures.append(executor.submit(use_jvm))

for future in concurrent.futures.as_completed(futures):
    print(future.result())

executor.shutdown()

Results in:

$ python multiproc_jpype.py 
HELLO WORLD! (pid 53892)
HELLO WORLD! (pid 53891)
HELLO WORLD! (pid 53890)
HELLO WORLD! (pid 53887)
HELLO WORLD! (pid 53893)

I recommend that this issue can be closed.

pelson commented 2 years ago

FWIW, you can start the JVM in the parent process after the process pool has been created (that is the detail of the fork that @Thrameos was referring to).

@Thrameos - for fun I tried seeing what would happen if we tried to create a process pool after the JVM had been started. The first thing to note is the global JPContext_global is copied with the fork, so I hacked the isStarted check. Got:

unknown: Fatal error in exception handling
unknown: Handling: java.lang.UnsatisfiedLinkError: Native Library /media/important/github/jpype/jpype/_jpype.cpython-39-x86_64-linux-gnu.so already loaded in another classloader

unknown: Type: 0
unknown: Inner Java: java.lang.NullPointerException

unknown: native/common/jp_javaframe.cpp check 215
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/media/important/github/jpype/jpype/env_py39/lib/python3.9/concurrent/futures/process.py", line 243, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/media/important/github/jpype/jpype/support/multiproc_jpype.py", line 9, in start_jvm
    jp.startJVM()
  File "/media/important/github/jpype/jpype/jpype/_core.py", line 218, in startJVM
    _jpype.startup(jvmpath, tuple(args),
RuntimeError: Fatal error occurred
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/important/github/jpype/jpype/support/multiproc_jpype.py", line 27, in <module>
    print(future.result())
  File "/media/important/github/jpype/jpype/env_py39/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/media/important/github/jpype/jpype/env_py39/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
RuntimeError: Fatal error occurred
unknown: Fatal error in exception handling
unknown: Handling: java.lang.UnsatisfiedLinkError: Native Library /media/important/github/jpype/jpype/_jpype.cpython-39-x86_64-linux-gnu.so already loaded in another classloader

unknown: Type: 0
unknown: Inner Java: java.lang.NullPointerException

unknown: native/common/jp_javaframe.cpp check 215
unknown: Fatal error in exception handling
unknown: Handling: java.lang.UnsatisfiedLinkError: Native Library /media/important/github/jpype/jpype/_jpype.cpython-39-x86_64-linux-gnu.so already loaded in another classloader

I then created a resetJPContext function and called it at the beginning of use_jvm. In the end, you reach the point where JNI simply fails when calling JNI_CreateJavaVM (the reason is at https://github.com/openjdk/jdk/blob/50d47de8358e2f22bf3a4a165d660c25ef6eacbc/src/hotspot/share/prims/jni.cpp#L3604).
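The observation that a process-wide "started" flag is copied by fork can be reproduced without JPype. The sketch below is a stand-in only: _started, start_once and flag_seen_by_forked_child are hypothetical names, not JPype internals, and the code assumes a POSIX platform where os.fork is available.

```python
import os

# Stand-in for a process-wide "JVM already started" flag (hypothetical;
# this is not JPype's actual JPContext_global).
_started = False


def start_once():
    """Mimic a runtime that may only be started once per process."""
    global _started
    if _started:
        raise OSError("already started")
    _started = True


def flag_seen_by_forked_child():
    """Fork, and report the value of _started as seen by the child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it received a copy of the parent's memory, flag included.
        os.close(read_fd)
        os.write(write_fd, b"1" if _started else b"0")
        os._exit(0)  # leave immediately without running parent cleanup
    # Parent: read the single byte the child wrote, then reap the child.
    os.close(write_fd)
    data = os.read(read_fd, 1)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data == b"1"
```

After start_once() has run in the parent, a forked child reports the flag as already set even though it never started anything itself. That mirrors why the isStarted check had to be hacked before the child could even attempt a second start.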

Conclusions:

mara004 commented 2 years ago

I recommend that this issue can be closed.

Please, before making such a statement, dive a bit deeper into the matter so you actually understand the impact of this limitation. The problem you ignore is what to do if you need the Java/Python bridge in the main process beforehand. In this case, re-initialising the JVM in the subprocess will result in OSError: JVM is already started. I found no reasonable way to bypass this (stopping the JVM before creating the process pool doesn't solve the problem).

(FYI, what I was trying to achieve is to render PDFs with Apache PdfBox, where we need to obtain page count before creating a process pool. However, it was so slow and over-complicated to get the result data from java to python anyway that I realised we are much better off with a subprocess interface instead of native bindings.)

pelson commented 2 years ago

Please, before making such a statement, dive a bit deeper into the matter so you actually understand the impact of this limitation.

Please see the comment above yours, in which I deeply investigated the possibility of starting the JVM and then subsequently making a process pool. It cannot be done, as a design limitation of JNI (I learned this in the process of actually trying to solve your problem). There is no (reasonable or not) way to bypass this limitation in your own code, or from JPype itself.

I found no reasonable way to bypass this

Whilst there is no way to start the JVM in the main process before making the process pool, it isn't clear why you couldn't have another process that solves this problem before making the process pool.

I extended the example:

import concurrent.futures
import os

import jpype as jp

def use_jvm():
    # We can only do this once per process (so we could add a guard such
    # as "jp.isJVMStarted()").
    jp.startJVM()

    s = jp.java.lang.String('Hello World!')
    return f'{s.toUpperCase()} (pid {os.getpid()})'

def rand_int():
    java = jp.JPackage('java')
    random = java.util.Random()
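    # nextInt(10) can return 0, which is not a valid max_workers value,
    # so shift the range to 1..10.
    return int(random.nextInt(10)) + 1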

class SubprocessJVM:
    def __init__(self):
        self._pool = concurrent.futures.ProcessPoolExecutor(max_workers=1)
        # Make sure the JVM is started in the process.
        self._pool.submit(jp.startJVM).result()

    def run(self, fn, *args, **kwargs):
        # Run a function in a process with a running JVM.
        # NOTE: The function must return a *Python* type, not a JPype object,
        # since the host process does not have a running JVM.
        return self._pool.submit(fn, *args, **kwargs).result()

jvm = SubprocessJVM()

n_procs = jvm.run(rand_int)
executor = concurrent.futures.ProcessPoolExecutor(max_workers=n_procs)

futures = []
for _ in range(n_procs):
    futures.append(executor.submit(use_jvm))

for future in concurrent.futures.as_completed(futures):
    print(future.result())

executor.shutdown()

The crux of it is in:

def rand_int():
    java = jp.JPackage('java')
    random = java.util.Random()
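    # nextInt(10) can return 0, which is not a valid max_workers value,
    # so shift the range to 1..10.
    return int(random.nextInt(10)) + 1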

class SubprocessJVM:
    def __init__(self):
        self._pool = concurrent.futures.ProcessPoolExecutor(max_workers=1)
        # Make sure the JVM is started in the process.
        self._pool.submit(jp.startJVM).result()

    def run(self, fn, *args, **kwargs):
        # Run a function in a process with a running JVM.
        # NOTE: The function must return a *Python* type, not a JPype object,
        # since the host process does not have a running JVM.
        return self._pool.submit(fn, *args, **kwargs).result()

jvm = SubprocessJVM()

n_procs = jvm.run(rand_int)

I recommend that this issue can be closed.

Given that there remains nothing technical that JPype can do to work around this limitation in JNI, I maintain this recommendation. The possible caveats are (a) to provide specific documentation on this, and (b) to provide helpers which allow a convenient "JVM in subprocess", like my SubprocessJVM class.

Thrameos commented 2 years ago

The key issues we face are that the JVM just can't be "fork"ed (nothing to do with JPype), and that communication between processes often uses pickle, which is not able to use a state.

Both of these limitations can be dealt with by using alternatives. By simply replacing "fork" with "spawn" and pickle with JPickler, it is possible to create multiple processes and communicate between them. However, the limitations then come from the implementation of the Python tools. Unfortunately, I am not an expert on this front. I don't know a lot about Python multiprocessing, and I generally use a different workflow when I need to do multiprocessing, so I just can't address the issues.

So, the basic questions: Is there any way to get the Python processing pool to use "spawn" rather than fork? If we can use spawn, there is no issue other than that each spawned copy needs to start its own JVM. Second, is it possible to replace the pickle instance with JPickler? If you can't, then communications are limited to Python objects.

For reference, I do a lot of multiprocessing using mixed languages. But for that I use ZeroMQ and Google protobufs (or occasionally Thrift). However, this is a much more involved task, as you must create client/server communications in which the processes communicate by message exchange. As each process is its own, I often end up with projects in which one client is Python, another is Python with Java, another is pure Java, and another is C#. As ZeroMQ and Google protocol buffers exist in all languages, you can freely implement in whatever tool is best for the job, but you have to fill out the communications stubs in each language you want to use, so it is nowhere near as convenient as a processing pool.

mara004 commented 2 years ago

Whilst there is no way to start the JVM in the main process before making the process pool, it isn't clear why you couldn't have another process that solves this problem before making the process pool.

The problem is that initialising a process and communicating with it takes time. It would be ridiculous to isolate every Java-Python interaction in an external process. This might be a tolerable workaround for a single task, but not if you wish to interface with many different functions of a library.

@pelson Instead of hastily closing this issue, it's better to take some time and consider possible improvements on the jpype side more thoroughly.

Thrameos commented 2 years ago

Well, the best improvement we could have would be the epypj reverse bridge, which would allow you to actually make use of Java multiprocessing rather than only the Python one. Currently Python's multiprocessing is crippled because of the JVM limitations, but if Java can spin up a Python instance it becomes way easier, as you would create a Java process pool running epypj instances which could run mixed Python/Java code. Right now, if you try to do that, you get Java-only pool clients.

Unfortunately, as I have been discussing for two years now, I need a group of two to three interested developers to help write the test bench (I can handle the core, but checking every Java to Python interaction is just outside my current time budget). Just imagine using numpy and matplotlib from Java as if they were Java libraries.

pelson commented 2 years ago

Indeed, I came to the same consideration on my commute, and it does work to spawn instead of fork. In this case, there is no restriction on whether you have already started the JVM or not:

import concurrent.futures
import multiprocessing
import os

import jpype as jp

def use_jvm():
    if not jp.isJVMStarted():
        jp.startJVM()

    s = jp.java.lang.String('Hello World!')
    return f'{s.toUpperCase()} (pid {os.getpid()})'

def main():
    # Use the JVM for something before we even create a process pool.
    print(use_jvm())

    n_procs = 5
    # We may not use forking for creating new processes if the JVM is
    # required in the main process.
    ctx = multiprocessing.get_context('spawn')
    executor = concurrent.futures.ProcessPoolExecutor(max_workers=n_procs, mp_context=ctx)

    futures = []
    for _ in range(n_procs):
        futures.append(executor.submit(use_jvm))

    for future in concurrent.futures.as_completed(futures):
        print(future.result())

    executor.shutdown()

if __name__ == '__main__':
    main()

Note that I had to guard this inside an if __name__ == '__main__': block, as I was otherwise getting:

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.

Thrameos commented 2 years ago

So is that the solution we should put in the docs for how to use a pool? Or do we need to change the pickler to send java objects?