oracle / graalpython

GraalPy – A high-performance embeddable Python 3 runtime for Java
https://www.graalvm.org/python/

[multiprocessing] ModuleNotFoundError: No module named '_posixshmem' #180

Closed henryx closed 3 years ago

henryx commented 3 years ago

When I try to launch this script, it returns this error:

[enrico@tiberio test]$ graalpython mandelbrot.py 16000
P4
16000 16000
Traceback (most recent call last):
  File "/home/enrico/test/mandelbrot.py", line 74, in <module 'mandelbrot.py'>
    mandelbrot(int(argv[1]))
  File "/home/enrico/test/mandelbrot.py", line 70, in mandelbrot
    for row in rows:
  File "/home/enrico/test/mandelbrot.py", line 61, in compute_rows
    with Pool() as pool:
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/context.py", line 119, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild,
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/pool.py", line 212, in __init__
    self._repopulate_pool()
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/pool.py", line 303, in _repopulate_pool
    return self._repopulate_pool_static(self._ctx, self.Process,
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/pool.py", line 326, in _repopulate_pool_static
    w.start()
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/enrico/.local/bin/graalpython-20.3.0-linux-amd64/lib-python/3/multiprocessing/popen_spawn_posix.py", line 39, in _launch
    from . import resource_tracker
ModuleNotFoundError: No module named '_posixshmem'
[enrico@tiberio test]$ 
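The actual mandelbrot.py is not shown in the issue; a minimal sketch of the multiprocessing pattern that triggers the error above might look like the following, where `row_pixels` is a hypothetical stand-in for the per-row escape-time computation:

```python
# Minimal sketch of the Pool usage that failed on GraalPy 20.3.
# row_pixels is a hypothetical placeholder, not the real benchmark code.
from multiprocessing import Pool


def row_pixels(y):
    # Placeholder for the real Mandelbrot row computation.
    return y * y


def compute_rows(size):
    # Spawning the Pool workers is what failed on GraalPy 20.3: the
    # spawn path imports resource_tracker, which on POSIX needs the
    # missing _posixshmem builtin module.
    with Pool() as pool:
        return pool.map(row_pixels, range(size))


if __name__ == "__main__":
    print(compute_rows(4))
```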
timfel commented 3 years ago

Thank you for the report. Multiprocessing is currently not supported. We are planning to look at it some time next year, but I cannot promise that we will make it work then.

ytrezq commented 3 years ago

@timfel regarding threading, is it GIL-free like Jython, or will pure Python scientific code using objects that can't be serialized have to run single-threaded for weeks, as with plain CPython?

timfel commented 3 years ago

@ytrezq we will have a GIL like CPython

timfel commented 3 years ago

If we're really talking about pure Python code, then we run that much faster than CPython (like PyPy, for example). Additionally, since we allow embedding into Java programs and running multiple independent Python interpreters in the same JVM process, we will then be able to send objects between true multi-threaded Python contexts within the same process.

ytrezq commented 3 years ago

If we're really talking about pure Python code, then we run that much faster than CPython (like PyPy, for example). Additionally, since we allow embedding into Java programs and running multiple independent Python interpreters in the same JVM process, we will then be able to send objects between true multi-threaded Python contexts within the same process.

@timfel running faster on a single thread is good, but on a 256-core machine it can't match the boost offered by using all threads. If GraalVM doesn't even implement an option to drop the GIL when no C code is being called (which means not breaking NumPy), I will feel that Jython using JNI remains superior in terms of performance. Recent versions of CPython also offer to run several independent interpreters in the same process.

Also, the tendency in computer hardware these days is to trade per-thread performance for more cores, even with recent Intel Xeons.

But the problem with many scientific C libraries is that they will never get __reduce__() implemented, which means that even if they are called sequentially, nothing can be parallelized and the workloads have to run single-threaded for months, since the work can't be shared across multiple Python instances, whether they share the same process or not.
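As an illustration of the serialization problem being described: a wrapper around native state cannot be pickled unless the library implements __reduce__(). Here `NativeWrapper` is a hypothetical stand-in (a lock substitutes for a C-level handle), not a real library class:

```python
# Sketch: an object holding an unpicklable "native" resource fails to
# pickle, which is why it cannot be shipped to another process.
import pickle
import threading


class NativeWrapper:
    def __init__(self):
        # A lock stands in for a native handle; pickle has no idea how
        # to recreate it in another process, and no __reduce__ is defined.
        self._handle = threading.Lock()


try:
    pickle.dumps(NativeWrapper())
except TypeError as exc:
    print("pickling failed:", exc)
```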

timfel commented 3 years ago

Recent versions of CPython also offer to run several independent interpreters in the same process.

That's what we also offer.

But the problem with many scientific C libraries is that they will never get __reduce__() implemented, which means that even if they are called sequentially, nothing can be parallelized and the workloads have to run single-threaded for months, since the work can't be shared across multiple Python instances, whether they share the same process or not.

The link you give, and the reason why something would not implement __reduce__, is (as in that instance) almost always going to be native data structures - and as long as CPython has a GIL, no C extension for Python will be able to run without a GIL. The CPython C API is intrinsically incapable of supporting multithreaded execution. There have been discussions on the python-dev mailing list for years to alleviate this problem, but it's still there, and there's nothing we can do about that.

Taking a very long term view, as the GraalVM Python implementation matures, we will be able to release the GIL around much larger areas of Python code and use smaller locks than CPython (because many of our data structures are better suited for multi-threading), but as long as C extensions are written using the CPython C API, there is nothing we can do other than have a GIL.

ytrezq commented 3 years ago

But the problem with many scientific C libraries is that they will never get __reduce__() implemented, which means that even if they are called sequentially, nothing can be parallelized and the workloads have to run single-threaded for months, since the work can't be shared across multiple Python instances, whether they share the same process or not.

The link you give, and the reason why something would not implement __reduce__, is (as in that instance) almost always going to be native data structures - and as long as CPython has a GIL, no C extension for Python will be able to run without a GIL. The CPython C API is intrinsically incapable of supporting multithreaded execution. There have been discussions on the python-dev mailing list for years to alleviate this problem, but it's still there, and there's nothing we can do about that.

But in the same link, it's because the native data structures themselves don't serialize. The library itself, however, can be run in parallel from other languages like C++. Otherwise, what I'm proposing is to not have a GIL for Python code, like Jython (and thus, by activating it one thread at a time, only one C module can run at the same time).

As an alternative (I don't know how this could work with GraalVM), why not unleash multithreading performance through hardware-accelerated transactional memory, which is now supported by ARM and Intel?

timfel commented 3 years ago

Otherwise, what I'm proposing is to not have a GIL for Python code, like Jython (and thus, by activating it one thread at a time, only one C module can run at the same time).

Yes, this is something that we might do in the future if we find concrete use cases where it would actually help. Right now I am doubtful such exist. If you're running only one thread with e.g. scikit-learn stuff and all other threads don't even access any of the objects coming from any C extensions, then you could just use multiprocessing in the same process again, because you'd be dealing with one thread that does C API stuff and a lot of other threads that have perfectly serializable Python objects.

hardware accelerated transactional memory

The way to implement this would be orthogonal to what we're doing on the Python interpreter level. Also, both HTM and STM have been tried for Python with limited success - it turns out there are just very many conflicts (and thus rollbacks) in Python code, even when the user is aware of and trying to write for transactional execution.

ytrezq commented 3 years ago

Otherwise, what I'm proposing is to not have a GIL for Python code, like Jython (and thus, by activating it one thread at a time, only one C module can run at the same time).

Yes, this is something that we might do in the future if we find concrete use cases where it would actually help. Right now I am doubtful such exist. If you're running only one thread with e.g. scikit-learn stuff and all other threads don't even access any of the objects coming from any C extensions, then you could just use multiprocessing in the same process again, because you'd be dealing with one thread that does C API stuff and a lot of other threads that have perfectly serializable Python objects.

Simple! https://github.com/ConsenSys/mythril had GIL-bloated multithreading that didn't use z3 most of the time, but the threads needed to submit objects containing z3 constraints back to the dispatch thread.

timfel commented 3 years ago

Simple! https://github.com/ConsenSys/mythril had GIL-bloated multithreading that didn't use z3 most of the time, but the threads needed to submit objects containing z3 constraints back to the dispatch thread.

I don't know this library, but if the constraints are truly pure Python objects, you can serialize them and submit them from a different interpreter. If they are not, you still need to lock when handling them. And if the C code in the solver really can run thread-safe and in parallel without needing access to Python data structures after setting up the constraint network, it should just release the GIL before starting the constraint solving.

ytrezq commented 3 years ago

I don't know this library, but if the constraints are truly pure Python objects, you can serialize them and submit them from a different interpreter. If they are not, you still need to lock when handling them. And if the C code in the solver really can run thread-safe and in parallel without needing access to Python data structures after setting up the constraint network, it should just release the GIL before starting the constraint solving.

z3 constraints are z3 objects. Yes, locking is required for calling z3 because of the wrapper (if it's just reference passing, no custom C code is called). But most of the time the program executes pure Python code, so if the number of threads is low, it would remain profitable to have multithreading. That means the GIL is only required when running custom C code.

timfel commented 3 years ago

z3 constraints are z3 objects

How do these objects come into being? You need to construct them somehow, which means you need to run C extension code which means you have to lock.

ytrezq commented 3 years ago

How do these objects come into being? You need to construct them somehow, which means you need to run C extension code which means you have to lock.

Yes, but not always for reading them. And the time when the GIL has to be held is small enough to let many other threads keep running Python code (one task that could be parallelized takes about 20 seconds to complete).

timfel commented 3 years ago

We need to lock even for just reading from those objects. But what you're describing can be parallelized even on CPython. I see two issues:

a) that extension is calling out to Z3, which doesn't access any Python objects while solving the constraints, so the extension should release the GIL while the solver is solving constraints, and

b) if you have lots of threads doing work in Python to configure the constraints and these just need to exchange data from time to time, you should be creating pure-Python configuration objects, because these can be serialized and exchanged; then each multiprocessing Python interpreter can have its own Z3 solver thread, and it'll all run in parallel.
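The pure-Python-configuration pattern suggested in b) can be sketched as follows. `ConstraintConfig` and `solve()` are hypothetical stand-ins for the Z3-side objects and solver call, not real mythril or z3 APIs:

```python
# Sketch: threads build picklable pure-Python config objects, and each
# worker process turns its config into its own (native) solver.
from dataclasses import dataclass
from multiprocessing import Pool


@dataclass
class ConstraintConfig:
    # Pure-Python description of a constraint problem; unlike native
    # solver objects, this pickles cleanly and can cross processes.
    lhs: int
    rhs: int


def solve(config):
    # A real worker would construct a native solver from the config
    # here; plain arithmetic stands in for the solving step.
    return config.lhs + config.rhs


if __name__ == "__main__":
    configs = [ConstraintConfig(i, i + 1) for i in range(4)]
    with Pool() as pool:
        print(pool.map(solve, configs))
```

The key design point is that only the cheap, serializable descriptions cross the process boundary; the expensive native state is rebuilt locally in each worker.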

This is what I meant when I said:

I am doubtful such exist

I think most workloads are just not written well to parallelize on Python, but often the only reason is that not enough engineering went into the workload, and a GIL-less Python just hides away that lack.

ytrezq commented 3 years ago

@timfel for a), the problem is minor as it's only a fraction of the time being spent.

For b), the answer is that reading z3's Python objects goes through pointers into internal data structures which are never written to, but which can't be serialized. The only way forward is serialization-free parallel multithreading. This was acknowledged several times by the maintainers, and once again for the subinterpreters feature.

Also, I forgot that the objects are very large (several GB), so copying them is not a good idea performance-wise, in addition to requiring some additional TB of RAM.

timfel commented 3 years ago

for a), the problem is minor as it's only a fraction of the time being spent.

In the thread you linked, you yourself state that a large part of the time is spent in the Z3 optimizer code.

ytrezq commented 3 years ago

@timfel it was in the case of detecting overflows in computations using divides and/or multiplies, which makes z3 very slow. As that's the case with most Ethereum code, I recognize the feature would apply to a tiny but still useful number of Ethereum smart contracts.

tstupka commented 3 years ago

Basic multiprocessing support was recently merged into master; with that I was able to run the script from above.