kadler / cpython

The Python programming language
https://www.python.org/

Multiprocessing changes are incomplete/broken #4

Open kadler opened 4 years ago

kadler commented 4 years ago

PASE does not allow named semaphores to be shared across a process fork, but Python's multiprocessing code assumes this works. The fix is to use unnamed semaphores instead. The previous hack was not quite optimal, though, because:

- it caused false errors during cleanup
- it removed the SemLock `_rebuild` function, which also caused application errors, since applications assume it exists

One example: https://bitbucket.org/ibmi/opensource/issues/52/using-python-package-tpot-in-parallel-on

This slightly-better approach basically no-ops the cleanup and rebuild functions (neither is necessary with unnamed semaphores) and seems to work well.
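
For illustration, here's a minimal Python-level sketch of that approach. It's hypothetical: the real SemLock is implemented in C (Modules/_multiprocessing/semaphore.c), and this only mirrors the method names that Lib/multiprocessing/synchronize.py expects.

# Hypothetical Python-level sketch; the real SemLock lives in C.
class SemLock:
    @staticmethod
    def _cleanup(name):
        # Unnamed semaphores have no entry in the named-semaphore
        # namespace, so sem_unlink() would always fail with
        # FileNotFoundError; making cleanup a no-op avoids the
        # false errors.
        pass

    @classmethod
    def _rebuild(cls, handle, kind, maxvalue, name):
        # Keep _rebuild present so callers that expect it don't hit
        # AttributeError. With an unnamed semaphore inherited over
        # fork(), the handle is already valid in the child, so just
        # wrap it as-is instead of re-opening by name.
        self = cls.__new__(cls)
        self.handle, self.kind, self.maxvalue, self.name = (
            handle, kind, maxvalue, name)
        return self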

A simple sanity-check of multiprocessing support can be done with this code:

import os
from multiprocessing import Pool

def getpid(i):
    # Show which worker PID handled the input, then return its square
    print(os.getpid(), i)
    return i * i

if __name__ == '__main__':
    pool = Pool(4)
    v = pool.map_async(getpid, range(10))

    pool.close()
    pool.join()

    # Expected: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
    print(v.get())

But we need to do more testing with tpot and other packages.

I cannot find any applications with a hard requirement on named semaphores, but we may need to revisit this if one turns up. We could conceivably add code that tracks whether we're pre-fork or post-fork and implicitly re-attaches to the semaphore, rather than relying on fork() to do the right thing (which it doesn't on PASE). Such an implementation would be exposed to several race conditions, though, so I'd prefer to avoid it for now and stick with unnamed semaphores.
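
To make the race concern concrete, here's a rough sketch of that rejected idea. It is hypothetical: reopen_by_name() is a made-up stand-in for a sem_open() wrapper, and it leans on the internal multiprocessing.util.register_after_fork hook (which only fires for children started by multiprocessing, not raw os.fork()).

import multiprocessing.util as util

def reopen_by_name(name):
    # Made-up stand-in for a C-level sem_open() wrapper that returns
    # a fresh handle attached to the named semaphore.
    ...

class TrackedSemaphore:
    def __init__(self, name):
        self.name = name
        self.handle = reopen_by_name(name)
        # Ask multiprocessing to call _after_fork in any forked child.
        util.register_after_fork(self, TrackedSemaphore._after_fork)

    def _after_fork(self):
        # The race: between fork() returning and this hook running,
        # any code touching self.handle sees a dead handle. Threads
        # and signal handlers make that window hard to close.
        self.handle = reopen_by_name(self.name)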

kadler commented 4 years ago

So, I looked into the history of this code and found the changes under https://bugs.python.org/issue8713

Part of this change was to allow users on Unix-like operating systems to use spawn (fork + exec) instead of plain fork, to simulate the behavior on Windows.

Users can call multiprocessing.set_start_method to choose between fork, spawn (fork/exec), and forkserver behavior. See the multiprocessing documentation for more info on the differences between them.

Here's a new test:

import multiprocessing as mp
import sys

def foo(method, sem):
    print(method, sem.get_value())

if __name__ == '__main__':
    # Can be fork, spawn, or forkserver
    method = sys.argv[1] if len(sys.argv) > 1 else 'fork'

    mp.set_start_method(method)
    sem = mp.Semaphore(10)

    p = mp.Process(target=foo, args=(method, sem))
    p.start()
    p.join()
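
If the script is saved as, say, sem_test.py (name arbitrary), each start method can be exercised with python3 sem_test.py fork, python3 sem_test.py spawn, or python3 sem_test.py forkserver.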

When using spawn or forkserver, Python creates a resource tracker to clean up shared resources when the process ends. Part of this is calling sem_unlink on any semaphores created. Since we create unnamed semaphores, the sem_unlink call in SemLock._cleanup fails, so the semaphore never gets de-registered from the tracker; the tracker then tries to unlink it itself at shutdown, and that fails too. This is what you get:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
AttributeError: type object '_multiprocessing.SemLock' has no attribute '_rebuild'
Traceback (most recent call last):
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
    finalizer()
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/synchronize.py", line 87, in _cleanup
    sem_unlink(name)
FileNotFoundError: [Errno 2] No such file or directory
/QOpenSys/pkgs/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
/QOpenSys/pkgs/lib/python3.6/multiprocessing/semaphore_tracker.py:156: UserWarning: semaphore_tracker: '/mp-2cn07g9d': [Errno 2] No such file or directory
  warnings.warn('semaphore_tracker: %r: %s' % (name, e))
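
For reference, the flow producing this is roughly the following, a simplified sketch of the 3.6-era code in Lib/multiprocessing/synchronize.py and semaphore_tracker.py (renamed resource_tracker in later versions), not verbatim source:

from _multiprocessing import sem_unlink
from multiprocessing.semaphore_tracker import register, unregister

def _on_semlock_create(name):
    # Under spawn/forkserver the new semaphore's name is registered
    # with the tracker, which will sem_unlink() anything still
    # registered at interpreter shutdown (the "leaked semaphores"
    # warning above).
    register(name)

def _cleanup(name):
    # Normal teardown: unlink the named semaphore, then tell the
    # tracker to forget about it. With unnamed semaphores the
    # sem_unlink() raises FileNotFoundError, unregister() is never
    # reached, and the tracker later fails the same way at shutdown.
    sem_unlink(name)
    unregister(name)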

If we comment out the calls to sem_unlink, the tests still fail because `_rebuild` is missing:

Traceback (most recent call last):
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/forkserver.py", line 196, in main
    _serve_one(s, listener, alive_r, old_handlers)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/forkserver.py", line 231, in _serve_one
    code = spawn._main(child_r)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
  File "/QOpenSys/pkgs/lib/python3.6/multiprocessing/synchronize.py", line 110, in __setstate__
    self._semlock = _multiprocessing.SemLock._rebuild(*state)
AttributeError: type object '_multiprocessing.SemLock' has no attribute '_rebuild'

We can't work around this by commenting out the call to _rebuild, because it's needed to reconstruct the semaphore object after it has been serialized and deserialized across the channel.
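
Concretely, the round-trip looks roughly like this, paraphrased from 3.6's Lib/multiprocessing/synchronize.py (simplified; the Windows handle-duplication branch is omitted):

import _multiprocessing

class SemLock:
    def __getstate__(self):
        # Parent side: pickle the four-tuple of semaphore state.
        sl = self._semlock
        # On POSIX the "handle" is essentially the sem_t* as an
        # integer, which is why shipping it to an unrelated process
        # can never work by itself.
        return (sl.handle, sl.kind, sl.maxvalue, sl.name)

    def __setstate__(self, state):
        # Child side: this is the call that raises AttributeError
        # once _rebuild has been removed.
        self._semlock = _multiprocessing.SemLock._rebuild(*state)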

#3 tries to fix that by letting the _rebuild method exist but disabling the invalid sem_unlink code inside it, but I don't think this will actually work in practice: sending a sem_t* across a pipe is not going to yield a valid pointer on the other end.

I think we're stuck: unnamed semaphores can't be shared across processes except by being inherited through fork(), and named semaphores can't be inherited across forks on PASE... Perhaps the easier fix would be to apply the same _rebuild code in the fork case as well, but ultimately the problem lies in SLIC.