python / cpython

The Python programming language
https://www.python.org

Deadlock on using ProcessPoolExecutor in `fork` mp context #105464

Open kekekekule opened 1 year ago

kekekekule commented 1 year ago

Hi! I ran into an issue on Python 3.9.16 and 3.9.17.

Launching this code (inspired by #90622):

from concurrent import futures
import faulthandler
import asyncio

def work(iteration, item):
    # anything here, even a pass
    pass

async def do_one(i):
    faulthandler.dump_traceback_later(20, repeat=True)
    with futures.ProcessPoolExecutor() as executor:
        loop = asyncio.get_event_loop()
        loop.run_in_executor(executor, work, i, 1)
        print(f"executor: max_workers={executor._max_workers}, processes={executor._processes}, method={executor._mp_context.get_start_method(allow_none=False)}, safe={executor._safe_to_dynamically_spawn_children}")
    faulthandler.cancel_dump_traceback_later()

async def run():
    for i in range(10000):
        await do_one(i)

async def main():
    task = asyncio.create_task(run())
    await asyncio.gather(task)

if __name__ == "__main__":
    asyncio.run(main())

This sometimes causes a deadlock.

The faulthandler dumps the traceback below on timeout:

Thread 0x0000004018db2700 (most recent call first):
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 27 in poll
  File "/usr/local/lib/python3.9/multiprocessing/popen_fork.py", line 43 in wait
  File "/usr/local/lib/python3.9/multiprocessing/process.py", line 149 in join
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 521 in join_executor_internals
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 343 in run
  File "/usr/local/lib/python3.9/threading.py", line 980 in _bootstrap_inner
  File "/usr/local/lib/python3.9/threading.py", line 937 in _bootstrap

Thread 0x0000004002fcf3c0 (most recent call first):
  File "/usr/local/lib/python3.9/threading.py", line 1080 in _wait_for_tstate_lock
  File "/usr/local/lib/python3.9/threading.py", line 1060 in join
  File "/usr/local/lib/python3.9/concurrent/futures/process.py", line 767 in shutdown
  File "/usr/local/lib/python3.9/concurrent/futures/_base.py", line 637 in __exit__
  File "//t.py", line 17 in do_one
  File "//t.py", line 22 in run
  File "/usr/local/lib/python3.9/asyncio/events.py", line 80 in _run
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 1905 in _run_once
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 601 in run_forever
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 634 in run_until_complete
  File "/usr/local/lib/python3.9/asyncio/runners.py", line 44 in run
  File "//t.py", line 31 in <module>

The pstree shows the following:

root@6b8cab06d7d7:/# pstree -pst
bash(1)─┬─python(4650)─┬─python(11873)───{python}(11875)
        │              ├─{python}(4652)
        │              └─{python}(11874)
        └─{bash}(8)

I am running under Docker launched in QEMU on a macOS M1 machine:

root@6b8cab06d7d7:/# uname -a
Linux 6b8cab06d7d7 5.15.68-0-virt #1-Alpine SMP Fri, 16 Sep 2022 06:29:31 +0000 x86_64 x86_64 x86_64 GNU/Linux

I tried to analyze it under strace; here is the last output before it simply hangs:

37 close(7) = 0
37 close(6) = 0
37 getpid() = 37
304 write(11,0x630d440,104) = 104
304 read(8,0x5e2c410,4) = 4
304 read(8,0x5e2c410,4) = 4
304 getpid() = 304
304 write(11,0x5e427d0,19) = 19
304 exit_group(0)
37 wait4(304,275152568292,0,0,0,39) = 304
37 getpid() = 37

With child processes having PIDs 304 and 305, interrupting my script showed in strace that it was hung on wait4 for PID 305.

^C37 wait4(305,275152568292,0,0,0,39) = -1 errno=4 (Interrupted system call)
--- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL, si_pid=0, si_uid=0} ---

Process 304 exits normally after SIGINT, but 305 does not; it still hangs.

I did not check versions higher than 3.9.x, but I don't rule out that the problem is reproducible on newer versions of Python.

UPD. Reproduced the same on Python 3.10.10:

Thread 0x00000040106b5700 (most recent call first):
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 27 in poll
  File "/usr/local/lib/python3.10/multiprocessing/popen_fork.py", line 43 in wait
  File "/usr/local/lib/python3.10/multiprocessing/process.py", line 149 in join
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 521 in join_executor_internals
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 343 in run
  File "/usr/local/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/usr/local/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x0000004002fcd3c0 (most recent call first):
  File "/usr/local/lib/python3.10/threading.py", line 1116 in _wait_for_tstate_lock
  File "/usr/local/lib/python3.10/threading.py", line 1096 in join
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 775 in shutdown
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 649 in __exit__
  File "//t.py", line 14 in do_one
  File "//t.py", line 22 in run
  File "/usr/local/lib/python3.10/asyncio/events.py", line 80 in _run
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1909 in _run_once
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 603 in run_forever
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 636 in run_until_complete
  File "/usr/local/lib/python3.10/asyncio/runners.py", line 44 in run
  File "//t.py", line 31 in <module>

UPD2. I just found that when the issue happens, one of the children doesn't appear in strace at all, except in the wait4 call.

UPD3. Some logging from multiprocessing:

[DEBUG/MainProcess] created semlock with handle 275333459968
[DEBUG/MainProcess] created semlock with handle 275333464064
[DEBUG/MainProcess] created semlock with handle 275333468160
[DEBUG/MainProcess] created semlock with handle 275333472256
[DEBUG/MainProcess] created semlock with handle 275333476352
[DEBUG/ForkProcess-3332] Queue._after_fork()
[INFO/ForkProcess-3332] child process calling self.run()
executor: max_workers=2, processes={32178: <ForkProcess name='ForkProcess-3331' pid=32178 parent=22185 started>, 32179: <ForkProcess name='ForkProcess-3332' pid=32179 parent=22185 started>}, method=fork, safe=False
[DEBUG/MainProcess] Queue._start_thread()
[DEBUG/MainProcess] doing self._thread.start()
[DEBUG/ForkProcess-3331] Queue._after_fork()
[DEBUG/MainProcess] starting thread to feed data to pipe
[DEBUG/MainProcess] ... done self._thread.start()
[INFO/ForkProcess-3331] child process calling self.run()
[DEBUG/MainProcess] telling queue thread to quit
[INFO/ForkProcess-3332] process shutting down
[DEBUG/ForkProcess-3332] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3332] running the remaining "atexit" finalizers
[DEBUG/MainProcess] Queue.join_thread()
[INFO/ForkProcess-3332] process exiting with exitcode 0
[DEBUG/MainProcess] joining queue thread
[INFO/ForkProcess-3331] process shutting down
[DEBUG/MainProcess] feeder thread got sentinel -- exiting
[DEBUG/MainProcess] ... queue thread joined
[DEBUG/ForkProcess-3331] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3331] running the remaining "atexit" finalizers
[INFO/ForkProcess-3331] process exiting with exitcode 0
[DEBUG/MainProcess] created semlock with handle 275333480448
[DEBUG/MainProcess] created semlock with handle 275333484544
[DEBUG/MainProcess] created semlock with handle 275333488640
[DEBUG/MainProcess] created semlock with handle 275333492736
[DEBUG/MainProcess] created semlock with handle 275333496832
executor: max_workers=2, processes={32184: <ForkProcess name='ForkProcess-3333' pid=32184 parent=22185 started>, 32186: <ForkProcess name='ForkProcess-3334' pid=32186 parent=22185 started>}, method=fork, safe=False
[DEBUG/MainProcess] Queue._start_thread()
[DEBUG/MainProcess] doing self._thread.start()
[DEBUG/MainProcess] starting thread to feed data to pipe
[DEBUG/MainProcess] ... done self._thread.start()
[DEBUG/ForkProcess-3334] Queue._after_fork()
[INFO/ForkProcess-3334] child process calling self.run()
[DEBUG/MainProcess] telling queue thread to quit
[DEBUG/MainProcess] Queue.join_thread()
[DEBUG/MainProcess] joining queue thread
[DEBUG/MainProcess] feeder thread got sentinel -- exiting
[INFO/ForkProcess-3334] process shutting down
[DEBUG/MainProcess] ... queue thread joined
[DEBUG/ForkProcess-3334] running all "atexit" finalizers with priority >= 0
[DEBUG/ForkProcess-3334] running the remaining "atexit" finalizers
[INFO/ForkProcess-3334] process exiting with exitcode 0
kekekekule commented 1 year ago

@AlexWaygood I see you added this issue to a board. Could you please validate whether this is a real bug that was not fixed by some multiprocessing/threading/concurrent patches in 3.9?

AlexWaygood commented 1 year ago

Hi, I was just doing some drive-by triaging. I'm a core dev, but I don't primarily work on multiprocessing issues in CPython. I could investigate further, but I'm not a multiprocessing expert, so I'm probably not best placed to do so. I also have a number of other projects on the go at the moment. I added it to the board so that members of the team who are experts on multiprocessing will be able to find it and work on it more easily.

However, please try to see if the problem reproduces on Python 3.11 or later. Python 3.9 and 3.10 are now in "security-only" mode — we won't fix bugs in those versions of Python unless they relate to a security vulnerability.

kekekekule commented 1 year ago

@AlexWaygood I could reproduce it on Python 3.11.2.

So maybe it's worth labeling the issue for Python 3.9, 3.10, and 3.11.

AlexWaygood commented 1 year ago

So maybe it's worth labeling the issue for Python 3.9, 3.10, and 3.11.

Unless you can persuasively argue it's a security issue, we won't be backporting a fix to 3.10 or older, so I won't add the 3.10 or 3.9 label for now.

kekekekule commented 1 year ago

A new thing I noticed: I could reproduce this deadlock even on Python 3.8. Just highlighting this for easier debugging.

YvesDup commented 1 year ago

Hi @kekekekule, I am just wondering if the bug is still present when running ProcessPoolExecutor with an mp_context of spawn or forkserver?

kekekekule commented 1 year ago

Spawn/forkserver -- no. As my repro shows, I was using fork. I know spawn/forkserver is safer, but that doesn't make the issue go away :)

YvesDup commented 1 year ago

Spawn/forkserver -- no

Sorry, I don't understand: did you run your script with a context other than 'fork' just to check whether there is still something wrong?