Closed ec31e85c-a35c-494f-805f-21a63c135c1b closed 3 years ago
The attached test program hangs eventually (it may need a few thousand of iterations).
Tested with Python v3.7.2 on Linux, amd64.
I've only got 3.7.1 Ubuntu bash on Windows (also amd64) immediately available, but I'm not seeing a hang, nor is there any obvious memory leak that might eventually lead to problems (memory regularly drops back to under 10 MB shared, 24 KB private working set). I modified your code to add a sys.stdout.flush() after the write so it would actually echo the dots as they were written instead of waiting for a few thousand of them to build up in the buffer, but otherwise it's the same code.
Are you sure you're actually hanging, and it's not just the output getting buffered?
You're right that sys.stdout.flush() is missing in my code; but on Linux it doesn't make a big difference, because multiprocessing flushes stdout before fork()ing.
And yes, it really hangs.
This seem related to https://bugs.python.org/issue35809
Could you use gdb/lldb to attach to the process hanging and give us a stack trace?
There are two processes running (parent and child) when the thing hangs. I'm attaching GDB backtraces for both.
@jwilk: thanks for creating cf-deadlock.py
I can replicate the test program hang on Fedora 29 with python3-3.7.2-4.fc29.x86_64
The test program hasn't yet hung on Fedora 29 with older packages, in particular python3-3.7.1-4.fc29.x86_64
My interest is due to the fact that the libreswan.org test suite has started to hang and we don't know why. It might well be this bug.
I've filed a Fedora bug report that points to this one: \https://bugzilla.redhat.com/show_bug.cgi?id=1691434\
Any update on this issue? I don't understand why the example hangs.
I've attached a variation on cf-deadlock.py that, should nothing happen for 2 minutes, will kill itself. Useful with git bisect.
I'm seeing cf-deadlock-alarm.py hangs on vanilla python 3.7.[0123] with:
Linux 5.0.5-100.fc28.x86_64 #1 SMP Wed Mar 27 22:16:29 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux glibc-2.27-37.fc28.x86_64
can anyone reproduce this?
I also wonder if this is connected to bpo-6721 where a recent "fix" made things worse - the fedora versions that work for libreswan don't have the "fix".
More info from adding a faulthandler ...
15 def f():
16 import ctypes
17
18 for i in range(0,50):
19 sys.stdout.write("\r%d" % i)
20 sys.stdout.flush()
21 signal.alarm(60*2)
22 for j in range(0,1000):
23 with concurrent.futures.ProcessPoolExecutor() as executor:
24 ftr = executor.submit(f)
25 ftr.result()
Thread 0x00007f1ce7fff700 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/threading.py", line 296 in wait File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 224 in _feed File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865 in run File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917 in _bootstrap_inner File "/home/python/v3.7.3/lib/python3.7/threading.py", line 885 in _bootstrap
Thread 0x00007f1cec917700 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/selectors.py", line 415 in select File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 920 in wait File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 354 in _queue_management_worker File "/home/python/v3.7.3/lib/python3.7/threading.py", line 865 in run File "/home/python/v3.7.3/lib/python3.7/threading.py", line 917 in _bootstrap_inner File "/home/python/v3.7.3/lib/python3.7/threading.py", line 885 in _bootstrap
Current thread 0x00007f1cfd9486c0 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/threading.py", line 296 in wait File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/_base.py", line 427 in result File "cf-deadlock.py", line 25 in \<module>
Here's the children; yes there are somehow 4 children sitting around. Hopefully this is enough to figure out where things deadlock.
29970 8752 8752 29970 pts/6 8752 Sl+ 1000 1:00 | | \ ./v3.7.3/bin/python3 cf-deadlock.py 8752 8975 8752 29970 pts/6 8752 S+ 1000 0:00 | | \ ./v3.7.3/bin/python3 cf-deadlock.py 8752 8976 8752 29970 pts/6 8752 S+ 1000 0:00 | | \_ ./v3.7.3/bin/python3 cf-deadlock.py 8752 8977 8752 29970 pts/6 8752 S+ 1000 0:00 | | \ ./v3.7.3/bin/python3 cf-deadlock.py 8752 8978 8752 29970 pts/6 8752 S+ 1000 0:00 | | \ ./v3.7.3/bin/python3 cf-deadlock.py
8975
Current thread 0x00007f3be65126c0 (most recent call first): File "\<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed File "\<frozen importlib._bootstrap_external>", line 1043 in create_module File "\<frozen importlib._bootstrap>", line 583 in module_from_spec File "\<frozen importlib._bootstrap>", line 670 in _load_unlocked File "\<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked File "\<frozen importlib._bootstrap>", line 983 in _find_and_load File "/home/python/v3.7.3/lib/python3.7/ctypes/init.py", line 7 in \<module> File "\<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed File "\<frozen importlib._bootstrap_external>", line 728 in exec_module File "\<frozen importlib._bootstrap>", line 677 in _load_unlocked File "\<frozen importlib._bootstrap>", line 967 in _find_and_load_unlocked File "\<frozen importlib._bootstrap>", line 983 in _find_and_load File "cf-deadlock.py", line 17 in f File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 232 in _process_worker File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popenfork.py", line 20 in \_init__ File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit File "cf-deadlock.py", line 25 in \<module>
8976
Current thread 0x00007f3be65126c0 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 379 in _recv File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 407 in _recv_bytes File "/home/python/v3.7.3/lib/python3.7/multiprocessing/connection.py", line 216 in recv_bytes File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 94 in get File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popenfork.py", line 20 in \_init__ File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit File "cf-deadlock.py", line 25 in \<module>
8977
Current thread 0x00007f3be65126c0 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/multiprocessing/synchronize.py", line 95 in __enter File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 93 in get File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popenfork.py", line 20 in \_init File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit File "cf-deadlock.py", line 25 in \<module>
8978
Current thread 0x00007f3be65126c0 (most recent call first): File "/home/python/v3.7.3/lib/python3.7/multiprocessing/synchronize.py", line 95 in __enter File "/home/python/v3.7.3/lib/python3.7/multiprocessing/queues.py", line 93 in get File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 226 in _process_worker File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 99 in run File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 297 in _bootstrap File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popen_fork.py", line 74 in _launch File "/home/python/v3.7.3/lib/python3.7/multiprocessing/popenfork.py", line 20 in \_init File "/home/python/v3.7.3/lib/python3.7/multiprocessing/context.py", line 277 in _Popen File "/home/python/v3.7.3/lib/python3.7/multiprocessing/process.py", line 112 in start File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 593 in _adjust_process_count File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 569 in _start_queue_management_thread File "/home/python/v3.7.3/lib/python3.7/concurrent/futures/process.py", line 615 in submit File "cf-deadlock.py", line 25 in \<module>
Reverting 3b699932e5ac3e76031bbb6d700fbea07492641d makes problem go away.
@hroncok see comment msg339370
Vanilla 3.7.0 (re-confirmed) didn't contain the change, nor did 3.6.8 (ok, that isn't vanilla) but both can hang using the test. It can take a while and, subjectively, it seems to depend on machine load. I've even struggled to get 3.7.3 to fail without load.
Presumably there's a race and grinding the test machine into the ground increases the odds of it happening.
The patch for bpo-6721 could be causing many things, but two to mind:
My hunch is the latter as the stack dumps look nothing like those I analyzed for bpo-36533 (see messages msg339454 and msg339458).
Gregory: It seems like https://github.com/python/cpython/commit/3b699932e5ac3e76031bbb6d700fbea07492641d is causing deadlocks which is not a good thing. What do you think of reverting this change?
A least 2 projects were broken by the logging change: libreswan and Anaconda.
I've filed a Fedora bug report that points to this one: \https://bugzilla.redhat.com/show_bug.cgi?id=1691434\
That's related to the libreswan project.
Last year, there was another regression in Anaconda: Fedora installer:
The workaround/fix was to revert 3b699932e5ac3e7 in Python. Anaconda has been modified, and we were able to revert the revert 3b699932e5ac3e7 :-)
I'm not sure what was the Anaconda fix. Maybe this change? https://github.com/rhinstaller/anaconda/pull/1721
(disclaimer: I'm mashing my high level backtraces in with @jwiki's low level backtraces)
The Python backtrace shows the deadlocked process called 'f' which then 'called': import ctypes which, in turn 'called': from _ctypes import Union, Structure, Array and that hung.
The low-level back-trace shows it was trying to acquire a lock (no surprises there); but the surprise is that it is inside of dlopen() trying to load '_ctypes...so'!
...
and the lock in question (assuming my sources roughly match above) seems to be:
/ We modify the list of loaded objects. \/ __rtld_lock_lock_recursive (GL(dl_load_write_lock));
presumably a thread in the parent held this lock at the time of the fork.
If one of the other children also has the lock pre-acquired then this is confirmed (unfortunately not having the lock won't rebut the theory).
So, any guesses as to what dl related operation was being performed by the parent?
----
I don't think the remaining processes are involved (and I've probably got 4 in total because my machine has 4 cores).
8976 - this acquired the multi-process semaphore and is blocked in '_recv' awaiting further instructions 8978, 8977 - these are blocked waiting for above to free the multi-process semaphore
Please do not blindly revert that. See my PR in https://bugs.python.org/issue36533 which is specific to this "issue" with logging.
I'd appreciate it if someone with an application running into the issue could be tested with my PR from bpo-36533 (https://github.com/python/cpython/pull/12704) applied.
https://github.com/python/cpython/pull/12704 doesn't fix the bug for me. Reverting 3b699932e5ac3e76031bbb6d700fbea07492641d doesn't fix it either.
Here's a possible stack taken during the fork():
Thread 1 "python3" hit Breakpoint 1, 0x00007ffff7124734 in fork () from /lib64/libc.so.6
Thread 1814 (Thread 0x7fffe69d5700 (LWP 23574)):
Thread 1 (Thread 0x7ffff7fca080 (LWP 20524)):
where, in my source code, dl_iterate_phdr() starts with something like:
/ Make sure nobody modifies the list of loaded objects. \/ __rtld_lock_lock_recursive (GL(dl_load_write_lock));
i.e., when the fork occures, the non-fork thread has acquired dl_load_write_lock - the same lock that the child will later try to acquire (and hang)
no clue as to what that thread is doing though; other than it looks like it is trying to generate a backtrace?
run ProcessPoolExecutor with one fixed child (over ride default of #cores)
script to capture stack backtrace at time of fork, last backtrace printed will be for hang
I am unable to get cf-deadlock.py to hang on my own builds of pure CPython 3.7.2+ d7cb2034bb or 3.6.8+ be77fb7a6e (versions i had in a local git clone).
which specific python builds are seeing the hang using? Which specific platform/distro version? "3.7.2" isn't enough, if you are using a distro supplied interpreter please try and reproduce this with a build from the CPython tree itself. distros always apply their own patches to their own interpreters.
...
Do realize that while working on this it is fundamentally *impossible* per POSIX for os.fork() to be safely used at the Python level in a process also using pthreads. That this _ever_ appeared to work is a pure accident of implementations of underlying libc, malloc, system libraries, and kernel behaviors. POSIX considers it undefined behavior. Nothing done in CPython can avoid that. Any "fix" for these kinds of issues is merely working around the inevitable which will re-occur.
concurrent.futures.ProcessPoolExecutor uses multiprocessing for its process management. As of 3.7 ProcessPoolExecutor accepts a mp_context parameter to specify the multiprocessing start method. Alternatively the default appears to be controllable as a global setting https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods.
Use the 'spawn' start method and the problem should go away as it'll no longer be misusing os.fork(). You _might be able to get the 'forkserver' start method to work, but only reliably if you make sure the forkserver is spawned _before any threads in the process (such as ProcessPoolExecutor's own queue management thread - which appears to be spawned upon the first call to .submit()).
We're discussing vanilla Python, for instance v3.7.0 is:
git clone .../cpython cd cpython git checkout v3.7.0 ./configure --prefix=/home/python/v3.7.0 make -j && make -j install
(my 3.6.x wasn't vanilla, but I clearly stated that)
Like I also mentioned, loading down the machine also helps. Try something like running #cores*2 of the script in parallel?
@gregory.p.smith, I'm puzzled by your references to POSIX and/or os.fork().
The code in question looks like:
import concurrent.futures
import sys
def f():
import ctypes
while True:
with concurrent.futures.ProcessPoolExecutor() as executor:
ftr = executor.submit(f)
ftr.result()
which, to me, looks like pure Python.
Are you saying that this code can't work on GNU/Linux systems.
concurrent.futures.ProcessPoolExecutor uses both multiprocessing and threading. multiprocessing defaults to using os.fork().
So:
#3 we've got a workaround - something like:
ProcessPoolExecutor(multiprocessing.get_context('spawn'))
but I'm guessing, the documentation is scant.
As for a fix, maybe:
"the single-threaded ProcessPoolExecutor test program"
I doubt it is single threaded, the .submit() method appears to spawn a thread internally.
FYI, I'm getting a similar deadlock in a child Python process which is stuck on locking a mutex from the dl library. See attached stack.
I'm not using concurrent.futures however, the parent Python process is a test driver that uses threading.Thread and subprocess.Popen to spawn new processes... I'm not using os.fork().
This occurred on ArchLinux with Python 3.7.4.
6 0x00007ffff7bc7712 in __pthread_unwind () from /lib64/libpthread.so.0
7 0x00007ffff7bbf7e7 in pthread_exit () from /lib64/libpthread.so.0
8 0x000000000051b2fc in PyThread_exit_thread () at Python/thread_pthread.h:238
9 0x000000000055ed16 in t_bootstrap (boot_raw=0x7fffe8da0e40) at ./Modules/_threadmodule.c:1021
The thread_run() function (previously "t_bootstrap()") which is the low-level C function to run a thread in Python _thread.start_new_thread() no longer calls PyThread_exit_thread(): see bpo-44434.
I'm able to reproduce the issue using attached cf-deadlock.py with Python 3.8:
In the main branch (with bpo-44434 fix), I can no longer reproduce the issue. I ran cf-deadlock.py in 4 terminals in parallel with "./python -m test -r -j2" in a 5th terminal for 5 minutes. I couldn't reproduce the issue. On Python 3.8, I reproduced the issue in less than 1 minute.
Can someone please confirm that the issue is now fixed? Can we mark this issue as a duplicate of bpo-44434?
I can no longer reproduce the bug with Python from git.
Great! I close the issue.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at =
created_at =
labels = ['3.7', '3.8', 'type-bug', 'library']
title = 'concurrent.futures deadlock'
updated_at =
user = 'https://github.com/jwilk'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date =
closer = 'vstinner'
components = ['Library (Lib)']
creation =
creator = 'jwilk'
dependencies = []
files = ['48090', '48223', '48224', '48242', '48271', '48272', '48611']
hgrepos = []
issue_num = 35866
keywords = []
message_count = 34.0
messages = ['334618', '334648', '334652', '338376', '338377', '338402', '338501', '338549', '339334', '339361', '339370', '339451', '339464', '340263', '340278', '340343', '340344', '340358', '340359', '340360', '340362', '340430', '340431', '340432', '340442', '340451', '340456', '340459', '340464', '340468', '352707', '396675', '396910', '396929']
nosy_count = 11.0
nosy_names = ['gregory.p.smith', 'bquinlan', 'pitrou', 'vstinner', 'bentoi', 'jwilk', 'josh.r', 'cagney', 'hroncok', 'pablogsal', 'hugh']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue35866'
versions = ['Python 3.7', 'Python 3.8']
```