Open 2941a905-cf6d-4069-8e6f-bdecd9c20475 opened 4 years ago
We're having some problems with multiprocessing.Queue where the parent process ends up hanging with zombie children. The code is part of bitbake, the task execution engine behind OpenEmbedded/Yocto Project.
I've cut down our code to the pieces in question in the attached file. Unfortunately it doesn't give a runnable test case, but it does at least show what we're doing. Basically, we have a set of items to parse, and we create a set of multiprocessing.Process() processes to handle the parsing in parallel. Jobs are queued in one queue and results are fed back to the parent via another. There is a quit queue that takes sentinels to cause the subprocesses to quit.
If something fails to parse, shutdown with clean=False is called and the sentinels are sent. The Parser() process then calls results.cancel_join_thread() on the results queue. We do this since we don't care about the results any more; we just want to ensure everything exits cleanly. This is where things go wrong. The Parser processes and their queues all turn into zombies. The parent process ends up stuck in self.result_queue.get(timeout=0.25) inside shutdown().
strace shows it has acquired the locks and is doing a read() on the os.pipe() it created. Unfortunately, since the parent still has a write channel open to the same pipe, the read never sees EOF and it hangs indefinitely.
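For readers without the attachment, the layout described above can be sketched roughly as follows. This is not the bitbake code: the names `parser` and `run_parsers`, the `upper()` call standing in for parsing, and the explicit "fork" start method (so it needs a POSIX host) are all assumptions made for illustration.

```python
import multiprocessing as mp
import queue


def parser(jobs, results, quit_q):
    """Hypothetical worker: handle jobs until a sentinel appears on quit_q."""
    while True:
        try:
            quit_q.get_nowait()          # any item here means "stop now"
            break
        except queue.Empty:
            pass
        try:
            job = jobs.get(timeout=0.1)
        except queue.Empty:
            continue
        results.put(job.upper())         # stand-in for the real parsing


def run_parsers(items, workers=2):
    ctx = mp.get_context("fork")         # assumption: fork is available
    jobs, results, quit_q = ctx.Queue(), ctx.Queue(), ctx.Queue()
    procs = [ctx.Process(target=parser, args=(jobs, results, quit_q))
             for _ in range(workers)]
    for proc in procs:
        proc.start()
    for item in items:
        jobs.put(item)
    # Clean path: collect one result per job before asking anyone to quit.
    parsed = [results.get(timeout=10) for _ in items]
    for _ in procs:
        quit_q.put(None)                 # one sentinel per worker
    for proc in procs:
        proc.join()
    return sorted(parsed)
```

This only covers the clean path, where every result has been consumed before the sentinels go out. The problem in this report is on the error path, where workers are told to quit while results may still be queued.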
If I change the code to do:
    self.result_queue._writer.close()
    while True:
        try:
            self.result_queue.get(timeout=0.25)
        except (queue.Empty, EOFError):
            break
i.e. close the writer side of the pipe by poking at the queue internals, we don't see the hang. The .close() method would close both sides.
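The underlying pipe behaviour can be seen in isolation with a plain multiprocessing Pipe, used here as a stand-in for the pipe inside the Queue. This is a minimal sketch under assumptions: the names `child` and `saw_eof` are made up, and it uses the "fork" start method, so it needs a POSIX host. The reader only sees EOF once every write end is closed, including the parent's own copy.

```python
import multiprocessing as mp


def child(writer):
    # Hypothetical child: send one message, then close its write end.
    writer.send("partial result")
    writer.close()


def saw_eof(close_parent_writer):
    """Report whether the parent's reader reaches EOF after the child exits."""
    ctx = mp.get_context("fork")             # assumption: fork is available
    reader, writer = ctx.Pipe(duplex=False)  # one-way pipe: reader, writer
    proc = ctx.Process(target=child, args=(writer,))
    proc.start()
    if close_parent_writer:
        writer.close()                       # drop the parent's write end
    proc.join()                              # the child's write end is gone now
    assert reader.recv() == "partial result"
    eof = False
    if reader.poll(0.5):                     # readable means data or EOF
        try:
            reader.recv()
        except EOFError:
            eof = True
    if not close_parent_writer:
        writer.close()
    reader.close()
    return eof
```

With the parent's write end closed, the second read ends with EOFError. With it left open, poll() just times out, which is the bounded version of the indefinite read() seen in strace.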
We create our own process pool since this code dates from the Python 2.x days, and multiprocessing pools had issues back when we started using this. I'm sure it would be much better now, but we're reluctant to change what has basically been working. We drain the queues since in some cases we have clean shutdowns where cancel_join_thread() hasn't been used, and we don't want those cases to block.
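The reason for draining can be sketched like this. A child that has put data on a queue implicitly joins the queue's feeder thread when it exits, so it cannot finish until the parent has read what it queued. The names `producer` and `drain_then_join`, the message size, and the "fork" start method are assumptions for illustration, not our actual code.

```python
import multiprocessing as mp
import queue


def producer(results, count):
    # Hypothetical child: queue messages larger than a typical pipe buffer.
    for _ in range(count):
        results.put("x" * 100_000)
    # On exit the process waits for the feeder thread to flush these,
    # because cancel_join_thread() has not been called.


def drain_then_join(count=5):
    ctx = mp.get_context("fork")         # assumption: fork is available
    results = ctx.Queue()
    proc = ctx.Process(target=producer, args=(results, count))
    proc.start()
    drained = 0
    while True:
        # Check liveness before the read: if the child was already gone and
        # the read still times out, nothing more can arrive.
        alive = proc.is_alive()
        try:
            results.get(timeout=0.25)
            drained += 1
        except queue.Empty:
            if not alive:
                break
    proc.join()                          # safe: everything has been read
    return drained
```

Calling proc.join() before the drain loop would risk blocking, since the child is waiting for the parent to read while the parent is waiting for the child to exit.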
My question is whether this is a known issue and whether there is some kind of API to close just the write side of the Queue to avoid problems like this?
I should also add that if we don't use cancel_join_thread() in the parser processes, things all work out ok. There is therefore seemingly something odd about the state that call leaves things in. This issue doesn't occur every time; it's maybe 1 in 40 runs where we throw parsing errors, but I can brute force reproduce it.
Even my hack of calling _writer.close() doesn't seem to be enough; it makes the problem rarer, but there is still an issue. Basically, if you call cancel_join_thread() in one process, the queue is potentially totally broken in every other process that may be using it. For example, suppose one process has queued data and is in join_thread() as it exits, while another process that called cancel_join_thread() exits holding the write lock. The first process is now stuck in join_thread() waiting for a lock it will never get, and you deadlock. I suspect the answer is "don't use cancel_join_thread()", but perhaps the docs need a note pointing out that it can deadlock if anything else is already potentially exiting? I'm not sure you can actually use the API safely unless you stop all other users from exiting and synchronise that by other means.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['library', 'type-crash']
title = 'multiprocessing.Queue deadlock'
updated_at =
user = 'https://github.com/rpurdie'
```
bugs.python.org fields:
```python
activity =
actor = 'ned.deily'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'rpurdie'
dependencies = []
files = ['49444']
hgrepos = []
issue_num = 41714
keywords = []
message_count = 3.0
messages = ['376350', '376351', '376357']
nosy_count = 3.0
nosy_names = ['pitrou', 'davin', 'rpurdie']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'crash'
url = 'https://bugs.python.org/issue41714'
versions = ['Python 3.6']
```