Open pfmoore opened 12 years ago
I have an application which fires off a number of database connections via a multiprocessing pool. Unfortunately, the database software occasionally gets "stuck" and a connection request hangs indefinitely. This locks up the whole worker process making the connection, and the hang cannot be interrupted except by killing that process.
It would be useful to have a facility to restart "stuck" workers in this case.
As an interface, I would suggest an additional argument to the AsyncResult.get method, kill_on_timeout. If this argument is true, and the get times out, the worker servicing the result will be killed and restarted.
Alternatively, provide a method on an AsyncResult to access the worker process that is servicing the request. I could then wait on the result and kill the worker manually if it does not respond in time.
Without a facility like this, there is a potential for the pool to get starved of workers if multiple connections hang.
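For illustration, the behaviour requested above can be approximated today outside the pool by giving each call its own process and killing it on timeout. This is only a rough sketch, not a proposed implementation: `run_with_timeout`, `_call` and `hang_forever` are hypothetical names, and the "fork" start method is assumed so that the example runs as a flat script on POSIX.

```python
import multiprocessing as mp
import time

# Assumption: the "fork" start method (POSIX only) keeps this sketch runnable
# as a flat script. With "spawn", guard the entry point with
# `if __name__ == "__main__":` and pass only picklable, importable callables.
_ctx = mp.get_context("fork")


def _call(conn, func, args):
    """Child side: run func(*args) and send back (succeeded, payload)."""
    try:
        conn.send((True, func(*args)))
    except Exception as exc:
        conn.send((False, exc))
    finally:
        conn.close()


def run_with_timeout(func, args=(), timeout=5.0):
    """Run func(*args) in its own process and kill that process on timeout.

    Hypothetical helper, not part of multiprocessing.
    """
    recv_end, send_end = _ctx.Pipe(duplex=False)
    proc = _ctx.Process(target=_call, args=(send_end, func, args))
    proc.start()
    send_end.close()  # the parent only reads from the pipe
    try:
        if not recv_end.poll(timeout):
            raise TimeoutError(f"no result within {timeout} seconds")
        succeeded, payload = recv_end.recv()
    finally:
        if proc.is_alive():
            # Sends SIGTERM; a process wedged in C code may need proc.kill().
            proc.terminate()
        proc.join()
        recv_end.close()
    if succeeded:
        return payload
    raise payload  # re-raise the exception captured in the child


def hang_forever():
    """Stand-in for a database connect call that never returns."""
    time.sleep(3600)
```

Calling `run_with_timeout(hang_forever, timeout=0.5)` raises `TimeoutError` and leaves no stuck process behind. The cost is one process start per call, which is exactly what a pool is meant to avoid, hence the request for support in the pool itself.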
The problem is that queues and other synchronization objects can end up in an inconsistent state when a worker crashes, hangs or gets killed. That's why, in concurrent.futures, a crashed worker makes the ProcessPoolExecutor become "broken". A similar thing should be done for multiprocessing.Pool but it's a more complex object.
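The concurrent.futures behaviour mentioned here can be seen with a small experiment. The sketch below makes a worker exit abruptly and then checks that the executor refuses further work; `crash` and `demo_broken_pool` are hypothetical names for this example, and the "fork" start method is again an assumption made so the code runs as a flat script on POSIX.

```python
import multiprocessing as mp
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def crash():
    """Simulate a worker dying abruptly, as if it had been killed."""
    os._exit(1)


def demo_broken_pool():
    """Return (first_call_broke, later_submit_refused)."""
    # Assumption: "fork" (POSIX only) so the sketch runs as a flat script.
    ctx = mp.get_context("fork")
    first_call_broke = later_submit_refused = False
    with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as executor:
        try:
            executor.submit(crash).result(timeout=30)
        except BrokenProcessPool:
            # The pending future fails because its worker vanished.
            first_call_broke = True
        try:
            executor.submit(pow, 2, 3)
        except BrokenProcessPool:
            # The executor is now unusable and rejects new submissions.
            later_submit_refused = True
    return first_call_broke, later_submit_refused
```

Both flags come back true: the executor fails fast rather than trying to recover queues that may be in an inconsistent state, so the caller has to create a new executor to continue.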
As an alternative, maybe leave the "stuck" worker, but allow the pool to recognise when a worker has not processed new messages for a long period and spawn an extra worker to replace it. That would avoid the starvation issue, and the stuck workers would die when the pool is terminated.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library']
title = 'Option to kill "stuck" workers in a multiprocessing pool'
updated_at =
user = 'https://github.com/pfmoore'
```
bugs.python.org fields:
```python
activity =
actor = 'bquinlan'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'paul.moore'
dependencies = []
files = []
hgrepos = []
issue_num = 14148
keywords = []
message_count = 3.0
messages = ['154549', '154573', '154575']
nosy_count = 4.0
nosy_names = ['paul.moore', 'bquinlan', 'pitrou', 'neologix']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue14148'
versions = ['Python 3.3', 'Python 3.4']
```