python / cpython

The Python programming language
https://www.python.org

Option to kill "stuck" workers in a multiprocessing pool #58356

Open pfmoore opened 12 years ago

pfmoore commented 12 years ago
BPO 14148
Nosy @pfmoore, @brianquinlan, @pitrou

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library']
title = 'Option to kill "stuck" workers in a multiprocessing pool'
updated_at =
user = 'https://github.com/pfmoore'
```

bugs.python.org fields:

```python
activity =
actor = 'bquinlan'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'paul.moore'
dependencies = []
files = []
hgrepos = []
issue_num = 14148
keywords = []
message_count = 3.0
messages = ['154549', '154573', '154575']
nosy_count = 4.0
nosy_names = ['paul.moore', 'bquinlan', 'pitrou', 'neologix']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue14148'
versions = ['Python 3.3', 'Python 3.4']
```

pfmoore commented 12 years ago

I have an application which fires off a number of database connections via a multiprocessing pool. Unfortunately, the database software occasionally gets "stuck" and a connection request hangs indefinitely. This locks up the worker process making the connection, and it cannot be interrupted except by killing the process.

It would be useful to have a facility to restart "stuck" workers in this case.

As an interface, I would suggest an additional argument to the AsyncResult.get method, kill_on_timeout. If this argument is true, and the get times out, the worker servicing the result will be killed and restarted.

Alternatively, provide a method on an AsyncResult to access the worker process that is servicing the request. I could then wait on the result and kill the worker manually if it does not respond in time.

Without a facility like this, there is a potential for the pool to get starved of workers if multiple connections hang.
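For context, the closest workaround available today (this is a sketch of current behaviour, not the proposed API): AsyncResult.get already accepts a timeout, but when it fires, the only recovery is Pool.terminate(), which kills every worker in the pool, not just the stuck one. The hypothetical connect function below stands in for the hanging database call:

```python
import time
from multiprocessing import Pool, TimeoutError

def connect(seconds):
    # Stand-in for a database connection attempt that hangs.
    time.sleep(seconds)
    return "connected"

if __name__ == "__main__":
    pool = Pool(processes=2)
    result = pool.apply_async(connect, (60,))
    try:
        result.get(timeout=1)
    except TimeoutError:
        # No per-worker kill exists: terminate() tears down the
        # whole pool, including workers doing useful work.
        pool.terminate()
        pool.join()
        print("timed out; pool terminated")
```

A kill_on_timeout argument (or access to the worker behind an AsyncResult) would let the except branch kill only the offending worker instead of the entire pool.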

pitrou commented 12 years ago

The problem is that queues and other synchronization objects can end up in an inconsistent state when a worker crashes, hangs or gets killed. That's why, in concurrent.futures, a crashed worker makes the ProcessPoolExecutor become "broken". A similar thing should be done for multiprocessing.Pool but it's a more complex object.
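The concurrent.futures behaviour described above can be seen directly: when a worker process dies abruptly, pending futures fail with BrokenProcessPool and the executor refuses further work rather than risk corrupted queues. A minimal demonstration (the die helper is illustrative; os._exit skips all cleanup, mimicking a crash):

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def die():
    # Simulate a worker crashing hard: os._exit bypasses
    # interpreter shutdown and any queue cleanup.
    os._exit(1)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as executor:
        future = executor.submit(die)
        try:
            future.result()
        except BrokenProcessPool:
            print("executor is now broken")
```

Any subsequent submit() on a broken executor also raises BrokenProcessPool; the whole pool is considered unusable, which is the conservative design multiprocessing.Pool would need an equivalent of.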

pfmoore commented 12 years ago

As an alternative, maybe leave the "stuck" worker, but allow the pool to recognise when a worker has not processed new messages for a long period and spawn an extra worker to replace it. That would avoid the starvation issue, and the stuck workers would die when the pool is terminated.
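For reference, Pool already does this kind of replacement for workers that exit: with maxtasksperchild, a worker that reaches its task quota is reaped and a fresh one is spawned in its place. The suggestion above would extend that respawn logic to workers that are alive but unresponsive. The existing behaviour can be demonstrated like this:

```python
import os
from multiprocessing import Pool

if __name__ == "__main__":
    # maxtasksperchild=1 makes the pool retire each worker after a
    # single task and spawn a replacement, so every task below runs
    # in a different process.
    with Pool(processes=1, maxtasksperchild=1) as pool:
        pids = [pool.apply_async(os.getpid).get() for _ in range(3)]
    print(len(set(pids)))  # distinct worker PIDs
```

The gap is that this respawn machinery only triggers when a worker exits; a worker hung inside a system call never reaches its task quota, so it is never replaced.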