startover / pythonfutures

Automatically exported from code.google.com/p/pythonfutures

Child process termination not known by Parent error in concurrent.futures #29

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Submit a task using ProcessPoolExecutor.
2. Run kill -9 <one_of_childrens_pid>.
3. The parent process blocks forever.

What is the expected output? What do you see instead?

We encountered an error in which, if a child process dies or crashes, the
parent process is not notified and blocks indefinitely. The other children
end up either blocked or timing out.

We were able to reproduce this scenario with the following code, by killing
one of the children.

#!/home/y/bin64/python2.7

import concurrent.futures
import time
import signal
import os
import sys
import traceback

def just_wait(identifier):
    time.sleep(20)
    return identifier

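def signal_handler(sig, stack):
    # SIGCHLD handler: reap every exited child without blocking, then
    # print the current stack and exit the parent.
    try:
        result = os.waitpid(-1, os.WNOHANG)
        while result[0]:
            print("Reaped child process %s" % result[0])
            result = os.waitpid(-1, os.WNOHANG)
        traceback.print_stack()
        sys.exit()
    except OSError:
        # waitpid raises OSError (ECHILD) when no children remain.
        pass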

def main():
    with concurrent.futures.ProcessPoolExecutor(max_workers=30) as executor:
        future_to_id = [executor.submit(just_wait, i) for i in range(1, 31)]
        for future in concurrent.futures.as_completed(future_to_id):
            returned_id = future.result()
            print("Process Id: %s" % returned_id)

if __name__ == '__main__':
    signal.signal(signal.SIGCHLD, signal_handler)
    main()

The status of one of the child processes:
$ sudo strace -p 30974
Password: 
Process 30974 attached - interrupt to quit
restart_syscall(<... resuming interrupted call ...>) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1410964539, 104107}, NULL) = 0
gettimeofday({1410964539, 104165}, NULL) = 0
futex(0x7f3e698e7000, FUTEX_WAIT_BITSET|FUTEX_CLOCK_REALTIME, 0, {1410964539, 204165000}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
gettimeofday({1410964539, 204812}, NULL) = 0
gettimeofday({1410964539, 204845}, NULL) = 0

The status for parent process:
$ sudo strace -p 30948
Process 30948 attached - interrupt to quit
futex(0x1addc30, FUTEX_WAIT_PRIVATE, 0, NULL
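
The parent's futex wait has no timeout (the last argument is NULL), which is consistent with as_completed() being called without one. As a possible stopgap, here is a minimal sketch that bounds the wait, assuming a fixed deadline is acceptable for the workload. The helper name collect_with_deadline and the timeout value are hypothetical. It does not detect the dead child; it only stops the parent from waiting forever.

```python
import concurrent.futures


def collect_with_deadline(futures, timeout):
    """Gather results until `timeout` seconds pass; return (results, unfinished)."""
    results = []
    try:
        # as_completed raises TimeoutError if some futures are still
        # pending once `timeout` seconds have elapsed.
        for future in concurrent.futures.as_completed(futures, timeout=timeout):
            results.append(future.result())
    except concurrent.futures.TimeoutError:
        pass
    # Futures that never completed, e.g. because their worker was killed.
    unfinished = [f for f in futures if not f.done()]
    return results, unfinished
```

In the script above this would replace the as_completed loop in main(), for example collect_with_deadline(future_to_id, timeout=60), and a non-empty unfinished list would signal that something went wrong.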

What version of the product are you using? On what operating system?
RHEL - 6.4.

Please provide any additional information below.
Here is the related issue, which was fixed in Python 3.3:
http://bugs.python.org/issue9205
Since we are using Python 2.7.5, would it be possible to backport this fix
to futures for 2.7.5 as well?
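
For reference, here is a minimal sketch of the behavior the Python 3.3 fix provides: when a worker dies abruptly, pending futures raise BrokenProcessPool instead of blocking. The helper name worker_death_is_reported is hypothetical, os._exit(1) stands in for kill -9, and the explicit fork context (Python 3.7+, POSIX only) is an assumption made to keep the example self-contained. It does not run on Python 2.7.

```python
import multiprocessing
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def worker_death_is_reported():
    """Return True if an abrupt worker exit surfaces as BrokenProcessPool."""
    # Assumption: the "fork" start method, matching the Linux setup reported here.
    context = multiprocessing.get_context("fork")
    with ProcessPoolExecutor(max_workers=2, mp_context=context) as executor:
        # os._exit(1) ends the worker immediately, with no cleanup,
        # which approximates the worker being killed from outside.
        future = executor.submit(os._exit, 1)
        try:
            future.result(timeout=30)
        except BrokenProcessPool:
            return True
    return False
```

This is the behavior we would like to see from the futures backport: an exception in the parent rather than an indefinite block.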

Original issue reported on code.google.com by immil...@yahoo-inc.com on 17 Sep 2014 at 9:25