Closed 2dd69dfc-b916-403d-a1ec-61ca3f3b66c0 closed 16 years ago
We have a Python HTTP server that, in the parent process, uses os.fork to spawn new children, but at the same time the parent could be spawning new threads (in threads other than the main thread -- only the main thread does forking).
Anwyay, it very rarely causes deadlock in a spawned child when that child tries to start a new thread.
I believe I've tracked this down to the _active_limbo_lock in the threading module. Specifically, if this lock is held by parent (because another thread is spawning a thread), just as os.fork happens, the child inherits the held lock and will then block trying to do any threading.* operations.
Just like the global interp. lock is overwritten in the child after fork, I think something similar should happen to threading._active_limbo_lock? (And more generally the user needs to be aware of locks passing through fork; but I think at least threading.py should "do the right thing").
This thread looks quite relevant:
groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&threadm=38E6F2BA.E66CAC90%40ensim.com&rnum=5&prev=/groups%3Fq%3Dpython%2Bfork%2Bthreading%2Bmodule%2B%2Block%26hl%3Den%26lr%3D%26ie%3DUTF-8%26oe%3DUTF-8%26sa%3DN%26scoring%3Dd
Logged In: YES user_id=539787
See some more typical info about mixing forks and threads: http://mail.python.org/pipermail/python-list/2001-September/066048.html
This seems not to be Python-related, but platform-related.
Logged In: YES user_id=323786
True -- there are many platform specific issues on the interaction of forking and threading.
However the _active_limbo_lock is entirely a Python issue (I think it can deadlock on any platform, unless the platform releases locks in child process after fork).
After forking, python already resets the GIL, and it should also reset any other locks that have global impact like _active_limbo_lock.
It's not only the _active_limbo_lock. All global structures of the threading module should be reinitialized (including the _MainThread instance); for that purpose, reload() can be used. I attach an example which exercises this problem. Normally the script should hang in the last step (using os.fork() and launching threads while some other threads call threading.enumerate() in a loop).
The safe_fork() function in the example is a replacement for os.fork() which tries to avoid any deadlock in the threading module. If it's deemed useful and bug-free, it could perhaps be added somewhere in the stdlib (in the threading module itself perhaps).
(or perhaps we should provide an API to hook into PyOS_AfterFork)
Attached patch releases the _active_limbo_lock after a fork(). It is not a complete solution, since existing Thread objects don't correspond to anything, but it corrects a problem in test_multiprocessing.
A slightly better patch, with tests.
Amaury - your latest patch doesn't seem to apply cleanly to trunk, it's choking on the Python/ceval.c changes
Here is a third patch with latest trunk. Did you apply it to a clean checkout?
I had to flip on ignore whitespace to patch: patch -p0 -l \<~/Desktop/fork-and-thread3.patch
Worked. Unfortunately, test_threading is locking up during a make test. Here's the verbose regrtest.py output:
woot:python-trunk jesse$ ./python.exe Lib/test/regrtest.py -v test_threading ...snip \<thread 9> done \<thread 9> is finished. 0 tasks are running all tasks done ok test_join_in_forked_from_thread (test.test_threading.ThreadJoinOnShutdown) ...
At this point it hangs (OS/X 10.5 latest)
Amaury, it looks like it's hanging in subprocess:
/Users/jesse/open_source/subversion/python- trunk/Lib/test/test_threading.py(338)_run_and_join() -> p = subprocess.Popen([sys.executable, "-c", script], stdout=subprocess.PIPE) (Pdb) step ...snip...
(Pdb) step --Call--
/Users/jesse/open_source/subversion/python- trunk/Lib/threading.py(788)current_thread() -> def current_thread(): (Pdb) step /Users/jesse/open_source/subversion/python- trunk/Lib/threading.py(789)current_thread() -> try: (Pdb) step /Users/jesse/open_source/subversion/python- trunk/Lib/threading.py(790)current_thread() -> return _active[_get_ident()] (Pdb) step /Users/jesse/open_source/subversion/python- trunk/Lib/subprocess.py(1097)_execute_child() -> data = os.read(errpipe_read, 1048576) # Exceptions limited to 1 MB (Pdb) step ... lockup
Here's the good news, with the patch applied to trunk, I'm not seeing hangs in the multiprocessing test suite. I'm running it on a tight loop excluding the threading tests to confirm.
Well I was a bit presumptuous that my thread+fork tests would pass on all platforms out of the box. We should disable the tests, or have someone with better Unix expertise examine and correct them. At least I feel that the actual correction in threading.py goes in the right direction.
In general I suggest replacing the lock with a new lock, rather than trying to release the existing one. Releasing *might* work in this case, only because it's really a semaphore underneath, but it's still easier to think about by just replacing.
I also suggest deleting _active and recreating it with only the current thread.
I don't understand how test_join_on_shutdown could succeed. The main thread shouldn't be marked as done.. well, ever. The test should hang.
I suspect test_join_in_forked_process should call os.waitpid(childpid) so it doesn't exit early, which would cause the original Popen.wait() call to exit before the output is produced. The same problem of test_join_on_shutdown also applies.
Ditto for test_join_in_forked_from_thread.
Looking over some of the other platforms for thread_*.h, I'm sure replacing the lock is the right thing.
A new patch:
I replaced "_active_limbo_lock.release()" by "_active_limbo_lock=_allocate_lock()"
I replaced the successive deletions in _active by a re-creation with only the current thread. There is no difference in the result, but I agree that the intent is more clear.
yes, the main thread is marked as done when the interpreter exits (hence the convoluted tests with subprocesses): in Modules/main.c, WaitForThreadShutdown() calls threading._shutdown().
Also, I hope the tests make more sense now.
FWIW: the threading tests still hang for me with the latest patch. I'm confirming that the patch itself fixes the hanging MP tests though
I can confirm that this seems to clear up the test_multiprocessing hangs on my mac, I ran make test in a loop almost all day yesterday without hangs. I would really suggest we get this in, minus the new test_threading tests for now.
I agree. My attempt to describe the behaviour of fork+threads in a platform-independent test is another issue.
Just one more thing: looking at Module/posixmodule.c, os.fork1() calls PyOS_AfterFork() in both parent and child processes. Is there a "if (pid == 0)" test missing, just like os.fork()?
I am leaving for the whole next week, so anyone, feel free to commit (part of) my patch if it helps.
The existing fork-and-thread4.patch doesn't actually reset _active_limbo_lock. Its missing a "global _active_limbo_lock" statement in the threading._after_fork() function.
and a few more bugs. a new patch is attached. With this applied, pitrou's fork_threading.py bug demonstration script finally does the right thing.
test_threading, including the new deadlock tests, and test_multiprocessing both pass.
Tested on x86 MacOS X 10.4 & x86 Ubuntu 8.04.
Would someone please try this on other platforms?
(The new threading test's use of subprocess with [sys.executable, '-c', """long script"""] makes me slightly nervous about portability outside of Linux and BSD.)
I still don't like the _after_fork() implementation. Its O(n) where n == number of threads the parent process had.
Very wasteful when the fork() was done in the most common case of being followed by an exec and calling os._exit(). It won't scale nicely with system load (forks will start taking longer and longer the more threads exist).
Could os.fork() be extended to have an optional will_exec_or_die parameter that determines if _after_fork() is even called at all? Things like subprocess should pass in True. The default should be False for compatiblity.
I've been testing greg's latest patch and it seems to work for me - I'm not seeing any test_multiprocessing hangs.
I still don't like the _after_fork() implementation. Its O(n) where n == number of threads the parent process had.
It may be O(n) but the inner loop looks very cheap. Even with n == 1000 I'm not sure it would make a difference.
However, are you sure the system thread identifier stays the same after a fork? I see that in _after_fork() you reuse the old ident for new_active instead of getting it from get_ident().
Greg/Antoine - do you have any problem with me applying the latest patch as-is today?
It would be nice to test under Windows first, if you can. Also, this bug entry should stay open until we discuss the remaining details.
Alas, I don't have a windows machine - I agree we should leave it open though
I've applied Greg's patch in 65032 on trunk.
I added a Misc/NEWS note for this in r65057.
This is a good candidate for backporting to release25-maint.
To answer Antoine Pitrou's question about using the old ident vs the new _get_ident(). I don't know if the forked process will have the same thread id. I expect so. If not, the code as recently committed will die on an assertion failure because current_thread() is simply an _active[_get_ident()] dict lookup that returns a DummyThread instance if the lookup fails.
As for Windows, Windows doesn't support os.fork and I don't see any calls to PyOS_AfterFork outside of the Modules/posixmodule.c.
Can somebody also merge this into Py3k, please?
Selon "Gregory P. Smith" \report@bugs.python.org\:
To answer Antoine Pitrou's question about using the old ident vs the new _get_ident(). I don't know if the forked process will have the same thread id.
Then wouldn't it be safer to use _get_ident() rather than re-using the old thread ID? It doesn't make the code any more complicated as far as I know :)
I've merged the change to py3k in r65065
test_3_join_in_forked_from_thread hangs for me in Py3k.
To add to ben's comment, under py3k the third test hangs, if you pull out the basic script code being executed in subprocess:
if 1:
import sys, os, time, threading
# a thread, which waits for the main program to terminate
def joiningfunc(mainthread):
mainthread.join()
print('end of thread')
if 1:
main_thread = threading.current_thread()
def worker():
childpid = os.fork()
if childpid != 0:
os.waitpid(childpid, 0)
sys.exit(0)
t = threading.Thread(target=joiningfunc,
args=(main_thread,))
print('starting worker')
t.start()
print('joining worker')
t.join() # Should not block: main_thread is already stopped
w = threading.Thread(target=worker)
w.start()
You'll note it hangs promptly at the join()
Ben commented out the hanging test, lowering this from a release blocker as the patch is on both trunk and 3k, and minus that third new test, test_threading and test_multiprocessing are both passing
Jesse: thanks for doing the py3k merge.
Antoine: yeah it might be safer to use _get_ident() but since the len(_active) == 1 assert is not firing we're probably fine as is.
A change to this that I was considering making to this code has been attached as threading-874900-improvement-gps01.diff.
-gps
Greg, I'm not sure your improvement patch is right, since some code may be holding a reference to the former _MainThread instance and expecting it to still be part of the active threads container.
On the other hand there are things in the Thread class that may need reinitializing after a fork (e.g. self.__block = Condition(Lock()), and self.__ident = _get_ident() :-))... so perhaps your patch is a good enough approximation of what is needed.
backported to release25-maint in r65789.
One of tests is hanging in 3.0.
specifically, test_3_join_in_forked_from_thread hangs in py3k and is currently disabled.
If you remove the print() call from joining_func(), does it stop hanging? It may be due to differences between the io library in py3k and the builtin file objects in 2.x. (looking at the date of the commit disabling the test, it is not related to the thread-safety changes to BufferedWriter, though)
This is also causing hangs in 2.5.
it passes on release25-maint for me on linux. i don't see it hanging in any of the 2.5 buildbots. looks like your r66003 change was a decent fix for windows judging by the buildbot.
(fwiw, that test you patched in 66003 should probably use readlines() and assertListEqual to be cross platform rather than the code thats there now but i'm not motivated to change that unless some other platform fails on it)
Ok, with this patch the test passes under py3k.
Apart from the trivial typo (_Thread__stopped -> _stopped), I had to change print("...") to os.write(1, b"...\n") in the tests as otherwise subprocess wouldn't receive any output from the third test (buffering problem? I don't know really).
I also added the ident fix I had already suggested in _after_fork(). If you put a print statement at this point you'll see the old and the new value are not the same.
I feel like that patch sort of avoids the problem by changing the test. The test is hanging for some reason, so we should try to fix that, not the test. :) I wonder if it has something to do with the various deadlocks we are discovering in the io library.
Benjamin, if you don't change the test, the deadlock problem is still solved, it's just that the third test fails because the subprocess stdout is empty instead of containing the desired value. It is *not* because the subprocess doesn't print anything (if you launch an equivalent program on the command line, everything is printed), rather it seems that subprocess doesn't get what is printed from the child process of the subprocess.
On Thu, Sep 4, 2008 at 6:08 PM, Antoine Pitrou \report@bugs.python.org\ wrote:
Antoine Pitrou \pitrou@free.fr\ added the comment:
Benjamin, if you don't change the test, the deadlock problem is still solved, it's just that the third test fails because the subprocess stdout is empty instead of containing the desired value. It is *not* because the subprocess doesn't print anything (if you launch an equivalent program on the command line, everything is printed), rather it seems that subprocess doesn't get what is printed from the child process of the subprocess.
Ah! My apologies for the giant misunderstanding.
Instead of os.write(), it is actually sufficient to sys.stdout.flush() at the end of the subprocess. Patch attached.
Just checked that the latest patch works on Windows as well.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = None closed_at =
created_at =
labels = ['interpreter-core', 'type-bug', 'release-blocker']
title = 'threading module can deadlock after fork'
updated_at =
user = 'https://bugs.python.org/mikemccand'
```
bugs.python.org fields:
```python
activity =
actor = 'nirai'
assignee = 'none'
closed = True
closed_date =
closer = 'pitrou'
components = ['Interpreter Core']
creation =
creator = 'mikemccand'
dependencies = []
files = ['9288', '10889', '10932', '11378', '11379']
hgrepos = []
issue_num = 874900
keywords = ['patch', 'needs review']
message_count = 54.0
messages = ['60453', '60454', '60455', '61693', '61695', '69423', '69437', '69466', '69467', '69468', '69470', '69476', '69477', '69478', '69479', '69492', '69512', '69549', '69550', '69569', '69601', '69603', '69607', '69696', '69702', '69784', '69793', '69795', '69825', '69867', '69868', '69878', '69888', '69889', '69893', '69894', '69928', '70174', '71297', '71298', '71301', '71315', '71818', '71824', '72546', '72549', '72550', '72551', '72552', '72591', '72716', '72717', '72721', '72727']
nosy_count = 10.0
nosy_names = ['gregory.p.smith', 'jcea', 'mikemccand', 'tzot', 'amaury.forgeotdarc', 'Rhamphoryncus', 'pitrou', 'benjamin.peterson', 'jnoller', 'bobbyi']
pr_nums = []
priority = 'release blocker'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue874900'
versions = ['Python 3.0']
```