Closed ericsnowcurrently closed 4 years ago
In order to keep subinterpreters properly isolated, objects from one interpreter should not be used in C-API calls in another interpreter. That is relatively straightforward except in one case: indicating that the other interpreter doesn't need the object to exist any more (similar to PyBuffer_Release() but more general). I consider the following solutions to be the most viable. Both make use of refcounts to protect cross-interpreter usage (with an incref before sharing).
I favor #2, since it is more generally applicable. #3 would probably be built on #2 anyway. #1 is relatively inefficient. With #2, Py_AddPendingCall() would become a simple wrapper around the new private function.
As a lesser (IMHO) alternative, we could also modify Py_DECREF to respect a new "shared" marker of some sort (perhaps relative to bpo-33607), but that would probably still require one of the refcount-based solutions (and add a branch to Py_DECREF).
"That is relatively straightforward except in one case: indicating that the other interpreter doesn't need the object to exist any more (similar to PyBuffer_Release() but more general)"
Why would an interpreter access an object from a different interpreter? Each interpreter should have its own object space, no?
Adding a low level callback based mechanism to ask another interpreter to do work seems like a good way to go to me.
As an example of why that might be needed, consider the case of sharing a buffer exporting object with another subinterpreter: when the memoryview in the subinterpreter is destroyed, it needs to request that the buffer view be released in the source interpreter that actually owns the original object.
I would suggest that sharing of objects between interpreters should be stamped out. Could we have some #ifdef debug checking that would warn or assert so this doesn't happen? I know currently we share a lot of objects. However, in the long term, that does not seem like the correct design. Instead, each interpreter should have its own objects and any passing or sharing should be tightly controlled.
@Neil, we're definitely on the same page. In fact, in a world where subinterpreters do not share a GIL, we can't ever use an object in one interpreter that was created in another (due to thread safety on refcounts). The role of "tightly controlling" passing/sharing objects (for a very loose definition of "sharing") falls to the channels described in PEP-554. [1]
However, there are several circumstances where interpreters may collaborate that involves one holding a reference (but not using it) to an object owned by the other. For instance, see PyBuffer_Release(). [2] This issue is about addressing that situation safely. It is definitely not about safely using objects from other interpreters.
[1] The low-level implementation, including channels, already exists in Modules/_xxsubinterpretersmodule.c. [2] https://docs.python.org/3/c-api/buffer.html#c.PyBuffer_Release
New changeset ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465 by Eric Snow in branch 'master': bpo-33608: Factor out a private, per-interpreter _Py_AddPendingCall(). (GH-11617) https://github.com/python/cpython/commit/ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465
python/cpython#55826 introduced a slight performance regression which will need to be resolved. If I can't find the problem right away then I'll probably temporarily revert the commit.
See https://mail.python.org/pipermail/python-dev/2019-February/156451.html.
At least, it *might* be a performance regression. Here are the two commits I tried:
after: ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465 (PR bpo-11617, from this issue) before: 463572c8beb59fd9d6850440af48a5c5f4c0c0c9 (the one right before)
After building each (a normal build, not debug), I ran the micro-benchmark Raymond referred to ("./python Tools/scripts/var_access_benchmark.py") multiple times. There was enough variability between runs from the same commit that I'm uncertain at this point if there really is any performance regression. For the most part the results between the two commits are really close. Also, the results from the "after" commit fall within the variability I've seen between runs of the "before" commit.
I'm going to try with the "performance" suite (https://github.com/python/performance) to see if it shows any regression.
Here's what I did (more or less):
$ python3 -m pip install --user perf
$ python3 -m perf system tune
$ git clone git@github.com:python/performance.git
$ git clone git@github.com:python/cpython.git
$ cd cpython
$ git checkout ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465
$ ./configure
$ make -j8
$ ./python ../performance/pyperformance run --fast --output speed.after
$ git checkout HEAD^
$ make -j8
$ ./python ../performance/pyperformance run --fast --output speed.before
$ ./python ../performance/pyperformance compare speed.before speed.after -O table
Here are my results:
speed.before
============

Performance version: 0.7.0
Report on Linux-4.15.0-45-generic-x86_64-with-glibc2.26
Number of logical CPUs: 4
Start date: 2019-02-25 20:39:44.151699
End date: 2019-02-25 20:48:04.817516

speed.after
===========

Performance version: 0.7.0
Report on Linux-4.15.0-45-generic-x86_64-with-glibc2.26
Number of logical CPUs: 4
Start date: 2019-02-25 20:29:56.610702
End date: 2019-02-25 20:38:11.667461
| Benchmark | speed.before | speed.after | Change | Significance |
|---|---|---|---|---|
| 2to3 | 447 ms | 440 ms | 1.02x faster | Not significant |
| chaos | 155 ms | 156 ms | 1.01x slower | Not significant |
| crypto_pyaes | 154 ms | 152 ms | 1.01x faster | Not significant |
| deltablue | 10.7 ms | 10.5 ms | 1.01x faster | Not significant |
| django_template | 177 ms | 172 ms | 1.03x faster | Significant (t=3.66) |
| dulwich_log | 105 ms | 106 ms | 1.01x slower | Not significant |
| fannkuch | 572 ms | 573 ms | 1.00x slower | Not significant |
| float | 149 ms | 146 ms | 1.02x faster | Not significant |
| go | 351 ms | 349 ms | 1.00x faster | Not significant |
| hexiom | 14.6 ms | 14.4 ms | 1.01x faster | Not significant |
| html5lib | 126 ms | 122 ms | 1.03x faster | Significant (t=3.46) |
| json_dumps | 17.6 ms | 17.2 ms | 1.02x faster | Significant (t=2.65) |
| json_loads | 36.4 us | 36.1 us | 1.01x faster | Not significant |
| logging_format | 15.2 us | 14.9 us | 1.02x faster | Not significant |
| logging_silent | 292 ns | 296 ns | 1.01x slower | Not significant |
| logging_simple | 13.7 us | 13.4 us | 1.02x faster | Not significant |
| mako | 22.9 ms | 22.5 ms | 1.02x faster | Not significant |
| meteor_contest | 134 ms | 134 ms | 1.00x faster | Not significant |
| nbody | 157 ms | 161 ms | 1.03x slower | Significant (t=-3.85) |
| nqueens | 134 ms | 132 ms | 1.02x faster | Not significant |
| pathlib | 30.1 ms | 31.0 ms | 1.03x slower | Not significant |
| pickle | 11.5 us | 11.6 us | 1.01x slower | Not significant |
| pickle_dict | 29.5 us | 30.5 us | 1.03x slower | Significant (t=-6.37) |
| pickle_list | 4.54 us | 4.58 us | 1.01x slower | Not significant |
| pickle_pure_python | 652 us | 651 us | 1.00x faster | Not significant |
| pidigits | 212 ms | 215 ms | 1.02x slower | Not significant |
| python_startup | 11.6 ms | 11.5 ms | 1.01x faster | Not significant |
| python_startup_no_site | 8.07 ms | 7.95 ms | 1.01x faster | Not significant |
| raytrace | 729 ms | 722 ms | 1.01x faster | Not significant |
| regex_compile | 249 ms | 247 ms | 1.01x faster | Not significant |
| regex_dna | 203 ms | 204 ms | 1.00x slower | Not significant |
| regex_effbot | 3.55 ms | 3.55 ms | 1.00x faster | Not significant |
| regex_v8 | 28.3 ms | 28.3 ms | 1.00x slower | Not significant |
| richards | 105 ms | 105 ms | 1.00x faster | Not significant |
| scimark_fft | 429 ms | 436 ms | 1.01x slower | Not significant |
| scimark_lu | 238 ms | 237 ms | 1.00x faster | Not significant |
| scimark_monte_carlo | 144 ms | 139 ms | 1.04x faster | Significant (t=3.61) |
| scimark_sor | 265 ms | 260 ms | 1.02x faster | Not significant |
| scimark_sparse_mat_mult | 5.41 ms | 5.25 ms | 1.03x faster | Significant (t=4.26) |
| spectral_norm | 174 ms | 173 ms | 1.01x faster | Not significant |
| sqlalchemy_declarative | 216 ms | 214 ms | 1.01x faster | Not significant |
| sqlalchemy_imperative | 41.6 ms | 41.2 ms | 1.01x faster | Not significant |
| sqlite_synth | 3.99 us | 3.91 us | 1.02x faster | Significant (t=2.49) |
| sympy_expand | 559 ms | 559 ms | 1.00x faster | Not significant |
| sympy_integrate | 25.2 ms | 25.3 ms | 1.00x slower | Not significant |
| sympy_str | 261 ms | 260 ms | 1.00x faster | Not significant |
| sympy_sum | 136 ms | 138 ms | 1.01x slower | Not significant |
| telco | 8.36 ms | 8.32 ms | 1.00x faster | Not significant |
| tornado_http | 273 ms | 271 ms | 1.01x faster | Not significant |
| unpack_sequence | 58.8 ns | 60.0 ns | 1.02x slower | Not significant |
| unpickle | 21.5 us | 21.5 us | 1.00x faster | Not significant |
| unpickle_list | 5.60 us | 5.55 us | 1.01x faster | Not significant |
| unpickle_pure_python | 497 us | 481 us | 1.03x faster | Significant (t=5.04) |
| xml_etree_generate | 141 ms | 140 ms | 1.01x faster | Not significant |
| xml_etree_iterparse | 131 ms | 133 ms | 1.01x slower | Not significant |
| xml_etree_parse | 186 ms | 187 ms | 1.00x slower | Not significant |
| xml_etree_process | 115 ms | 113 ms | 1.01x faster | Not significant |
...so it doesn't appear that my PR introduces a performance regression.
IMHO there is no performance regression at all, just noise in results that are reported without a standard deviation.
FYI, I have a couple of small follow-up changes to land before I close this issue.
I suspect changes for this issue may be creating test_io failures on my Windows builders, most notably my x86 Windows 7 builder, where test_io has been failing on both attempts in almost all builds. It fails with a lock failure during interpreter shutdown, and commit ef4ac967 appears to be the most plausible commit out of the set introduced in the first failing build on Feb 24.
See https://buildbot.python.org/all/#/builders/58/builds/1977 for the first failure. test_io has failed both attempts on all but 3 of the subsequent 16 builds of the 3.x branch.
It might be worth noting that this builder is slow, so if there are timeouts involved or a race condition of any sort it might be triggering it more readily than other builders. I do, however, see the same failures - albeit less frequently and not usually on both tries - on the Win8 and Win10 builders.
For what it's worth, one other oddity: while test_multiprocessing* failures are relatively common on the Win7 builder during the first round of tests due to exceeding the timeout, they usually pass on the retry. Over this same time frame, though, they have begun failing (not due to timeout) even on the second attempt, which is unusual. It might be coincidental, but similar failures are also showing up sporadically on the Win8/Win10 builders, where such failures are not common at all.
New changeset b05b711a2cef6c6c381e01069dedac372e0b9fb2 by Eric Snow in branch 'master': bpo-33608: Use _Py_AddPendingCall() in _PyCrossInterpreterData_Release(). (gh-12024) https://github.com/python/cpython/commit/b05b711a2cef6c6c381e01069dedac372e0b9fb2
New changeset bda918bf65a88560ec453aaba0758a9c0d49b449 by Eric Snow in branch 'master': bpo-33608: Simplify ceval's DISPATCH by hoisting eval_breaker ahead of time. (gh-12062) https://github.com/python/cpython/commit/bda918bf65a88560ec453aaba0758a9c0d49b449
I've merged the PR for hoisting eval_breaker before the ceval loop starts. @Raymond, see if that addresses the performance regression you've been seeing.
I'll close this issue after I've sorted out the buildbot issues David mentioned (on his Windows builders).
If I can help with testing or builder access or anything just let me know. It appears that I can pretty reliably trigger the error through just the individual tests (test_daemon_threads_shutdown_std{out,err}_deadlock) in isolation, which is definitely less tedious than a full buildbot run to test a change.
BTW, while not directly related since it was only just merged, and I won't pretend to have any real understanding of the changes here, I do have a question about PR 12024 ... it appears HEAD_UNLOCK is used twice in _PyInterpreterState_LookUpID. Shouldn't one of those be HEAD_LOCK?
I suspect changes for this issue may be creating test_io failures on my windows builders, (...)
I created bpo-36177 to track this regression.
The commit ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465 introduced a regression in test.test_multiprocessing_spawn.WithThreadsTestManagerRestart.test_rapid_restart: bpo-36114.
Would it be possible to revert it until the bug is properly understood and fixed?
The commit ef4ac967e2f3a9a18330cc6abe14adb4bc3d0465 introduced a regression in test.test_multiprocessing_spawn.WithThreadsTestManagerRestart.test_rapid_restart: bpo-36114.
There is a very similar bug on Windows: bpo-36116.
I'm now also quite confident that bpo-36177 is also a regression caused by this issue.
New changeset 4d61e6e3b802399be62a521d6fa785698cb670b5 by Victor Stinner in branch 'master': Revert: bpo-33608: Factor out a private, per-interpreter _Py_AddPendingCall(). (GH-11617) (GH-12159) https://github.com/python/cpython/commit/4d61e6e3b802399be62a521d6fa785698cb670b5
Hi Eric, I'm sorry but I had to revert your recent work. It introduced regressions in at least test_io and test_multiprocessing_spawn on Windows and FreeBSD. I simply don't have the bandwidth to investigate this regression yet, whereas we really need the CI to remain stable to catch other regressions (the master branch received multiple new features recently, and other regressions are coming in that way ;-)).
test_multiprocessing_spawn is *very* hard to reproduce on FreeBSD, but I can reliably reproduce it. It just takes time. The issue is a crash producing a coredump. I consider that the bug is important enough to justify a revert.
The revert is an opportunity to sit down and track the root cause rather than rushing to push a "temporary" quickfix. This bug seems to be very deep in the Python internals: thread state during Python shutdown. So it will take time to fully understand it and fix it (or redesign your recent changes, I don't know).
Tell me if you need my help to reproduce the bug. The bug has been seen on FreeBSD but also Windows:
-- The Night's Watch
For curious people, Pablo Galindo spent a few hours investigating https://bugs.python.org/issue36114 and I spent a few more hours finishing the analysis. The bug indirectly crashed my laptop :-) https://twitter.com/VictorStinner/status/1102528982036201478
That's what I mean by "I don't have the bandwidth": we already spent hours to identify the regression ;-)
That's okay, Victor. Thanks for jumping on this. I'll take a look when I get a chance.
From what I saw, your first commit was enough to reproduce the crash.
If I recall correctly, Antoine Pitrou modified the GIL so threads exit immediately when Py_Finalize() is called. I'm thinking of:
void
PyEval_RestoreThread(PyThreadState *tstate)
{
...
take_gil(tstate);
/* _Py_Finalizing is protected by the GIL */
if (_Py_IsFinalizing() && !_Py_CURRENTLY_FINALIZING(tstate)) {
drop_gil(tstate);
PyThread_exit_thread();
Py_UNREACHABLE();
}
...
PyThreadState_Swap(tstate);
}
Problem: this code uses tstate, whereas the crash occurred because tstate pointed to freed memory:
""" Thread 1 got the crash
(gdb) p *tstate
$3 = {
    prev = 0xdbdbdbdbdbdbdbdb,
    next = 0xdbdbdbdbdbdbdbdb,
    interp = 0xdbdbdbdbdbdbdbdb,
    ...
}
...
Thread 1 (LWP 100696):
... """
https://bugs.python.org/issue36114#msg337090
When this crash occurred, Py_Finalize() already completed in the main thread!
"""
void _Py_NO_RETURN
Py_Exit(int sts)
{
if (Py_FinalizeEx() < 0) { /* <==== DONE! */
sts = 120;
}
exit(sts); /* <=============== Crash occurred here! */
} """
Py_Finalize() is supposed to wait for threads before deleting Python thread states:
"""
int
Py_FinalizeEx(void)
{
...
/* The interpreter is still entirely intact at this point, and the
* exit funcs may be relying on that. In particular, if some thread
* or exit func is still waiting to do an import, the import machinery
* expects Py_IsInitialized() to return true. So don't say the
* interpreter is uninitialized until after the exit funcs have run.
* Note that Threading.py uses an exit func to do a join on all the
* threads created thru it, so this also protects pending imports in
* the threads created via Threading.
*/
call_py_exitfuncs(interp);
...
/* Remaining threads (e.g. daemon threads) will automatically exit
after taking the GIL (in PyEval_RestoreThread()). */
_PyRuntime.finalizing = tstate;
_PyRuntime.initialized = 0;
_PyRuntime.core_initialized = 0;
...
/* Delete current thread. After this, many C API calls become crashy. */
PyThreadState_Swap(NULL);
PyInterpreterState_Delete(interp);
...
} """
The real problem for years has been *daemon threads*, which... BY DESIGN... remain alive after Py_Finalize() exits! But as I explained, they must exit as soon as they attempt to take the GIL.
Thanks for taking a look Victor! That info is helpful. It will likely be a few days before I can sort this out.
Once I have addressed the problem I'll re-merge. I plan on using the "buildbot-custom" branch to make sure the buildbots are happy with the change before making that PR.
The bug is hard to reproduce even manually. I can test a PR for you once it's ready.
New changeset 5be45a6105d656c551adeee7770afdc3b806fbb5 by Eric Snow in branch 'master': bpo-33608: Minor cleanup related to pending calls. (gh-12247) https://github.com/python/cpython/commit/5be45a6105d656c551adeee7770afdc3b806fbb5
New changeset 8479a3426eb7d1840473f7788e639954363ed37e by Eric Snow in branch 'master': bpo-33608: Make sure locks in the runtime are properly re-created. (gh-12245) https://github.com/python/cpython/commit/8479a3426eb7d1840473f7788e639954363ed37e
New changeset e3f4070aee6f2d489416fdcafd51d6b04d661919 by Victor Stinner in branch 'master': bpo-33608: Fix PyEval_InitThreads() warning (GH-12346) https://github.com/python/cpython/commit/e3f4070aee6f2d489416fdcafd51d6b04d661919
New changeset 842a2f07f2f08a935ef470bfdaeef40f87490cfc by Eric Snow in branch 'master': bpo-33608: Deal with pending calls relative to runtime shutdown. (gh-12246) https://github.com/python/cpython/commit/842a2f07f2f08a935ef470bfdaeef40f87490cfc
New changeset f13c5c8b9401a9dc19e95d8b420ee100ac022208 by Eric Snow in branch 'master': bpo-33608: Factor out a private, per-interpreter _Py_AddPendingCall(). (gh-12360) https://github.com/python/cpython/commit/f13c5c8b9401a9dc19e95d8b420ee100ac022208
I have bad news for you, Eric: I'm again able to reproduce the crash at commit f13c5c8b9401a9dc19e95d8b420ee100ac022208.
vstinner@freebsd$ ./python -m test --matchfile=bisect5 test_multiprocessing_spawn --fail-env-changed -F
Run tests sequentially
0:00:00 load avg: 0.69 [ 1] test_multiprocessing_spawn
0:00:06 load avg: 0.80 [ 2] test_multiprocessing_spawn
0:00:12 load avg: 1.19 [ 3] test_multiprocessing_spawn
...
0:01:55 load avg: 1.48 [ 21] test_multiprocessing_spawn
0:02:01 load avg: 1.53 [ 22] test_multiprocessing_spawn
0:02:08 load avg: 1.29 [ 23] test_multiprocessing_spawn
0:02:17 load avg: 1.51 [ 24] test_multiprocessing_spawn
0:02:27 load avg: 2.27 [ 25] test_multiprocessing_spawn
0:02:38 load avg: 3.14 [ 26] test_multiprocessing_spawn
0:02:48 load avg: 3.51 [ 27] test_multiprocessing_spawn
Warning -- files was modified by test_multiprocessing_spawn
  Before: []
  After: ['python.core']
test_multiprocessing_spawn failed (env changed)

== Tests result: ENV CHANGED ==

All 26 tests OK.

1 test altered the execution environment:
    test_multiprocessing_spawn

Total duration: 2 min 59 sec
Tests result: ENV CHANGED
Note: sorry for not testing before; after one long week, I didn't manage to catch up with my very long list of unread emails.
I don't know what should be done. Revert? I don't have the bandwidth to investigate this crash.
I ran "./python -m test --matchfile=bisect5 test_multiprocessing_spawn --fail-env-changed -F" 4 times in parallel: in less than 5 minutes (in fact, I didn't look carefully at the terminal, maybe it was faster), I got 3 core dumps :-(
Thanks for checking, Victor. Don't feel bad about your results, nor about not checking sooner. :) We'll get this sorted out.
For now I'll revert. This is not code that changes very often, so there isn't much benefit to keeping it merged. Testing against a separate branch is just as easy.
Could you point me at an image for that VM, or instructions on how to reproduce it? I hate having to bother you to test my changes! :)
Eric Snow:
For now I'll revert. This is not code that changes very often, so there isn't much benefit to keeping it merged. Testing against a separate branch is just as easy.
Again, Python shutdown is *really* fragile. Last time I tried to "enhance" it, I also introduced random regressions and so I had to revert my changes.
Old info about the crash, should still be relevant: https://bugs.python.org/issue36114#msg337090
Could you point me at an image for that VM, or instructions on how to reproduce it? I hate having to bother you to test my changes! :)
*In theory*, you should be able to reproduce the crash on any platform. But in my experience, bugs which involve multiple threads are simply "more likely" on FreeBSD, because FreeBSD manages threads very differently than Linux. Sometimes a bug can only be reproduced on one specific FreeBSD computer, but once the root issue has been identified, we start to be able to trigger the crash reliably on other platforms (like Linux).
My procedure to reproduce the crash on FreeBSD: https://bugs.python.org/issue36114#msg337092
I'm using FreeBSD 12.0 RELEASE VM hosted on Linux. My FreeBSD is not customized in any way.
On modern Linux distributions, coredumps are no longer written in the current directory but handled by a system service like ABRT on Fedora. For this reason, the Python test runner can "miss" crashes, especially in child processes run by tests (rather than directly in the process used to run the test).
To get a coredump in the current directory on Linux, you can use:
sudo bash -c 'echo "%e.%p" > /proc/sys/kernel/core_pattern'
Manual test:
$ ./python -c 'import ctypes; ctypes.string_at(0)'
Segmentation fault (core dumped)
vstinner@apu$ git status
...
Untracked files:
    python.18343
...

Say hello to the python.18343 coredump!
Usually, running the command which triggers the crash multiple times in parallel (in different terminals, using screen, etc.) makes the crash more likely, since it stresses the system.
Sometimes, I run the Python test suite in parallel to stress the system even more.
The goal of the game is to trigger a race condition which depends on time. Stressing the system helps to "randomize" timings.
New changeset b75b1a3504a0cea6fac6ecba44c10b2629577025 by Eric Snow in branch 'master': bpo-33608: Revert "Factor out a private, per-interpreter _Py_AddPendingCall()." (gh-12806) https://github.com/python/cpython/commit/b75b1a3504a0cea6fac6ecba44c10b2629577025
I tried but failed to reproduce the crash on Linux!?
$ sudo bash -c 'echo "%e.%p" > /proc/sys/kernel/core_pattern'
$ ./python -m test --matchfile=bisect5 test_multiprocessing_spawn --fail-env-changed -F
# wait 5 min
^C
$ ./python -m test --matchfile=bisect5 -j0 test_multiprocessing_spawn --fail-env-changed -F # I added -j0
# wait 5 min
^C
$ ./python -m test --matchfile=bisect5 -j0 test_multiprocessing_spawn --fail-env-changed -F # I added -j0
# wait 5 min
^C
No coredump seen...
FYI AMD64 FreeBSD CURRENT Shared 3.x failed at commit f13c5c8b9401a9dc19e95d8b420ee100ac022208: https://buildbot.python.org/all/#/builders/168/builds/913
But this issue has already been fixed: Eric reverted his change.
@Victor, I set up a FreeBSD 12.0 VM (in Hyper-v) and made sure core files were getting generated for segfaults. Then I cloned the cpython repo, built it (using GCC), and ran regrtest as you recommended. It generated no core files after half an hour. I adjusted the VM down to 1 CPU from 4 and there were no segfaults over an hour and a half of running those 4 test loops. So I've set the VM to 10% of a CPU and still have gotten no core files after over half an hour. The load average has been hovering between 5 and 6. I guess I'm not starving the VM enough. :)
Any ideas of how far I need to throttle the VM? Is there more than just CPU that I need to limit?
Any ideas of how far I need to throttle the VM? Is there more than just CPU that I need to limit?
I don't know how to make the race condition more likely. I'm not sure that starving the CPU helps. Maybe try the opposite: add more CPUs and reduce the number of tests run in parallel.
Did you test commit f13c5c8b9401a9dc19e95d8b420ee100ac022208?
Eric, I'm also seeing the same Win7 and Win10 worker failures with commit b75b1a350 as last time (test_multiprocessing_spawn on both, and test_io on Win7).
For test_multiprocessing_spawn, it fails differently than Victor since no core file is generated, but I assume it's related in terms of child process termination. See for example https://buildbot.python.org/all/#/builders/3/builds/2390 for Win10, where test_mymanager_context fails with:
======================================================================
FAIL: test_mymanager_context (test.test_multiprocessing_spawn.WithManagerTestMyManager)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\buildarea\3.x.bolen-windows10\build\lib\test\_test_multiprocessing.py", line 2747, in test_mymanager_context
self.assertIn(manager._process.exitcode, (0, -signal.SIGTERM))
AssertionError: 3221225477 not found in (0, -15)
(3221225477 is 0xC0000005, which I believe is an access violation)
For some reason, the Windows 7 worker didn't get a test run while your commit was live, but I can reproduce the same error manually.
For test_io, as before, it's a shutdown lock failure:
======================================================================
FAIL: test_daemon_threads_shutdown_stdout_deadlock (test.test_io.CMiscIOTest)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\cygwin\home\db3l\python.test\lib\test\test_io.py", line 4157, in test_daemon_threads_shutdown_stdout_deadlock
self.check_daemon_threads_shutdown_deadlock('stdout')
File "D:\cygwin\home\db3l\python.test\lib\test\test_io.py", line 4148, in check_daemon_threads_shutdown_deadlock
self.assertIn("Fatal Python error: could not acquire lock "
AssertionError: "Fatal Python error: could not acquire lock for <_io.BufferedWriter name='<stdout>'> at interpreter shutdown, possibly due to daemon threads" not found in ''
In manual attempts I have yet to recreate the test_multiprocessing_spawn failure under Win10, but I can usually manage a 25-50% failure rate under Win7 (which is much slower). The test_io failure on Win7, however, appears to be more easily reproducible.
It's possible that I/O is more critical than CPU, or perhaps it's the impact on latency; I seem to exacerbate the test_multiprocessing_spawn failure rate more easily by loading down the host disk than its CPU. I also noticed that the Win10 failure occurred when test_io and test_multiprocessing_spawn overlapped.
While I'm guessing this should happen on any low powered Windows VM, if it would help, I could arrange remote access to the Win7 worker for you. Or just test a change on your behalf. In fairness, it's unlikely to be useful for any significant remote debugging but perhaps at least having somewhere you could test a change, even if just with print-based debugging, might help. And while it might be an independent issue, the test_io failure rate appears to occur more reliably than test_multiprocessing_spawn.
I just noticed that my last message referenced the wrong commit. My test failures were against commit f13c5c8b9401a9dc19e95d8b420ee100ac022208 (the same as Victor).
New changeset 264490797ad936868c54b3d4ceb0343e7ba4be76 by Steve Dower in branch 'master': bpo-33608: Normalize atomic macros so that they all expect an atomic struct (GH-12877) https://github.com/python/cpython/commit/264490797ad936868c54b3d4ceb0343e7ba4be76
@Eric Happy to give you SSH access to one of the FreeBSD buildbots that's experienced the crash if it helps. Just flick me a pubkey, I'm on IRC (koobs @ freenode / #python-dev)
I was able to reproduce the error with version f13c5c8b9401a9dc19e95d8b420ee100ac022208 on FreeBSD 12.0 VM. The error seems to be caused not by those changes, but by lack of synchronization in the multiprocessing.managers.Server. The failure happens when running the "test_shared_memory_SharedMemoryManager_basics" with high CPU load and frequent interrupts e.g. moving some window during test. Mostly I used the "python -m test --fail-env-changed test_multiprocessing_spawn -m 'WithProcessesTestS[hu]*' -F" command to reproduce the crash. By analyzing core dumps I deduced that the crash happens during this call from the parent test process:
class BaseManager(object):
    def _finalize_manager(process, address, authkey, state, _Client):
        ...
        try:
            conn = _Client(address, authkey=authkey)
            try:
                dispatch(conn, None, 'shutdown')
            finally:
                conn.close()
        except Exception:
            pass
Main thread in the multiprocessing child:
class Server(object):
    def serve_forever(self):
        ...
        try:
            accepter = threading.Thread(target=self.accepter)
            accepter.daemon = True
            accepter.start()
            try:
                while not self.stop_event.is_set():
                    self.stop_event.wait(1)
            except (KeyboardInterrupt, SystemExit):
                pass
        finally:
            ...
            sys.exit(0)  # << the main thread has finished and destroyed the interpreter
Worker thread in the multiprocessing child. Locals:

File "/usr/home/user/cpython/Lib/multiprocessing/managers.py", line 214, in handle_request
    c.send(msg)
self = <SharedMemoryServer(....)>
funcname = 'shutdown'
result = None
request = (None, 'shutdown', (), {})
ignore = None
args = ()
kwds = {}
msg = ('#RETURN', None)
Listing:

class Server(object):
    def handle_request(self, c):
        ...
        try:
            result = func(c, *args, **kwds)  # << calls Server.shutdown method
        except Exception:
            msg = ('#TRACEBACK', format_exc())
        else:
            msg = ('#RETURN', result)
        try:
            c.send(msg)  # << crashes with SIGBUS in _send_bytes -> write -> take_gil -> SET_GIL_DROP_REQUEST(tstate->interp)
        except Exception as e:
            try:
                c.send(('#TRACEBACK', format_exc()))
            except Exception:
                pass
        ...

    def shutdown(self, c):
        ...
        try:
            util.debug('manager received shutdown message')
            c.send(('#RETURN', None))
        except:
            import traceback
            traceback.print_exc()
        finally:
            self.stop_event.set()
The worker thread is daemonic and is not terminated during interpreter finalization, so it might still be running and is silently killed when the process exits. The connection (c) has different implementations on the various platforms, so we cannot be sure whether the connection is closed during shutdown, or whether the last "c.send(msg)" blocks until the end of the process, returns instantly, or fails inconsistently. The error was there for a long time, but for two reasons it didn't cause much trouble:
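The shutdown race described above can be sketched in miniature: a daemonic worker sends the reply while the main thread proceeds toward exit, so whether the send completes is timing-dependent, and waiting for the worker before finalization removes the race. This is an illustrative model only, not the actual managers.py patch; the names `replies` and `done` are invented stand-ins.

```python
import threading

replies = []               # stand-in for the connection `c`
done = threading.Event()   # stand-in for Server.stop_event

def handle_request():
    # Like Server.handle_request: send the '#RETURN' reply, then
    # signal that shutdown may proceed.
    replies.append(('#RETURN', None))   # stand-in for c.send(msg)
    done.set()

worker = threading.Thread(target=handle_request)
worker.daemon = True   # daemonic: never joined at interpreter shutdown,
                       # so the real c.send(msg) can be killed mid-write
worker.start()

# The fix direction: have the main thread wait until the reply has
# actually been sent before proceeding to finalization (sys.exit(0)).
done.wait(timeout=5)
worker.join(timeout=5)
assert replies == [('#RETURN', None)]
```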
I haven't managed to find a nice clean test to reproduce the bug automatically. I suggest the changes for the multiprocessing/managers.py in the attachment.
Also it might be viable to add an assertion to verify that take_gil is not called with an uninitialized interpreter. I used the changes in the attachment (take_gil.assert.patch), but it produced errors during test_tracemalloc with f13c5c8b9401a9dc19e95d8b420ee100ac022208. It happens because, during startup with invalid arguments, the interpreter is finalized with pymain_main->pymain_free->_PyRuntime_Finalize before the error is printed. However, the problem seems to be fixed for me in the latest revisions of the master branch, so I upload the diff against it.
That's really helpful, Pavel! Thanks for investigating so thoroughly. I'm going to double check all the places I've made the assumption that "tstate" isn't NULL and likewise for "tstate->interp".
Is there an issue open for the bug in multiprocessing? If not, would you mind opening one?
I've made a few tweaks and Victor did some cleanup, so I'm going to try the PR again. At first I'm also going to disable the _PyEval_FinishPendingCalls() call in _Py_FinalizeEx() and then enable it in a separate PR.
Also, I've opened bpo-37127 specifically to track the problem with _PyEval_FinishPendingCalls() during _Py_FinalizeEx(), to ensure that the disabled call doesn't slip through the cracks. :)
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields: ```python assignee = 'https://github.com/ericsnowcurrently' closed_at =
created_at =
labels = ['expert-subinterpreters', '3.9']
title = 'Add a cross-interpreter-safe mechanism to indicate that an object may be destroyed.'
updated_at =
user = 'https://github.com/ericsnowcurrently'
```
bugs.python.org fields:
```python
activity =
actor = 'vstinner'
assignee = 'eric.snow'
closed = True
closed_date =
closer = 'vstinner'
components = ['Subinterpreters']
creation =
creator = 'eric.snow'
dependencies = []
files = ['48333', '48334', '48731', '48732']
hgrepos = []
issue_num = 33608
keywords = ['patch']
message_count = 82.0
messages = ['317333', '317334', '317344', '317404', '325481', '325548', '336490', '336544', '336596', '336602', '336603', '336677', '336687', '336897', '336948', '336951', '336952', '336975', '337075', '337096', '337110', '337115', '337116', '337117', '337159', '337179', '337218', '337235', '337554', '337557', '337996', '338039', '340057', '340069', '340070', '340071', '340075', '340077', '340079', '340082', '340133', '340139', '340143', '340145', '340666', '340768', '342791', '342792', '344210', '344211', '344240', '344244', '344245', '344355', '344372', '344417', '344429', '344434', '344435', '344438', '344441', '344442', '344443', '344444', '344445', '344446', '357169', '357170', '357179', '358364', '358367', '358748', '364670', '365160', '365163', '365178', '366009', '366013', '366016', '366017', '366019', '366027']
nosy_count = 16.0
nosy_names = ['nascheme', 'ncoghlan', 'vstinner', 'phsilva', 'pmpp', 'Mark.Shannon', 'eric.snow', 'serhiy.storchaka', 'maciej.szulik', 'yselivanov', 'koobs', 'steve.dower', 'pconnell', 'emilyemorehouse', 'Johan Dahlin', 'shprotx']
pr_nums = ['9334', '11617', '11617', '11617', '12024', '12062', '12159', '12240', '12243', '12245', '12246', '12247', '12346', '12360', '12665', '12806', '12877', '13714', '13780']
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue33608'
versions = ['Python 3.9']
```