Open colesbury opened 1 week ago
Hmm, actually it looks like a thread safety issue in methodcaller
.
Yeah, I noticed early it wasn't triggering if I called the descriptors directly without going through methodcaller
.
I think the optimized vectorcall
implementation from https://github.com/python/cpython/pull/107201 is not reentrant or thread-safe (even with the GIL) because the shared vectorcall_args
array is modified during each invocation:
Most (or all) of our current vectorcall implementations extract out args[0]
early, but that's not guaranteed and the vectorcall protocol is a public API.
For now, I'm planning to just disable the optimization in the free threading build (https://github.com/python/cpython/pull/127109), but I don't see an easy way of making it thread-safe, and I think we should consider reverting #107201 in 3.13 and main.
cc @eendebakpt @corona10 @vstinner
I agree this looks like a bug also in the normal build. The issue is hard to trigger in the normal build though, as indeed most vectorcall implementations (including the PyObject_VectorcallMethod
used by the methodcaller) get args[0]
before doing any work).
One option to make the vectorcall implementation thread-safe (and re-entrant) would be to copy the entire mc->vectorcall_args
to a temporary buffer before setting the first element to args[0]
. An implementation of this idea is https://github.com/python/cpython/compare/main...eendebakpt:methodcaller_ft.
A quick benchmark (using the script from https://github.com/python/cpython/pull/107201) shows there is still a significant performance increase:
Current main:
call: Mean +- std dev: 48.2 ns +- 4.1 ns
creation: Mean +- std dev: 58.3 ns +- 3.4 ns
creation+call: Mean +- std dev: 114 ns +- 5 ns
call kwarg: Mean +- std dev: 63.8 ns +- 3.7 ns
creation kwarg: Mean +- std dev: 92.7 ns +- 3.1 ns
creation+call kwarg: Mean +- std dev: 187 ns +- 7 ns
Disable the vectorcall (this is the solution Sam added to the FT build):
call: Mean +- std dev: 78.2 ns +- 3.2 ns
creation: Mean +- std dev: 56.6 ns +- 2.7 ns
creation+call: Mean +- std dev: 136 ns +- 6 ns
call kwarg: Mean +- std dev: 124 ns +- 6 ns
creation kwarg: Mean +- std dev: 92.3 ns +- 4.1 ns
creation+call kwarg: Mean +- std dev: 228 ns +- 10 ns
Copy mc->vectorcall_args
to a temporary buffer:
call: Mean +- std dev: 50.9 ns +- 1.6 ns
creation: Mean +- std dev: 56.6 ns +- 4.8 ns
creation+call: Mean +- std dev: 118 ns +- 6 ns
call kwarg: Mean +- std dev: 64.8 ns +- 2.5 ns
creation kwarg: Mean +- std dev: 91.9 ns +- 4.9 ns
creation+call kwarg: Mean +- std dev: 194 ns +- 9 ns
(all benchmarks without PGO and with the normal Python build)
@colesbury Is this idea an option for the free-threading build? @Yhg1s What would you like to do for the 3.13 branch?
An implementation of this idea is https://github.com/python/cpython/compare/main...eendebakpt:methodcaller_ft.
Can you create a PR? I cannot open the link, it takes too long and GitHub displays an error.
I removed the draft status from the PR (thanks to @colesbury for some good suggestions). Both PR #127109 or reverting #107201 are good options I think (a classic trade-off between performance and added complexity, except that in this case the status quo is not good)
Bug report
EDIT: edited to clarify that the issue is in the C implementation of
operator.methodcaller
.Originally reported by @ngoldbaum in https://github.com/crate-py/rpds/issues/101
Reproducer
```python from operator import methodcaller from concurrent.futures import ThreadPoolExecutor class HashTrieMap(): def keys(self): return None def values(self): return None def items(self): return None num_workers=1000 views = [methodcaller(p) for p in ["keys", "values", "items"]] def work(view): m, d = HashTrieMap(), {} view(m) view(d) iterations = 10 for _ in range(iterations): executor = ThreadPoolExecutor(max_workers=num_workers) for view in views: futures = [executor.submit(work, view) for _ in range(num_workers)] results = [future.result() for future in futures] ```Once every 5-10 runs, the program prints:
The problem is that
operator.methodcaller
is not thread-safe because it modifies thevectorcall_args
, which is shared across calls:https://github.com/python/cpython/blob/0af4ec30bd2e3a52350344d1011c0c125d6dcd71/Modules/_operator.c#L1646-L1666
I think this is generally unsafe, not just for free threading. The
vectorcall
args array needs to be valid for the duration of the call, and it's possible formethodcaller
to be called reentrantly or by another thread while the call is still ongoing.Linked PRs