Yiling-J opened 2 weeks ago
The problem here is reference count contention. Every thread has to acquire the lock for client.get in order to manage its reference count, and that is what causes the slowdown. The reason this is only apparent for methods and not functions is that, by default, Python enables deferred reference counting (moving reference counting work to the garbage collection cycle) for function objects, but not for method objects.
Unfortunately, we can't enable it for that many objects, because it extends object lifetimes. Many programs rely on things like __del__ being called right after something goes out of scope (even though they shouldn't). However, adding an API to do this explicitly might be worthwhile (cc @colesbury)
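To make that concrete, here is a minimal sketch of the access pattern under discussion (the Client class and its get method are hypothetical stand-ins; the original benchmark script isn't shown here):

from threading import Thread

class Client:
    def get(self, i):
        return i

client = Client()

def worker():
    for i in range(500000):
        # Each call reads the shared `client` global and creates a bound
        # method from `Client.get`; without deferred reference counting,
        # every thread updates those objects' shared reference counts,
        # which is where the contention comes from.
        client.get(i)

threads = [Thread(target=worker) for _ in range(6)]
for t in threads:
    t.start()
for t in threads:
    t.join()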
I've spun up a toy build with an API inside gc to enable deferred reference counting, and that helps performance significantly. On my end, here's what the benchmark looks like without deferred reference counting:
no threading class
265,006,848
threading class
482,098,951
482,566,769
486,775,336
488,567,058
493,592,728
494,866,087
And here it is with deferred reference counting enabled for client and client.get:
no threading class
235,182,978
threading class
290,714,004
299,124,424
302,431,219
302,047,238
304,002,698
305,301,686
I don't think we should expose a Python-level API for deferred rc. It makes other implementations worse off and exposes yet another implementation detail to the user. Exposing an unstable C-level API still makes sense.
It makes other implementations worse off
Yeah, which is why it would make sense to give that tradeoff to the user.
Anyway, something like PyUnstable_Object_SetDeferredRefcount sounds reasonable. We do need to account for the assertions inside _PyObject_SetDeferredRefcount, though, and raise an exception when they aren't met instead.
I've created #123635 to add an API for this. I'm leaving it as a draft for now, though.
For developers unfamiliar with the C API (me), a Python-level API would be much more accessible, perhaps something like a class attribute __unsafe_deferred_rc__ = True (see the sketch after this comment). While some code might require __del__ to be called immediately, I think most code does not. BTW, method is also mentioned in PEP 703's Deferred Reference Counting section.
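As a purely hypothetical sketch of that suggestion (this attribute does not exist in CPython; the name is only the proposal above):

class Client:
    # Hypothetical opt-in flag, not a real CPython attribute: ask the
    # interpreter to defer reference counting for this class's instances
    # and methods, accepting that __del__ may run later than usual.
    __unsafe_deferred_rc__ = True

    def get(self, i):
        return i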
I don't think we have a way to denote unstable Python APIs, do we?
BTW, method is also mentioned in PEP 703's Deferred Reference Counting section.
Oh! That's good to hear. Adding deferred RC to client.get probably did nothing then, but it should have helped client.
In my opinion, we should avoid exposing implementation details in the C API as much as possible. Deferred reference counting is an implementation detail of CPython itself, not of Python.
cc @vstinner
That's why my PR puts it in PyUnstable. Deferred RC is handy for speeding up free-threading, as shown in this case -- it would be nice if we let users take advantage of it.
Here are the results on a 32-core machine with 32 threads:
No threading class:
43,692,595
Threading class:
3,148,864,844
3,214,454,065
3,354,286,011
...
No threading function:
38,085,539
Threading function:
83,932,004
87,322,433
87,612,766
...
The cost of contention is extremely high. PEP 703 mentions, "A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently." However, it seems that while functions get deferred reference counting, methods do not. Python users might not realize this when using free-threaded Python, and if they don't read the C API documentation, they may never learn that methods can become a bottleneck at scale and that they need to enable deferred RC manually through the C API.
PEP 703 also mentions that lists will have a fast path. However, based on my tests, lists are slow, while tuples perform very quickly.
A benchmark on global list access:
from threading import Thread
import time

gl = [1]  # shared global list read by every thread

def getg(i):
    return gl[0]

def getl(i):
    # Builds a fresh (thread-local) list on each call before indexing it.
    return [1][0]

def bench_run(runner):
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    d = time.monotonic_ns() - s
    print(f"{d:,}")

def bench_run_parallel(count, runner):
    tl = []
    for i in range(count):
        t = Thread(target=bench_run, args=[runner])
        tl.append(t)
        t.start()
    for t in tl:
        t.join()

if __name__ == '__main__':
    print("no threading global list")
    bench_run(getg)
    print("\nthreading global list")
    bench_run_parallel(6, getg)
    print("\nno threading local list")
    bench_run(getl)
    print("\nthreading local list")
    bench_run_parallel(6, getl)
no threading global list
49,554,181
threading global list
592,800,310
593,596,259
595,105,848
600,674,214
604,215,793
605,159,285
no threading local list
59,097,610
threading local list
63,532,560
64,400,105
65,673,563
65,818,533
67,569,054
68,702,698
Using a tuple as the global list (gl = (1,)):
no threading global list
46,807,489
threading global list
48,496,945
49,355,437
49,688,850
49,730,008
50,377,615
49,933,182
Ah, this might be a bug: per PEP 703, deferred reference counting is enabled for the actual function
objects stored on a type, but not for the bound methods returned by the descriptors -- I'm not sure that was intentional.
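For background, this is long-standing Python behavior and easy to verify: each attribute access creates a fresh bound-method object around the single function object stored on the type, so threads constantly create and destroy method objects even though the underlying function is shared:

class Client:
    def get(self, i):
        return i

client = Client()

# Two separate bound-method objects wrapping the same function object:
m1 = client.get
m2 = client.get
print(m1 is m2)                    # False: a new method object per access
print(m1.__func__ is m2.__func__)  # True: the same function underneath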
If we really want to expose a Python API, let's add something not specific to deferred RC -- something like sys.thread_hot() could work. That could enable deferred RC as an implementation detail, but would be free to change internally in the future; the only contract that can't change is that it marks an object as "hot" across threads (I think that's pretty universal and doesn't expose any implementation details).
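Usage could look something like this sketch (sys.thread_hot is only the hypothetical name floated above, not an existing API, so this does not run on any current build):

import sys

class Client:
    def get(self, i):
        return i

client = Client()

# Hypothetical calls: mark objects that many threads access as "hot" and
# let the interpreter apply deferred reference counting (or any future
# optimization) as an internal detail.
sys.thread_hot(client)
sys.thread_hot(Client.get)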
@ZeroIntensity I might have found another bug. In my global list benchmark, if I use gl = tuple([1]), which is still a tuple, performance slows down again.
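For reference, the only difference between the fast and slow variants of the benchmark above is how the global is constructed; both expressions evaluate to the same tuple (1,):

gl = (1,)        # tuple constant: fast in the threaded benchmark
gl = tuple([1])  # same value built at runtime: reported slow again under threads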
Both are slow on the main branch (main:d8c0fe1). I'm not sure whether this is intentional; if the contention case is expected, then it might not be a bug.
Python 3.14.0a0 experimental free-threading build (heads/main:d8c0fe1, Sep 18 2024, 09:23:28) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
no threading class
44,270,189
threading class
432,229,283
434,219,158
434,082,051
435,817,421
436,193,162
436,824,350
no threading function
34,629,009
threading function
310,946,617
310,733,662
314,075,160
314,107,274
314,759,811
314,907,503
I wouldn't expect any speedup on a build with the GIL enabled. Since there's no I/O here, the GIL is never released to wait for anything, and since only one thread can execute at a time, threading just adds overhead.
@ZeroIntensity Thanks for keeping an eye on my issue. From what I understand, "experimental free-threading build" means I'm already using a no-GIL build? (A quick way to verify this is sketched after the summary below.) Let me summarize my findings; I always use 6 threads on a 6-core machine:
tuple([1]) is significantly slower than (1,) on the 3.13.0rc1 free-threading build.
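As a quick runtime check that the GIL is really off (sys._is_gil_enabled() exists as of CPython 3.13):

import sys

# Prints False on a free-threaded build running with the GIL disabled,
# and True if the GIL was re-enabled (e.g. PYTHON_GIL=1 or -X gil=1).
print(sys._is_gil_enabled())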
Bug report
Bug description:
Processor: 2.6 GHz 6-Core Intel Core i7
Python 3.13.0rc1 experimental free-threading build (main, Sep 3 2024, 09:59:18) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
When using 6 threads, the instance method becomes approximately 10 times slower, while the plain function shows comparable performance.
Python 3.12.5 (main, Sep 2 2024, 15:11:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin
CPython versions tested on:
3.13
Operating systems tested on:
macOS
Linked PRs