
Instance method performance issue with free threading (3.13rc1) #123619

Open Yiling-J opened 2 weeks ago

Yiling-J commented 2 weeks ago

Bug report

Bug description:

from threading import Thread
import time

# Instance-method target: calls go through a bound method of one shared object.
class Cache:
    def get(self, i):
        return i

# Plain-function target for comparison.
def get(i):
    return i

client = Cache()

def bench_run(runner):
    # Time 500,000 calls to runner and print the duration in nanoseconds.
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    d = time.monotonic_ns() - s
    print(f"{d:,}")

def bench_run_parallel(count, runner):
    # Run bench_run in `count` threads at once, all sharing the same runner object.
    tl = []
    for i in range(count):
        t = Thread(target=bench_run, args=[runner])
        tl.append(t)
        t.start()

    for t in tl:
        t.join()

if __name__ == '__main__':
    print("no threading class")
    bench_run(client.get)
    print("\nthreading class")
    bench_run_parallel(6, client.get)

    print("\nno threading function")
    bench_run(get)
    print("\nthreading function")
    bench_run_parallel(6, get)

Processor: 2.6 GHz 6-Core Intel Core i7

PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0rc1

PYTHON_GIL=0 python benchmarks/run.py

Python 3.13.0rc1 experimental free-threading build (main, Sep 3 2024, 09:59:18) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin

no threading class
34,626,126

threading class
313,298,115
318,659,929
319,484,384
321,183,006
320,650,898
321,589,085

no threading function
29,890,615

threading function
31,336,844
31,927,592
32,087,732
33,612,093
32,611,520
33,897,321

With 6 threads, the instance method becomes roughly 10 times slower than the single-threaded run, while the plain function shows comparable performance.

For comparison, the same benchmark on a standard GIL build:

Python 3.12.5 (main, Sep 2 2024, 15:11:13) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin

no threading class
27,588,484

threading class
79,383,431
113,182,075
111,067,075
107,679,440
63,825,808
92,679,231

no threading function
26,200,075

threading function
76,489,523
105,636,550
121,436,736
106,408,462
104,572,564
70,686,101

CPython versions tested on:

3.13

Operating systems tested on:

macOS

Linked PRs

ZeroIntensity commented 2 weeks ago

The problem here is reference count contention. Every call has to update the shared reference count of the client.get method object, and with six threads doing that at once the atomic updates contend, causing a slowdown. The reason this shows up for methods but not functions is that, by default, the free-threaded build enables deferred reference counting (skipping most reference count updates and letting the garbage collector determine the object's lifetime) for function objects, but not for method objects.

Unfortunately, we can't enable it for that many objects, because it extends object lifetimes. Many programs rely on things like __del__ being called right after something goes out of scope (even though they shouldn't). However, adding an API to opt in explicitly might be worthwhile (cc @colesbury).
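
To make the distinction concrete, here is a small sketch (the identity check is standard CPython behavior; tying it to the contention is the interpretation in this comment):

from types import MethodType

class Cache:
    def get(self, i):
        return i

client = Cache()

# Attribute access creates a fresh bound-method object each time, so
# two accesses never return the same object:
print(client.get is client.get)        # False
print(type(client.get) is MethodType)  # True

# The benchmark, however, evaluates client.get once and shares that
# single method object across all six threads, so every call updates
# the same reference count from multiple threads at once.
shared_runner = client.get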

I've spun up a toy build with an API to enable deferred reference counting from the gc module, and it helps performance significantly. On my end, here are the benchmark numbers without deferred reference counting:

no threading class
265,006,848

threading class
482,098,951
482,566,769
486,775,336
488,567,058
493,592,728
494,866,087

And here it is with it enabled for client and client.get:

no threading class
235,182,978

threading class
290,714,004
299,124,424
302,431,219
302,047,238
304,002,698
305,301,686
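
For illustration only, usage of such an API might look like the following; the function name is invented for this sketch, and the toy build's actual spelling is not shown in the thread:

import gc

# Hypothetical opt-in; no released CPython exposes this.
gc.set_deferred_refcount(client)

# client.get creates a new bound-method object on every access, so the
# binding has to be stored first and then marked.
runner = client.get
gc.set_deferred_refcount(runner)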

Fidget-Spinner commented 2 weeks ago

I don't think we should expose a Python-level API for deferred rc. It makes other implementations worse off and exposes yet another implementation detail to the user. Exposing an unstable C-level API still makes sense.

ZeroIntensity commented 2 weeks ago

It makes other implementations worse off

Yeah, which is why it would make sense to give that tradeoff to the user.

Anyways, something like PyUnstable_Object_SetDeferredRefcount sounds reasonable. Though, we do need to account for the assertions inside _PyObject_SetDeferredRefcount and raise an exception when they aren't met, instead of asserting.

ZeroIntensity commented 2 weeks ago

I've created #123635 to add an API for this. I'm leaving it as a draft for now, though.

Yiling-J commented 2 weeks ago

For developers unfamiliar with the C API (like me), a Python-level API would be much more accessible, perhaps something like a class attribute __unsafe_deferred_rc__ = True. While some code might require __del__ to be called immediately, I think most code does not. By the way, methods are also mentioned in PEP 703's Deferred Reference Counting section.
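
As a sketch of that suggestion (the attribute is hypothetical; no CPython build reads it, and its exact semantics are unspecified in the comment above):

class Cache:
    # Hypothetical opt-in: this class's instances (and perhaps their
    # bound methods) would get deferred reference counting, at the
    # cost of __del__ running later than usual.
    __unsafe_deferred_rc__ = True

    def get(self, i):
        return i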

ZeroIntensity commented 2 weeks ago

I don't think we have a way to denote unstable Python APIs, do we?

BTW method is also mentioned in PEP 703 Deferred Reference Counting

Oh! That's good to hear. Adding deferred RC to client.get probably did nothing then, but it should have helped client.

corona10 commented 1 week ago

In my opinion, we should avoid exposing implementation details through the C API as much as possible. Deferred reference counting is an implementation detail of CPython, not of Python the language.

cc @vstinner

ZeroIntensity commented 1 week ago

That's why my PR puts it in PyUnstable. Deferred RC is handy for speeding up free-threading, as shown in this case -- it would be nice if we let users take advantage of it.

Yiling-J commented 1 week ago

Here are the results with 32 threads on a 32-core machine:

No threading class:
43,692,595

Threading class:
3,148,864,844
3,214,454,065
3,354,286,011
...

No threading function:
38,085,539

Threading function:
83,932,004
87,322,433
87,612,766
...

The cost of contention is extremely high. PEP 703 says, "A few types of objects, such as top-level functions, code objects, modules, and methods, tend to be frequently accessed by many threads concurrently." In practice, though, functions get deferred reference counting while methods do not. Python users might not realize this when using free-threaded Python, and unless they read the C API documentation they may never learn that methods can become a bottleneck at scale, or that deferred reference counting has to be enabled manually through the C API.
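
One possible user-level mitigation, sketched here as an assumption rather than something established in this thread: have each thread create its own bound-method object instead of sharing one. That removes contention on the method object itself, although the instance's reference count can still be touched from every thread.

import time

def bench_run_rebound(obj):
    # Bind inside the thread: this method object is private to the
    # thread, so its reference count updates never cross threads.
    runner = obj.get
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    print(f"{time.monotonic_ns() - s:,}")

# Usage: Thread(target=bench_run_rebound, args=[client]) in place of
# Thread(target=bench_run, args=[client.get]).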

PEP 703 also mentions that lists will have a fast path. However, based on my tests, lists are slow while tuples are very fast.

A benchmark on global list access:

from threading import Thread
import time

# A list shared by every thread.
gl = [1]

# Reads the shared global list.
def getg(i):
    return gl[0]

# Builds a fresh, thread-private list on each call.
def getl(i):
    return [1][0]

def bench_run(runner):
    s = time.monotonic_ns()
    for i in range(500000):
        runner(i)
    d = time.monotonic_ns() - s
    print(f"{d:,}")

def bench_run_parallel(count, runner):
    tl = []
    for i in range(count):
        t = Thread(target=bench_run, args=[runner])
        tl.append(t)
        t.start()

    for t in tl:
        t.join()

if __name__ == '__main__':
    print("no threading global list")
    bench_run(getg)
    print("\nthreading global list")
    bench_run_parallel(6, getg)

    print("\nno threading local list")
    bench_run(getl)
    print("\nthreading local list")
    bench_run_parallel(6, getl)

no threading global list
49,554,181

threading global list
592,800,310
593,596,259
595,105,848
600,674,214
604,215,793
605,159,285

no threading local list
59,097,610

threading local list
63,532,560
64,400,105
65,673,563
65,818,533
67,569,054
68,702,698

Using a tuple as the global list (gl = (1,)):

no threading global list
46,807,489

threading global list
48,496,945
49,355,437
49,688,850
49,730,008
50,377,615
49,933,182

ZeroIntensity commented 1 week ago

Ah, this might be a bug: per PEP 703, deferred reference counting is enabled for the actual function objects stored on a type, but not for the bound methods returned by the descriptors -- I'm not sure that was intentional.
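
The split is visible from pure Python (standard descriptor behavior, independent of the free-threaded build):

class Cache:
    def get(self, i):
        return i

client = Cache()

# The plain function stored in the type's __dict__ -- per the comment
# above, this object gets deferred reference counting.
func = Cache.__dict__["get"]

# The bound method created by the descriptor protocol on attribute
# access -- this wrapper does not.
meth = client.get

print(meth.__func__ is func)  # True: both wrap the same function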

If we really want to expose a Python API, let's add something that isn't specific to deferred RC -- something like sys.thread_hot() could work. It could enable deferred RC as an implementation detail but remain free to change internally; the only stable contract would be that it marks an object as "hot" across threads (I think that's fairly universal and doesn't expose any implementation details).
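
A sketch of how that might read (sys.thread_hot does not exist; the name, and passing the object as an argument, are assumptions drawn from the suggestion above):

import sys

# Hypothetical: mark objects as "hot" across threads. Today that might
# mean enabling deferred reference counting; the mechanism could change.
sys.thread_hot(client)
sys.thread_hot(client.get)  # would need to mark a stored binding to persist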

Yiling-J commented 1 week ago

@ZeroIntensity I might have found another bug. In my global list benchmark, if I use gl = tuple([1]), which is still a tuple, performance drops again.
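
One observable difference between the two spellings is shown below; whether it explains the slowdown is an assumption, not something established in this thread:

import dis

# (1,) is folded into a compile-time constant: the finished tuple lives
# in the code object and LOAD_CONST just pushes it.
dis.dis(compile("gl = (1,)", "<s>", "exec"))

# tuple([1]) is ordinary runtime construction: a list is built, then
# tuple() is called, yielding a regular heap object.
dis.dis(compile("gl = tuple([1])", "<s>", "exec"))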

Yiling-J commented 4 hours ago

Both spellings are slow on the main branch (main:d8c0fe1). I'm not sure whether this is intentional, but if contention is expected in this case, it might not be a bug.

Python 3.14.0a0 experimental free-threading build (heads/main:d8c0fe1, Sep 18 2024, 09:23:28) [Clang 15.0.0 (clang-1500.0.40.1)] on darwin

no threading class
44,270,189

threading class
432,229,283
434,219,158
434,082,051
435,817,421
436,193,162
436,824,350

no threading function
34,629,009

threading function
310,946,617
310,733,662
314,075,160
314,107,274
314,759,811
314,907,503

ZeroIntensity commented 2 hours ago

I wouldn't expect any speedup on a build with the GIL enabled. Since there's no I/O here, the GIL is never released when waiting for something, and since only one thread can execute at a time, the result is just extra overhead introduced by threading.

Yiling-J commented 1 hour ago

@ZeroIntensity Thanks for keeping an eye on my issue. From what I understand, the experimental free-threading build means I'm already using a no-GIL build? Let me summarize my findings. I always use 6 threads on a 6-core machine: