Venkat2811 opened 4 months ago
FYI, the performance issue is not a surprise: specialization is turned off, and deferred reference counting is not yet implemented in the free-threaded build.
But it's worth checking whether there is an unexpected bottleneck here.
Please use 3.13 beta 1. There were a number of scaling bottlenecks fixed between 3.13 alpha 6 and beta 1 (https://github.com/python/cpython/issues/118527).
On my machine, I see a speedup in the free-threaded build vs. the default build (2.4s vs. 8.2s).
With ed2b0fb04474725a38312784e1695a1cd7c0cce1, number of threads: os.cpu_count(), task: fib(5)
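The exact benchmark script isn't included in this thread, so the following is only a rough sketch of a fib-style thread-scaling test; the thread count, the fib argument, and the overall structure are my assumptions, not the original code:

```python
import os
import threading
from time import perf_counter

def fib(n):
    # Deliberately recursive: pure-Python CPU work with no I/O.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def run(n_threads, n=25):
    # One fib(n) computation per thread; on a free-threaded build these
    # can run on separate cores, while on a default build they serialize
    # behind the GIL.
    threads = [threading.Thread(target=fib, args=(n,)) for _ in range(n_threads)]
    t0 = perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return perf_counter() - t0

if __name__ == "__main__":
    n_threads = os.cpu_count() or 1
    print(f"{n_threads} threads: {run(n_threads):.2f}s")
```

On a default build the wall time should grow roughly linearly with the thread count; on a free-threaded build it should stay close to flat until the cores are saturated.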
Thanks for confirming that things have been fixed in beta 1, @colesbury.
I'm waiting for a beta 1 build to be available on deadsnakes. I tried building from source, but unfortunately, after trying several --with-ssl options, I'm still getting "not found" errors.
@colesbury I installed 3.13.0b1 via pyenv:

```shell
env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1
```

Is this the correct way to create a free-threaded build?
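One way to confirm whether an interpreter is actually free-threaded (a check I'd suggest, not something from the thread) is to inspect the build configuration:

```python
import sys
import sysconfig

# Py_GIL_DISABLED is 1 on free-threaded builds, 0 or None otherwise.
flag = sysconfig.get_config_var("Py_GIL_DISABLED")
print("free-threaded build:", bool(flag))

# 3.13+ also reports whether the GIL is enabled at runtime, since a
# free-threaded interpreter can still run with the GIL re-enabled.
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```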
On my M1 MBP: --disable-gil vs default build: 15.5s vs 5.26s
On my AMD Ryzen 7 5800X 8-Core Processor, Ubuntu: --disable-gil vs default build: 10.5s vs 6.4s
It has definitely improved since 3.13.0a6.
cpython-3.13.0b1 on Windows 11, for 20 threads, on a 4-core/8-thread Intel CPU (faster is better):
This is a slightly different benchmark, but there are some potentially useful data points about the speedup with free-threading here: https://github.com/winpython/winpython/issues/1339.
CPython 3.13b3 without the GIL is about 2x slower than with the GIL.
System: Ubuntu 24. python3.13.b03-nogil was downloaded from apt.
The program does actually run across multiple cores with nogil.
I'm doing true-concurrency testing on Python 3.13.0b3, and my results are disappointing. Let me know if I'm not testing multithreading correctly. What could be causing the reduced performance?
```python
from array import array
from concurrent.futures import ThreadPoolExecutor
from time import time

array_size = 100_000
a = array('b', [0 for i in range(array_size)])

def write_by_index(array, indx):
    array[indx] = 1

start = time()
with ThreadPoolExecutor(max_workers=6) as executor:
    for index in range(array_size):
        executor.submit(write_by_index, a, index)
end = time() - start
print(end)
```
My results:

```
python3.13 main.py
>>>1.0930209159851074
python3.13-nogil main.py
>>>1.9819214344024658
```
Next test, where array_size = 10_000_000:

```
python3.13 main.py
>>>117.05215215682983
python3.13-nogil main.py
>>>211.70072531700134
```
PS.
There's an interesting twist here. It seems I was using shared global data, which may have entailed locking overhead (mutexes, if I judge correctly). I tried testing with data local to the function's own scope, and got a 3x performance increase without the GIL:
```python
from concurrent.futures import ThreadPoolExecutor
from time import time

array_size = 1_000_000
count_tasks = 100

def write_by_index(size):
    # some work with local data
    array = [0 for i in range(size)]
    return array

start = time()
with ThreadPoolExecutor(max_workers=6) as executor:
    for index in range(count_tasks):
        executor.submit(write_by_index, array_size)
end = time() - start
print(end)
```
```
time python3.13 main.py
>>>2.7294483184814453
real 0m2.782s
user 0m2.798s
sys 0m0.095s

time python3.13-nogil main.py
>>>1.062140703201294
real 0m1.154s
user 0m5.511s
sys 0m0.674s
```
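For what it's worth, here is a variant of the first script (my own sketch, not from the thread) that hands each worker a contiguous slice of the shared array instead of submitting one task per element. This removes most of the per-task executor.submit overhead, which likely dominates the original timings, while still writing to shared data:

```python
from array import array
from concurrent.futures import ThreadPoolExecutor
from time import perf_counter

array_size = 1_000_000
workers = 6
a = array('b', bytes(array_size))  # zero-initialized shared array

def fill_slice(arr, start, stop):
    # One task per worker: write a whole contiguous chunk.
    for i in range(start, stop):
        arr[i] = 1

chunk = -(-array_size // workers)  # ceiling division
t0 = perf_counter()
with ThreadPoolExecutor(max_workers=workers) as executor:
    for start in range(0, array_size, chunk):
        executor.submit(fill_slice, a, start, min(start + chunk, array_size))
print(perf_counter() - t0)
```

With only six coarse tasks instead of a million tiny ones, the benchmark measures the actual parallel write throughput rather than task-submission overhead.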
Bug report
Bug description:
Hello Team,
Thanks for the great work so far and the recent nogil efforts. I wanted to explore the performance of asyncio with CPU-GPU-bound multi-threading in a nogil setup, so I started with the simple benchmark seen below:
Original Source
Results:
libpython3.13-nogil amd64 3.13.0~a6-1+jammy2
source nogil: htop shows 9 running tasks, and 8 cores close to 100% utilization, but it is slower.
My CPU: AMD Ryzen 7 5800X 8-Core Processor
OS: Ubuntu 22.04.4 LTS
So it looks like there is overhead when using multiple cores. Is this expected with this version? Are results similar on Intel and M1 CPUs as well?
Results documented here for version 3.9.12 on Intel are better.

CPython versions tested on:
3.13

Operating systems tested on:
Linux