Open xbit18 opened 1 month ago
Duplicate of #118749
Doesn't seem like a duplicate to me. The version is different: he was using 3.13.0a6, mine is beta 1, and he had problems with the Fibonacci script, which works fine for me. @Eclips4
Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.
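A minimal sketch of what that contention claim means (my own hypothetical micro-benchmark, not code from this thread): several threads reading the same list object repeatedly touch that list's per-object lock and the reference counts of its shared elements, whereas per-thread copies avoid both. On a GIL build the two variants behave about the same; on a free-threaded build the shared case may scale worse.

```python
import threading
import time

def reader(data, n_iters, out, idx):
    # Each data[i] access bumps the element's refcount; on a free-threaded
    # build those atomic updates on shared objects can contend across threads.
    total = 0.0
    for _ in range(n_iters):
        for i in range(len(data)):
            total += data[i]
    out[idx] = total

def run(per_thread_lists, n_iters=200):
    out = [0.0] * len(per_thread_lists)
    threads = [threading.Thread(target=reader, args=(lst, n_iters, out, i))
               for i, lst in enumerate(per_thread_lists)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

shared = [float(i) for i in range(1000)]

# Shared: every thread reads the very same list object.
t0 = time.perf_counter()
shared_totals = run([shared] * 4)
shared_secs = time.perf_counter() - t0

# Private: each thread gets its own list with freshly allocated floats.
private = [[x + 0.0 for x in shared] for _ in range(4)]
t0 = time.perf_counter()
private_totals = run(private)
private_secs = time.perf_counter() - t0

print(shared_totals == private_totals)  # True: identical sums either way
```

The timing difference (`shared_secs` vs `private_secs`) only shows up on a free-threaded build; the point of the sketch is just where the contention comes from.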
OK, so just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same if it were fully implemented?
This is the expected behavior -- it is not changing.
OK, thank you. Knowing this, I changed the code so that it doesn't use a shared "result" list but thread-local results, which are then combined. It doesn't really seem to have any effect. Am I missing something?
*Screenshot: timings for the same code across different Python versions (3.13 is the free-threading build)*
Also, I don't know if it helps, but I've noticed that increasing the number of threads seems to make things worse. For example, using 2 threads I got ~20 seconds, using 8 I got ~40 seconds, and using 16 I got ~50 seconds. *Screenshot: timings for different numbers of threads (all with 3.13.0b1)*
This is the updated code, as you can see it doesn't use shared lists anymore but every thread creates a local list which it returns and then all the lists are combined:
```python
import time
import random
from concurrent.futures import ThreadPoolExecutor

def multiply_row(A, B, row_index, local_result):
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        sum = 0
        for k in range(num_columns_A):
            sum += A[row_index][k] * B[k][j]
        local_result[row_index][j] = sum

def parallel_matrix_multiplication(a, b, start_row, end_row):
    local_result = [[0] * len(b[0]) for _ in range(len(a))]
    for row_index in range(start_row, end_row):
        multiply_row(a, b, row_index, local_result)
    return local_result

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = []
    for _ in range(num_rows):
        result.append([0] * len(b[0]))
    row_chunk = num_rows // num_threads
    futures = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for i in range(num_threads):
            start_row = i * row_chunk
            end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
            future = executor.submit(parallel_matrix_multiplication, a, b, start_row, end_row)
            futures.append(future)
        results = [future.result() for future in futures]
    # Combine local results into the final result matrix
    for local_result in results:
        for i in range(num_rows):
            for j in range(len(b[0])):
                result[i][j] += local_result[i][j]
    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads
    start = time.perf_counter()
    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()
```
Sorry guys, where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compiling it myself.
I believe you must download it from here https://www.python.org/downloads/release/python-3130b2/ and select the experimental features you want from the customization options during installation
@xbit18 any embeddable builds?
I believe there are; just scroll through the options.
> This is the updated code, as you can see it doesn't use shared lists anymore but every thread creates a local list which it returns and then all the lists are combined
It looks like the two random matrices (a and b) are still shared across threads. Even though it's read-only, that would still create lock contention, IIUC.
You are completely right, and I realised this half a second before reading your reply; I feel so stupid. I just had to pass "a.copy()" in the executor.submit() call and the problem disappeared. The performance of the free-threading build is still only on par with, not better than, the non-free-threading ones, but at least it's not worse by miles. I don't know if it matters, but the GIL version of 3.13.0b1 performs worse than the other versions:
| Python version | Matrix multiplication completed in (s) |
| --- | --- |
| 3.9.10 | 1.731310041 |
| nogil-3.9.10-1 | 0.590554667 |
| 3.9.18 | 1.6376172500000001 |
| 3.10.13 | 1.6486839579993102 |
| 3.11.8 | 1.0865809580009227 |
| 3.12.2 | 1.072277084000234 |
| 3.13.0b1 without GIL | 1.3272077920009906 |
| 3.13.0b1 with GIL | 2.8077587080006197 |
Oh, something to mention: all the Python versions I'm comparing to 3.13 were installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 wasn't, because it wouldn't build with them. Perhaps this could be the cause?
> I feel so stupid.
This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.
> I don't know if it matters but the GIL version of 3.13.0b1 performs worse than the other versions.
This is an interesting finding, and something my team (the Faster CPython team) tracks pretty closely... but since we are mostly interested in single-threaded performance, we don't have any benchmarks that are CPU-bound, multithreaded-with-a-GIL like this. It's interesting to know there might be a regression there.
> Oh, something to mention: all the python versions which I'm comparing to 3.13 have been installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 hasn't because it wouldn't build with them. Perhaps it could be the cause of this?
It could be, but the difference in this case seems much larger than I would expect from PGO/LTO. I'm definitely intrigued enough to look into it further, and I'll link back here if anything comes of it.
> I just had to do "a.copy()" in the executor.submit()

`a` is a list of lists of numbers. `a.copy()` makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.

https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203
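A hedged sketch of a copy that also duplicates the element objects (my own illustration of the point above, not necessarily what the linked gist does): `copy.deepcopy` treats numbers as atomic and returns the very same float objects, so to give each thread fully private data you have to force fresh allocations, e.g. by adding `0.0` to each element.

```python
import copy

def private_matrix(m):
    # x + 0.0 makes CPython allocate a new float object for each element,
    # so threads no longer share (and contend on) the originals' refcounts.
    return [[x + 0.0 for x in row] for row in m]

a = [[1.5, 2.5], [3.5, 4.5]]

shallow = a.copy()        # new outer list, same inner row objects
deep = copy.deepcopy(a)   # new rows, but still the *same* float objects
fresh = private_matrix(a) # new rows and freshly allocated floats

print(shallow[0] is a[0])  # True: rows still shared
print(deep[0] is a[0])     # False: rows copied
print(fresh == a)          # True: values unchanged
```

The names here (`private_matrix`) are mine, chosen for the example.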
> This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.
No, absolutely not, don't worry! I was joking, because I've been struggling with this for like two days and the solution struck me like a brick because it was really obvious!
> `a` is a list of lists of numbers. `a.copy()` makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.
Oh this is really helpful thank you! I'm running some other tests with the new code and it does actually seem to make a difference! For example these are some partial results with only 3 versions (size=1000 and threads=8):
| Python version | Matrix multiplication completed in (s) |
| --- | --- |
| 3.11.8 | 72.75900166700012 |
| 3.12.2 | 70.79678879200219 |
| 3.13.0b1 without GIL | 52.69710216600288 |
| 3.13.0b1 with GIL | 151.24306100000103 |
It seems clear that the larger the matrix gets, the larger the impact of free threading. Interestingly, as I said, the free-threading build WITH the GIL (`-Xgil=1`) does a lot worse.
> Interestingly, as I said, the free-threading build WITH the GIL (-Xgil=1) does a lot worse.
This is not surprising. When you have a free-threaded build and then run it with the GIL on (`-Xgil=1`), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the `-Xgil=1` flag.
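For completeness, a small sketch of how to check at runtime which mode you actually got (CPython 3.13 added `sys._is_gil_enabled()`; the guard below is for older versions, where the GIL is always on):

```python
import sys

def gil_enabled():
    # sys._is_gil_enabled() exists from CPython 3.13 on; on a free-threaded
    # build started with -Xgil=1 it reports True even though the interpreter
    # was compiled with --disable-gil.
    if hasattr(sys, "_is_gil_enabled"):
        return sys._is_gil_enabled()
    return True  # pre-3.13 interpreters always run with the GIL

print(gil_enabled() in (True, False))  # True on any version
```

Printing this alongside the timings would make clear which configuration produced each number.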
Thank you very much for this, I'll gladly check the non free threading version as well!
Bug report
Bug description:
Hello, I'm writing a thesis on free-threading Python, and thus I'm testing 3.13.0b1 with --disable-gil. I installed it with pyenv using this command
I didn't specify --enable-optimizations and --with-lto because with those the build would fail. Now, I'm writing a benchmark to compare the free threading python with past versions of normal python and even with the 3.9.10 nogil python. Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the matrix into rows and distributes the rows to a specified number of threads. This is the complete code:
When I run this code with these versions of Python (3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, 3.12.2), the maximum running time is ~13 seconds with normal 3.9.10 and the minimum is ~5 seconds with nogil 3.9.10. When I run it with 3.13.0b1, the time skyrockets to ~48 seconds. I tried using cProfile to profile the code, but under 3.13 it freezes and never outputs anything (with other versions it works); the CPU stays at 100% usage, which makes me think it isn't using multiple cores (nogil 3.9 goes to >600% usage), and it never stops unless I kill the process.
The basic Fibonacci test works like a charm, so I know the --disable-gil build succeeded.
All of this is done on a Macbook Air M1 with 16 GB of RAM and 8 cpu cores.
CPython versions tested on:
3.9, 3.10, 3.11, 3.12, 3.13
Operating systems tested on:
macOS