python / cpython

The Python programming language
https://www.python.org/

3.13.0b1 Performance Issue with Free Threading build #120040

Open xbit18 opened 1 month ago

xbit18 commented 1 month ago

Bug report

Bug description:

Hello, I'm writing a thesis on free-threaded Python, so I'm testing 3.13.0b1 built with --disable-gil. I installed it with pyenv using this command:

env PYTHON_CONFIGURE_OPTS='--disable-gil' pyenv install 3.13.0b1

I didn't specify --enable-optimizations or --with-lto because the build fails with those options. I'm now writing a benchmark to compare free-threaded Python with past versions of standard Python and with the 3.9.10 nogil fork. Here's the problem. The benchmark is a simple matrix-matrix multiplication script that splits the first matrix into rows and distributes the rows across a specified number of threads. This is the complete code:

import threading
import time
import random

def multiply_row(A, B, row_index, result):
    # Compute the row result
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        total = 0  # "total" rather than "sum" to avoid shadowing the built-in
        for k in range(num_columns_A):
            total += A[row_index][k] * B[k][j]
        result[row_index][j] = total

def parallel_matrix_multiplication(a, b, result, row_indices):
    for row_index in row_indices:
        multiply_row(a, b, row_index, result)

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    row_chunk = num_rows // num_threads

    threads = []
    for i in range(num_threads):
        start_row = i * row_chunk
        end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
        thread = threading.Thread(target=parallel_matrix_multiplication, args=(a, b, result, range(start_row, end_row)))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size
    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)
    num_threads = 8  # Define number of threads

    start = time.perf_counter()

    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()

When I run this code with these Python versions (3.9.10, nogil-3.9.10, 3.10.13, 3.11.8, 3.12.2), the maximum running time is ~13 seconds with standard 3.9.10 and the minimum is ~5 seconds with nogil 3.9.10. When I run it with 3.13.0b1, the time skyrockets to ~48 seconds. I tried profiling the code with cProfile, but under 3.13 it freezes and never outputs anything (it works with the other versions): the CPU sits at 100% usage, which makes me think it isn't using multiple cores (nogil 3.9 goes above 600%), and it never stops unless I kill the process.

The basic Fibonacci test works like a charm, so I know the --disable-gil build succeeded.

All of this was done on a MacBook Air M1 with 16 GB of RAM and 8 CPU cores.

CPython versions tested on:

3.9, 3.10, 3.11, 3.12, 3.13

Operating systems tested on:

macOS

Eclips4 commented 1 month ago

Duplicate of #118749

xbit18 commented 1 month ago

It doesn't seem like a duplicate to me. The version is different (he was using 3.13.0a6, mine is beta 1), and he had problems with the Fibonacci script, which works fine for me. @Eclips4

colesbury commented 1 month ago

Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.
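
To make this concrete, here is a hypothetical mini-sketch (names are illustrative, not from the benchmark above) contrasting the two write patterns: many threads filling one shared list versus each thread building a private list. On a GIL build both produce the same result with no visible difference; on a free-threaded build the shared-list variant is where the per-list lock and reference-count contention shows up.

```python
import threading

def fill_shared(shared, start, end, width):
    # every thread writes into the same list object -> contended on a
    # free-threaded build (list lock + refcount fields of shared objects)
    for i in range(start, end):
        shared[i] = [i] * width

def fill_local(start, end, width):
    # each thread builds and returns a private list -> no shared writes
    return [[i] * width for i in range(start, end)]

def run_shared(rows, width, num_threads):
    shared = [None] * rows
    chunk = rows // num_threads
    threads = [
        threading.Thread(
            target=fill_shared,
            args=(shared, t * chunk,
                  rows if t == num_threads - 1 else (t + 1) * chunk,
                  width))
        for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```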

xbit18 commented 1 month ago

Yeah, you are going to encounter contention on the shared lists: both on the per-list locks and the reference count fields.

Ok, so just to be clear: is this expected behavior because the free-threading implementation is still incomplete, or would it behave the same even when fully implemented?

colesbury commented 1 month ago

This is the expected behavior -- it is not changing.

xbit18 commented 1 month ago

Ok, thank you. Knowing this, I changed the code so that it doesn't use a shared "result" list but thread-local results which are then combined. It doesn't seem to have any effect, though. Am I missing something?

Screenshot of timings for the same code executed with different Python versions (3.13 is free-threaded):

image

Also, I don't know whether it helps, but I've noticed that increasing the number of threads seems to make things worse: with 2 threads I got ~20 seconds, with 8 I got ~40 seconds, and with 16 I got ~50 seconds. Screenshot of timings for different thread counts (all with 3.13.0b1):

image

This is the updated code. As you can see, it no longer uses shared lists; every thread creates a local list which it returns, and the lists are then combined:

import time
import random
from concurrent.futures import ThreadPoolExecutor

def multiply_row(A, B, row_index, local_result):
    num_columns_B = len(B[0])
    num_columns_A = len(A[0])
    for j in range(num_columns_B):
        total = 0  # "total" rather than "sum" to avoid shadowing the built-in
        for k in range(num_columns_A):
            total += A[row_index][k] * B[k][j]
        local_result[row_index][j] = total

def parallel_matrix_multiplication(a, b, start_row, end_row):
    local_result = [[0] * len(b[0]) for _ in range(len(a))]

    for row_index in range(start_row, end_row):
        multiply_row(a, b, row_index, local_result)

    return local_result

def multi_threaded_matrix_multiplication(a, b, num_threads):
    num_rows = len(a)
    result = [[0] * len(b[0]) for _ in range(num_rows)]
    row_chunk = num_rows // num_threads

    futures = []
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        for i in range(num_threads):
            start_row = i * row_chunk
            end_row = (i + 1) * row_chunk if i != num_threads - 1 else num_rows
            future = executor.submit(parallel_matrix_multiplication, a, b, start_row, end_row)
            futures.append(future)

    results = [future.result() for future in futures]

    # Combine local results into the final result matrix
    for local_result in results:
        for i in range(num_rows):
            for j in range(len(b[0])):
                result[i][j] += local_result[i][j]

    return result

# Helper function to create a random matrix
def create_random_matrix(rows, cols):
    return [[random.random() for _ in range(cols)] for _ in range(rows)]

def main():
    size = 500  # Define matrix size

    a = create_random_matrix(size, size)
    b = create_random_matrix(size, size)

    num_threads = 8 # Define number of threads

    start = time.perf_counter()

    result = multi_threaded_matrix_multiplication(a, b, num_threads)
    print("Matrix multiplication completed.", time.perf_counter() - start, "seconds.")

if __name__ == "__main__":
    main()

iperov commented 1 month ago

sorry guys,

where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compilation

xbit18 commented 1 month ago

sorry guys,

where can I download a JIT+noGIL build for Windows for testing? I don't want to mess with compilation

I believe you can download it from here: https://www.python.org/downloads/release/python-3130b2/ and select the experimental features you want from the customization options during installation.

iperov commented 1 month ago

@xbit18 any embeddable builds ?

xbit18 commented 1 month ago

@xbit18 any embeddable builds ?

I believe there are; just scroll through the options.

mdboom commented 1 month ago

This is the updated code, as you can see it doesn't use shared lists anymore but every thread creates a local list which it returns and then all the lists are combined

It looks like the two random matrices (a and b) are still shared across threads. Even though they're read-only, that would still create lock contention, IIUC.
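
One way to act on this observation (a sketch, not the exact fix used later in this thread) is to hand each worker its own copies of the inputs at submit time. Here `worker` is a simplified stand-in for parallel_matrix_multiplication, and it returns only its own rows, so the results can be concatenated instead of summed. Note that copy.deepcopy unshares the list structure, but CPython returns atomic immutables such as floats as-is, so the number objects themselves remain shared.

```python
import copy
from concurrent.futures import ThreadPoolExecutor

def worker(a, b, start_row, end_row):
    # compute just this worker's slice of rows of a @ b
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(start_row, end_row)]

def multiply_with_private_inputs(a, b, num_threads):
    rows = len(a)
    chunk = rows // num_threads
    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        futures = [
            ex.submit(worker,
                      copy.deepcopy(a), copy.deepcopy(b),  # per-thread copies
                      t * chunk,
                      rows if t == num_threads - 1 else (t + 1) * chunk)
            for t in range(num_threads)
        ]
    # each worker computed disjoint rows, so concatenation rebuilds the result
    out = []
    for f in futures:
        out.extend(f.result())
    return out
```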

xbit18 commented 1 month ago

It looks like the two random matrices (a and b) are still shared across threads. Even though they're read-only, that would still create lock contention, IIUC.

You're completely right, and I realised this half a second before reading your reply; I feel so stupid. I just had to call a.copy() in the executor.submit() call and the problem disappeared. The performance of the free-threaded build is still only on par with, not better than, the non-free-threaded ones, but at least it's no longer worse by miles. I don't know if it matters, but the with-GIL run of 3.13.0b1 performs worse than the other versions:

3.9.10
Matrix multiplication completed. 1.731310041 seconds.
nogil-3.9.10-1
Matrix multiplication completed. 0.590554667 seconds.
3.9.18
Matrix multiplication completed. 1.6376172500000001 seconds.
3.10.13
Matrix multiplication completed. 1.6486839579993102 seconds.
3.11.8
Matrix multiplication completed. 1.0865809580009227 seconds.
3.12.2
Matrix multiplication completed. 1.072277084000234 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 1.3272077920009906 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 2.8077587080006197 seconds.

xbit18 commented 1 month ago

Oh, something to mention: all the Python versions I'm comparing against 3.13 were installed with pyenv using the --with-lto and --enable-optimizations options, while 3.13 wasn't, because it wouldn't build with them. Could that be the cause?

mdboom commented 1 month ago

I feel so stupid.

This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.

I don't know if it matters but the GIL version of 3.13.0b1 performs worse than the other versions.

This is an interesting finding, and something my team (the Faster CPython team) tracks pretty closely... but since we are mostly interested in single-threaded performance, we don't have any benchmarks that are CPU-bound, multithreaded-with-a-GIL like this. It's interesting to know there might be a regression there.

Oh, something to mention: all the python versions which I'm comparing to 3.13 have been installed with pyenv using the "--with-lto" and "--enable-optimizations" options, while 3.13 hasn't because it wouldn't build with them. Perhaps it could be the cause of this?

It could be, but the difference in this case seems much larger than I would expect from PGO/LTO. I'm definitely intrigued enough to look into it further, and I'll link back here if anything comes of it.

colesbury commented 1 month ago

I just had to do "a.copy()" in the executor.submit()

a is a list of lists of numbers. a.copy() makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.

https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203
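
A quick illustration of the distinction being drawn here (I haven't reproduced the gist itself; the last step is just one hypothetical way to rebuild the values). a.copy() copies only the outer list, so the row objects inside are still shared; copy.deepcopy copies the rows too, but as a CPython detail it returns atomic immutables such as floats unchanged, so fully unsharing the numbers needs an extra step.

```python
import copy

a = [[1.5, 2.5], [3.5, 4.5]]

shallow = a.copy()
assert shallow is not a       # new outer list...
assert shallow[0] is a[0]     # ...but the inner rows are the same objects

deep = copy.deepcopy(a)
assert deep[0] is not a[0]    # rows are now private copies
assert deep[0][0] is a[0][0]  # the float objects themselves are still shared

# one hypothetical way to rebuild the numbers as fresh float objects:
private = [[x + 0.0 for x in row] for row in a]
assert private == a           # same values, freshly created objects
```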

xbit18 commented 1 month ago

This is not stupid! I apologize if I made you feel that way. Showing an interest in this stuff, and discussing it in the open is extremely valuable to making this all better, so thank you.

No, absolutely not, don't worry! I was joking because I'd been struggling with this for two days and the solution struck me like a brick because it was so obvious!

xbit18 commented 1 month ago

a is a list of lists of numbers. a.copy() makes a shallow copy. If you want to avoid reference count contention you need to copy the inner lists and the numbers themselves.

https://gist.github.com/colesbury/e2b0e050556da5cb57987d334df87203

Oh, this is really helpful, thank you! I'm running some more tests with the new code and it does seem to make a difference. For example, here are some partial results with only 3 versions (size=1000 and threads=8):

3.11.8
Matrix multiplication completed. 72.75900166700012 seconds.
3.12.2
Matrix multiplication completed. 70.79678879200219 seconds.
3.13.0b1 without GIL
Matrix multiplication completed. 52.69710216600288 seconds.
3.13.0b1 with GIL
Matrix multiplication completed. 151.24306100000103 seconds.

It seems clear that the larger the matrix gets, the larger the impact of free threading. Interestingly, as I said, the free-threaded build WITH the GIL (-Xgil=1) does a lot worse.

mdboom commented 1 month ago

Interestingly, as I said, the free-threading build WITH the GIL (-Xgil=1) does a lot worse.

This is not surprising. When you have a free-threaded build and then run it with the GIL on (-Xgil=1), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the -Xgil=1 flag.
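
For anyone benchmarking across these configurations, here is a small helper for telling the three cases apart at runtime. It relies on the 3.13 free-threading APIs (the Py_GIL_DISABLED build config var and sys._is_gil_enabled()); on older versions it falls back to reporting a standard build.

```python
import sys
import sysconfig

def describe_gil():
    # Py_GIL_DISABLED is set for free-threaded (--disable-gil) builds on 3.13+;
    # older versions return None, which bool() turns into False
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    if not free_threaded_build:
        return "standard build (GIL always on)"
    # sys._is_gil_enabled() reports the runtime state, which -Xgil=1 or
    # PYTHON_GIL=1 can flip even on a free-threaded build
    if getattr(sys, "_is_gil_enabled", lambda: True)():
        return "free-threaded build, GIL enabled at runtime"
    return "free-threaded build, GIL disabled"

print(describe_gil())
```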

xbit18 commented 1 month ago

This is not surprising. When you have a free-threaded build and then run it with the GIL on (-Xgil=1), it has to check whether to use the GIL at runtime, in addition to adding a bunch of locks that aren't really strictly necessary when the GIL is turned on (unless I'm misunderstanding, and I might be). In other words, a free-threading build with the GIL turned on is intended to be correct, but I don't think it's intended to give the best performance. If you really want to do a performance measurement of 3.13 with the GIL on, you should use a non-free-threading build rather than the -Xgil=1 flag.

Thank you very much for this, I'll gladly check the non-free-threaded build as well!