szalpal commented 4 months ago

Bug report

Bug description:

Hi :) This issue is essentially a re-open of https://github.com/python/cpython/issues/118153, but with some more experiments added.

I'm testing the free-threaded Python build. I'm running a simple test (code below), which triggers a computationally heavy function across CPU cores using threading module. Time measurements of the script are the following:

num_threads = 2  -> 0.68 s
num_threads = 8  -> 5.48 s
num_threads = 18 -> 12.1 s

In ideal world, I believe I could expect the three numbers above to be the same (or comparable). I've also gathered the profiles of the experiment:

num_threads = 2

num_threads = 8

num_threads = 18 (showing only some threads, but the picture illustrates the issue)

As we can see, the CPU utilisation decreases with the number of CPU threads used (almost 99% for nt=2, about 75% for nt=8 and ~40% for nt=18). We also see increased CPU core switching frequency. My guess is that the reason of decreased CPU utilisation is the overhead on the threading module.

Running some further experiments, I've run the program on a high number of threads (thus according to previous observation, the CPU utilisation should be low), but with both busy and idle wait on 10 threads and actual sin*cos computation on 2 threads. In both of these scenarios, we observe high CPU utilisation on worker threads:

idle wait (implemented with time.sleep)

busy wait (implemented with while loop)

Interestingly, zooming-in to be CPU utilisation profile (the "slow" case) we do see that there are parts in the timeline, where CPU is saturated and all threads are working in parallel. However, there are also periods, where CPU utilisation is scattered:

Lastly, as a sanity check, the same operation implemented in C++:

May this be a bug inside threading module? I've went through PEP 703, but I've seen no mention about this part. If the overhead on threading is the root cause of lowered utilisation, may this issue be addressed?

@colesbury , tagging you here since I believe you'd know most about the free-threaded Python build. Should this issue be added to the list in https://github.com/python/cpython/issues/108219?

Testing configuration:

Ubuntu 22.04
Python 3.13 ToT
scaling_governor - performance

CPU(s):                  36
  On-line CPU(s) list:   0-35
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz

CPython build command:

./configure --disable-gil --enable-optimizations && make -j && make install

Testing script:

import math
import time
import nvtx
import threading

def computational_heavy(iterations):
    val = 0
    for i in range(1, iterations):
        val += math.sin(i) * math.cos(i)
    return val

def test(thread_id, iterations=1000000):
    with nvtx.annotate("Calculation"):
        computational_heavy(iterations)

num_threads = 18

threads = [
    threading.Thread(target=test, name=f"Thread{i}", args=(i,))
    for i in range(num_threads)
]
start = time.perf_counter_ns()
for t in threads:
    t.start()
for t in threads:
    t.join()
stop = time.perf_counter_ns()
print(f"Elapsed time {stop-start} ns")

CPython versions tested on:

CPython main branch

Operating systems tested on:

Linux

szalpal commented 4 months ago

Referring to @JelleZijlstra 's https://github.com/python/cpython/issues/118153#issuecomment-2069871047:

Note that the free-threaded build is still experimental and not ready for production use. You shouldn't yet expect performance wins from it.

Yes, I am aware of that. My intention is to report the issue and possibly track the resolution of it.

colesbury commented 4 months ago

I ran this on main without the nvtx.annotate and I'm seeing

num_threads = 1 -> 0.34 s num_threads = 2 -> 0.36 s num_threads = 8 -> 0.43 s num_threads = 18 -> 0.44 s

And If I disable turbo I get:

num_threads = 1 -> 0.68 s num_threads = 18 -> 0.77 s

szalpal commented 4 months ago

Thank you for checking this. These are interesting results, however I couldn't get to this state on my end.

I've put together minimal repro (below). Did I miss some configuration flag for a free-threaded environment?

Setup:

CPU(s):                  16
  On-line CPU(s) list:   0-15
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 9124 16-Core Processor
scaling_governor: performance

Repro:

$ docker build -t python3.13-nogil .
$ docker run -it -v $(pwd):/test -w /test python3.13-nogil python3 --version && python3 test.py
Python 3.13.0a4
nt=4     Elapsed time 394468897 ns
nt=14    Elapsed time 1399073578 ns
nt=4     Elapsed time 397696421 ns
nt=14    Elapsed time 1401378897 ns
nt=4     Elapsed time 396369177 ns
nt=14    Elapsed time 1389147085 ns

Dockerfile:

FROM ubuntu:22.04 as base

ARG DEBIAN_FRONTEND=noninteractive
ENV PYTHON_GIL=0

RUN apt-get update && \
    apt-get install -y wget autoconf autogen git pkg-config libbz2-dev liblzma-dev libgdbm-dev     \
                       libsqlite3-dev uuid-dev libreadline-dev libssl-dev zlib1g-dev               \
                       libncurses5-dev libncursesw5-dev libffi-dev libdb-dev build-essential

WORKDIR /opt

RUN git clone --recursive -j"$(grep ^processor /proc/cpuinfo | wc -l)" https://github.com/python/cpython.git && cd cpython && \
    git checkout v3.13.0a4 && \
    mkdir build && cd build && ../configure --disable-gil --enable-optimizations && \
    make -j"$(grep ^processor /proc/cpuinfo | wc -l)" && make install

test.py:

import math
import time
import threading

def computational_heavy(iterations):
    val = 0
    sin = math.sin
    cos = math.cos
    for i in range(1, iterations):
        val += sin(i) * cos(i)
    return val

def test(thread_id, iterations=1000000):
    computational_heavy(iterations)

num_threads = [4, 14, 4, 14, 4, 14]

for nt in num_threads:
    threads = [
        threading.Thread(target=test, name=f"Thread{i}", args=(i,))
        for i in range(nt)
    ]
    start = time.perf_counter_ns()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop = time.perf_counter_ns()
    print(f"{nt=}. Elapsed time {stop-start} ns")

colesbury commented 4 months ago

You are using Python 3.13a4 (from February). Many of the scaling bottlenecks were addressed in the last week or two. You need to use Python 3.13b1 or newer.

szalpal commented 4 months ago

Thank you for pointing this out. I've changed the environment to the beta release (had to add -i flag to make, since some tests fail), but unfortunately there's still difference in performance. Do you believe I missed something else in my environment configuration?

Dockerfile:

FROM ubuntu:22.04 as base

ARG DEBIAN_FRONTEND=noninteractive
ENV PYTHON_GIL=0

RUN apt-get update && \
    apt-get install -y wget autoconf autogen git pkg-config libbz2-dev liblzma-dev libgdbm-dev     \
                       libsqlite3-dev uuid-dev libreadline-dev libssl-dev zlib1g-dev               \
                       libncurses5-dev libncursesw5-dev libffi-dev libdb-dev build-essential

WORKDIR /opt

# Build and install the latest CPython
RUN git clone --recursive -j"$(grep ^processor /proc/cpuinfo | wc -l)" https://github.com/python/cpython.git && cd cpython && \
    git checkout v3.13.0b1 && \
    mkdir build && cd build && ../configure --disable-gil --enable-optimizations && \
    make -j"$(grep ^processor /proc/cpuinfo | wc -l)" -i && make install

$ docker run -it -v $(pwd):/test -w /test python3.13-nogil python3 --version && python3 test.py
Python 3.13.0b1
nt=4     Elapsed time 468306427 ns
nt=14    Elapsed time 1671326595 ns
nt=4     Elapsed time 480656613 ns
nt=14    Elapsed time 1644104810 ns
nt=4     Elapsed time 470198085 ns
nt=14    Elapsed time 1655974063 ns

colesbury commented 4 months ago

Did you try it without the nvtx.annotate?

szalpal commented 4 months ago

Yes, the minimal repro I've provided above contains only Python (not even numpy there).

Attaching it once again here, just to keep everything in one place:

Dockerfile

FROM ubuntu:22.04 as base

ARG DEBIAN_FRONTEND=noninteractive
ENV PYTHON_GIL=0

RUN apt-get update && \
    apt-get install -y wget autoconf autogen git pkg-config libbz2-dev liblzma-dev libgdbm-dev     \
                       libsqlite3-dev uuid-dev libreadline-dev libssl-dev zlib1g-dev               \
                       libncurses5-dev libncursesw5-dev libffi-dev libdb-dev build-essential

WORKDIR /opt

# Build and install the latest CPython
RUN git clone --recursive -j"$(grep ^processor /proc/cpuinfo | wc -l)" https://github.com/python/cpython.git && cd cpython && \
    git checkout v3.13.0b1 && \
    mkdir build && cd build && ../configure --disable-gil --enable-optimizations && \
    make -j"$(grep ^processor /proc/cpuinfo | wc -l)" -i && make install

test.py

import math
import time
import threading

def computational_heavy(iterations):
    val = 0
    sin = math.sin
    cos = math.cos
    for i in range(1, iterations):
        val += sin(i) * cos(i)
    return val

def test(thread_id, iterations=1000000):
    computational_heavy(iterations)

num_threads = [4, 14, 4, 14, 4, 14]

for nt in num_threads:
    threads = [
        threading.Thread(target=test, name=f"Thread{i}", args=(i,))
        for i in range(nt)
    ]
    start = time.perf_counter_ns()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop = time.perf_counter_ns()
    print(f"{nt=}. Elapsed time {stop-start} ns")

$ docker build -t python3.13-nogil .
$ docker run -it -v $(pwd):/test -w /test python3.13-nogil python3 --version && python3 test.py
Python 3.13.0b1
nt=4     Elapsed time 468306427 ns
nt=14    Elapsed time 1671326595 ns
nt=4     Elapsed time 480656613 ns
nt=14    Elapsed time 1644104810 ns
nt=4     Elapsed time 470198085 ns
nt=14    Elapsed time 1655974063 ns

python / cpython

CPU utilisation drop when increasing number of threads with `threading` #118649

Bug report

Bug description:

CPython versions tested on:

Operating systems tested on: