Avoid releasing the GIL in nonblocking socket operations

2f5601e0-9cbf-47ae-a123-6b01e62b2a28 commented 2 years ago

BPO	45819
Nosy	@rhettinger, @tiran, @asvetlov, @1st1, @jakirkham, @jcrist
PRs	python/cpython#29579
Files	bench.py

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.11', 'expert-asyncio', 'expert-IO', 'performance'] title = 'Avoid releasing the GIL in nonblocking socket operations' updated_at = user = 'https://github.com/jcrist' ``` bugs.python.org fields: ```python activity = actor = 'jakirkham' assignee = 'none' closed = False closed_date = None closer = None components = ['IO', 'asyncio'] creation = creator = 'jcristharif' dependencies = [] files = ['50443'] hgrepos = [] issue_num = 45819 keywords = ['patch'] message_count = 3.0 messages = ['406422', '406431', '406436'] nosy_count = 6.0 nosy_names = ['rhettinger', 'christian.heimes', 'asvetlov', 'yselivanov', 'jakirkham', 'jcristharif'] pr_nums = ['29579'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'performance' url = 'https://bugs.python.org/issue45819' versions = ['Python 3.11'] ```

2f5601e0-9cbf-47ae-a123-6b01e62b2a28 commented 2 years ago

In https://bugs.python.org/issue7946 an issue with how the current GIL interacts with mixing IO and CPU bound work. Quoting this issue:

when an I/O bound thread executes an I/O call, it always releases the GIL. Since the GIL is released, a CPU bound thread is now free to acquire the GIL and run. However, if the I/O call completes immediately (which is common), the I/O bound thread immediately stalls upon return from the system call. To get the GIL back, it now has to go through the timeout process to force the CPU-bound thread to release the GIL again.

This issue can come up in any application where IO and CPU bound work are mixed (we've found it to be a cause of performance issues in https://dask.org for example). Fixing the general problem is tricky and likely requires changes to the GIL's internals, but in the specific case of mixing asyncio running in one thread and CPU work happening in background threads, there may be a simpler fix - don't release the GIL if we don't have to.

Asyncio relies on nonblocking socket operations, which by definition shouldn't block. As such, releasing the GIL shouldn't be needed for many operations (send, recv, ...) on socket.socket objects provided they're in nonblocking mode (as suggested in https://bugs.python.org/issue7946#msg99477). Likewise, dropping the GIL can be avoided when calling select on selectors.BaseSelector objects with a timeout of 0 (making it a non-blocking call).

I've made a patch (https://github.com/jcrist/cpython/tree/keep-gil-for-fast-syscalls) with these two changes, and run a benchmark (attached) to evaluate the effect of background threads with/without the patch. The benchmark starts an asyncio server in one process, and a number of clients in a separate process. A number of background threads that just spin are started in the server process (configurable by the -t flag, defaults to 0), then the server is loaded to measure the RPS.

Here are the results:

# Main branch
$ python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
16324.2 RPS
$ python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
97.6 RPS
$ python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
31308.0 RPS
$ python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 1.52e+07 cycles/second
96.2 RPS
$ python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
47169.6 RPS
$ python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 1.54e+07 cycles/second
95.4 RPS

# With this patch
$ ./python bench.py -c1 -t0
Benchmark: clients = 1, msg-size = 100, background-threads = 0
18201.8 RPS
$ ./python bench.py -c1 -t1
Benchmark: clients = 1, msg-size = 100, background-threads = 1
Spinner spun 9.03e+06 cycles/second
194.6 RPS
$ ./python bench.py -c2 -t0
Benchmark: clients = 2, msg-size = 100, background-threads = 0
34151.8 RPS
$ ./python bench.py -c2 -t1
Benchmark: clients = 2, msg-size = 100, background-threads = 1
Spinner spun 8.72e+06 cycles/second
729.6 RPS
$ ./python bench.py -c10 -t0
Benchmark: clients = 10, msg-size = 100, background-threads = 0
53666.6 RPS
$ ./python bench.py -c10 -t1
Benchmark: clients = 10, msg-size = 100, background-threads = 1
Spinner spun 5e+06 cycles/second
21838.2 RPS

A few comments on the results:

On the main branch, any GIL contention sharply decreases the number of RPS an asyncio server can handle, regardless of the number of clients. This makes sense - any socket operation will release the GIL, and the server thread will have to wait to reacquire it (up to the switch interval), rinse and repeat. So if every request requires 1 recv and 1 send, a server with background GIL contention is stuck at a max of 1 / (2 * switchinterval) or 200 RPS with default configuration. This effectively prioritizes the background thread over the IO thread, since the IO thread releases the GIL very frequently and the background thread never does.
With the patch, we still see a performance degradation, but the degradation is less severe and improves with the number of clients. This is because with these changes the asyncio thread only releases the GIL when doing a blocking poll for new IO events (or when the switch interval is hit). With low load (1 client), the IO thread becomes idle more frequently and releases the GIL. Under higher load though the event loop frequently still has work to do at the end of a cycle and issues a selector.select call with a 0 timeout (nonblocking), avoiding releasing the GIL at all during that loop (note the nonlinear effect of adding more clients). Since the IO thread still releases the GIL sometimes, the background thread still holds the GIL a larger percentage of the time than the IO thread, but the difference between them is less severe than without this patch.

I have also tested this patch on a Dask cluster running some real-world problems and found that it did improve performance where IO was throttled due to GIL contention.

rhettinger commented 2 years ago

+1 There is almost no upside for the current behavior.

tiran commented 2 years ago

Do POSIX and Windows APIs guarantee that operations on a nonblocking socket can never ever block?

We cannot safely keep the GIL unless you can turn "nonblocking socket operations by definition shouldn't block" into "standard guarantees that nonblocking socket operations never block".

gvanrossum commented 1 year ago

It does seem interesting. Maybe this idea can be implemented in a 3rd party library like uvloop to see if there are hidden catches? @1st1

python / cpython

Avoid releasing the GIL in nonblocking socket operations #89977