socketry / async

An awesome asynchronous event-driven reactor for Ruby.
MIT License

Worker pool for efficient handling of `blocking_operation_wait`. #359

Closed: ioquatix closed this 5 days ago

ioquatix commented 1 week ago

See https://github.com/socketry/async/pull/352 for context.

The cost of creating threads and the usage of nogvl create inefficiencies.

rb_nogvl should really only be used when the amount of work to be done is greater than some scheduling quantum. In practice, that's hard to achieve, so we also want to minimise the overhead of blocking_operation_wait (later abbreviated BOW).

I've been using async-cable as a benchmark because it has a good mixture of IO and nogvl (inflate/deflate) operations. Those operations are typically extremely small, so the overhead is very visible. Another benchmark is the recently introduced IO::Buffer#copy, which uses nogvl on large buffers. It is the opposite: highly CPU bound (memory bound, actually) work with little IO.

Semantics remain unchanged.
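To illustrate the general idea of a worker pool (reusing a fixed set of threads instead of creating one per blocking operation), here is a minimal pure Ruby sketch. It is not the implementation from this pull request: the class name `SketchWorkerPool` and its methods are hypothetical, and it only assumes that the work handed over is a callable.

```ruby
# Hypothetical minimal worker pool; not the implementation from this pull request.
class SketchWorkerPool
  def initialize(size: 2)
    @queue = Thread::Queue.new
    @threads = Array.new(size) do
      Thread.new do
        # Each worker loops until the queue is closed and drained (pop returns nil).
        while (job = @queue.pop)
          work, reply = job
          begin
            reply << [:ok, work.call]
          rescue => error
            reply << [:error, error]
          end
        end
      end
    end
  end

  # Run `work` (any callable) on a pooled thread and wait for its result.
  def call(work)
    reply = Thread::Queue.new
    @queue << [work, reply]
    status, value = reply.pop
    raise value if status == :error
    value
  end

  # Stop accepting work and wait for the workers to finish.
  def close
    @queue.close
    @threads.each(&:join)
  end
end

pool = SketchWorkerPool.new(size: 2)
result = pool.call(-> { (1..1_000).sum }) # => 500500
pool.close
```

The threads are created once, so each operation pays only for two queue hand-offs rather than thread creation. The worker still needs the GVL to run the Ruby-level dispatch, which is the kind of contention discussed below.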

Async::Cable Benchmarks

This benchmark is mostly network bound and there are a lot of small calls to inflate/deflate which use rb_nogvl:

| Configuration | Connection Time | Message Time |
|---|---|---|
| No BOW | 0.67ms | 0.024ms |
| Thread BOW | 2.2ms | 0.96ms |
| Work Pool BOW | 0.83ms | 0.045ms |

Overall, we can see a net loss in performance when offloading rb_nogvl with a pure Ruby implementation. I believe we can attribute this to the offloading thread having to re-acquire the GVL, which creates unnecessary contention. This is fixable but requires a native code path (probably in the IO::Event scheduler implementation).
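To put the net loss in numbers, the following sketch computes each configuration's slowdown relative to "No BOW" from the Async::Cable table above. The times are the table values converted to microseconds; the helper name `slowdown` is made up for this illustration.

```ruby
# Times in microseconds, copied from the Async::Cable table above.
timings = {
  "No BOW"        => {connection: 670,  message: 24},
  "Thread BOW"    => {connection: 2200, message: 960},
  "Work Pool BOW" => {connection: 830,  message: 45},
}

# Hypothetical helper: how many times slower than the baseline a configuration is.
def slowdown(timings, name, metric, baseline: "No BOW")
  (timings.fetch(name).fetch(metric).to_f / timings.fetch(baseline).fetch(metric)).round(1)
end

slowdown(timings, "Thread BOW", :connection)    # => 3.3
slowdown(timings, "Thread BOW", :message)       # => 40.0
slowdown(timings, "Work Pool BOW", :connection) # => 1.2
slowdown(timings, "Work Pool BOW", :message)    # => 1.9
```

By this measure the work pool removes most of the per-message overhead of the thread-per-operation approach, but it is still slower than not offloading at all.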

IO::Buffer Benchmarks

This benchmark is more memory bound and there is essentially zero blocking IO:

| Configuration | Task Count | Buffer Size | Duration | Throughput |
|---|---|---|---|---|
| No BOW | 1 | 100MiB | 6.46ms | 15GB/s |
| No BOW | 8 | 100MiB | 28.93ms | 27GB/s |
| No BOW | 16 | 100MiB | 56.81ms | 28GB/s |
| Thread BOW | 1 | 100MiB | 6.99ms | 14GB/s |
| Thread BOW | 8 | 100MiB | 24.52ms | 32GB/s |
| Thread BOW | 16 | 100MiB | 44.33ms | 36GB/s |
| Work Pool BOW | 1 | 100MiB | 7.12ms | 14GB/s |
| Work Pool BOW | 8 | 100MiB | 20.41ms | 39GB/s |
| Work Pool BOW | 16 | 100MiB | 43.53ms | 36GB/s |

Overall, the thread and work pool results are similar. I believe we see GVL contention even on the background threads, as I'd expect the numbers to scale a little more linearly, although it's also true that memory bandwidth isn't unlimited.
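For reference, the Throughput column appears to be consistent with task count × buffer size ÷ duration, treating each 100MiB buffer as 0.1GB and truncating the result. That formula is an inference from the table, not something stated in the benchmark itself, and the helper name `throughput` is hypothetical.

```ruby
# Assumed formula (inferred from the table): tasks * buffer size / duration,
# with the 100MiB buffer counted as 0.1GB and the result truncated.
def throughput(task_count, duration_ms, buffer_mib: 100)
  (task_count * buffer_mib / duration_ms).floor
end

# [configuration, task count, duration in ms, throughput reported in the table]
rows = [
  ["No BOW",         1,  6.46, 15],
  ["No BOW",         8, 28.93, 27],
  ["No BOW",        16, 56.81, 28],
  ["Thread BOW",     1,  6.99, 14],
  ["Thread BOW",     8, 24.52, 32],
  ["Thread BOW",    16, 44.33, 36],
  ["Work Pool BOW",  1,  7.12, 14],
  ["Work Pool BOW",  8, 20.41, 39],
  ["Work Pool BOW", 16, 43.53, 36],
]

# Rows whose reported throughput differs from the assumed formula.
mismatches = rows.reject { |_, tasks, ms, reported| throughput(tasks, ms) == reported }
```

Under this reading, going from 1 to 16 tasks raises throughput from roughly 14-15GB/s to at most 36GB/s, well short of the 16× a perfectly linear scaling would give.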
