Worker pool for efficient handling of `blocking_operation_wait`.

See https://github.com/socketry/async/pull/352 for context.

The cost of creating threads and the usage of nogvl both create inefficiencies.

rb_nogvl should really only be used when the amount of work to be done is greater than some scheduling quantum. In practice, that's hard to achieve, so we also want to minimise the overhead of blocking_operation_wait (later abbreviated BOW).
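To make the idea concrete, here is a minimal sketch of a worker pool that reuses a fixed set of threads instead of creating one per operation. It is not the implementation in this PR: the class name `SketchWorkerPool`, the `Job` struct and the `call`/`close` methods are all hypothetical, and the caller simply blocks on a queue, whereas a real scheduler hook would suspend the calling fiber instead.

```ruby
# Hypothetical sketch: a fixed set of threads that execute submitted callables.
class SketchWorkerPool
  # A unit of work plus a queue used to hand the outcome back to the caller.
  Job = Struct.new(:work, :reply)

  def initialize(size: 2)
    @jobs = Thread::Queue.new
    @threads = Array.new(size) do
      Thread.new do
        # Queue#pop returns nil once the queue is closed and drained.
        while (job = @jobs.pop)
          begin
            job.reply << [:ok, job.work.call]
          rescue StandardError => error
            job.reply << [:error, error]
          end
        end
      end
    end
  end

  # Run `work` on a pooled thread and block the caller until it finishes.
  # Errors raised by `work` are re-raised in the caller.
  def call(work)
    reply = Thread::Queue.new
    @jobs << Job.new(work, reply)
    status, value = reply.pop
    raise value if status == :error
    value
  end

  # Stop accepting work and wait for the worker threads to exit.
  def close
    @jobs.close
    @threads.each(&:join)
  end
end

pool = SketchWorkerPool.new(size: 2)
result = pool.call(-> { 21 * 2 })
pool.close
```

The threads are created once, so each operation only pays for two queue hand-offs rather than for thread creation and teardown.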

I've been using async-cable as a benchmark, as it has a good mixture of IO and nogvl (inflate/deflate) operations. Those operations are typically extremely small, so the offloading overhead shows up clearly. Another benchmark is the recently introduced IO::Buffer#copy, which uses nogvl on large buffers. It is the opposite: highly CPU bound (actually memory bound) work with little IO.
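As a rough illustration of why tiny operations expose the overhead, the sketch below times a small Zlib round trip run inline against the same round trip run on a freshly created thread each time. This is not the async-cable benchmark; the payload, the iteration count and the thread-per-operation strategy are assumptions chosen only to show the shape of the comparison, and the timings will vary by machine.

```ruby
require "zlib"
require "benchmark"

# Assumed workload: a payload small enough that the work itself is tiny.
payload = "hello world " * 8
iterations = 1_000

# Baseline: compress and decompress on the calling thread.
inline_results = nil
inline_time = Benchmark.realtime do
  inline_results = Array.new(iterations) do
    Zlib::Inflate.inflate(Zlib::Deflate.deflate(payload))
  end
end

# Offloaded: the same round trip, but each one pays for a new thread.
threaded_results = nil
threaded_time = Benchmark.realtime do
  threaded_results = Array.new(iterations) do
    Thread.new { Zlib::Inflate.inflate(Zlib::Deflate.deflate(payload)) }.value
  end
end

puts format("inline: %.3fms per operation", inline_time * 1000 / iterations)
puts format("thread: %.3fms per operation", threaded_time * 1000 / iterations)
```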

Semantics remain unchanged.

`Async::Cable` Benchmarks

This benchmark is mostly network bound, with many small calls to inflate/deflate, which use rb_nogvl:

| Configuration | Connection Time | Message Time |
|---|---|---|
| No BOW | 0.67ms | 0.024ms |
| Thread BOW | 2.2ms | 0.96ms |
| Work Pool BOW | 0.83ms | 0.045ms |

Overall, offloading rb_nogvl with a pure Ruby implementation is a net loss in performance compared to not offloading at all. I believe we can attribute this to the offloading thread having to re-acquire the GVL, which creates unnecessary contention. This is fixable but requires a native code path (probably in the IO::Event scheduler implementation).

`IO::Buffer` Benchmarks

This benchmark is more memory bound and there is essentially zero blocking IO:

| Configuration | Task Count | Buffer Size | Duration | Throughput |
|---|---|---|---|---|
| No BOW | 1 | 100MiB | 6.46ms | 15GB/s |
| No BOW | 8 | 100MiB | 28.93ms | 27GB/s |
| No BOW | 16 | 100MiB | 56.81ms | 28GB/s |
| Thread BOW | 1 | 100MiB | 6.99ms | 14GB/s |
| Thread BOW | 8 | 100MiB | 24.52ms | 32GB/s |
| Thread BOW | 16 | 100MiB | 44.33ms | 36GB/s |
| Work Pool BOW | 1 | 100MiB | 7.12ms | 14GB/s |
| Work Pool BOW | 8 | 100MiB | 20.41ms | 39GB/s |
| Work Pool BOW | 16 | 100MiB | 43.53ms | 36GB/s |

Overall, the thread and work pool perform similarly. I believe we see GVL contention even on the background threads, as I'd expect throughput to scale a little more linearly with the task count, although memory bandwidth isn't unlimited either.
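To put a number on "more linear", the sketch below computes the speed-up of each configuration relative to its own single-task throughput, using the figures from the IO::Buffer table. The `THROUGHPUT` constant and the `scaling` helper are hypothetical names introduced for this illustration; perfectly linear scaling would give 8.0 and 16.0.

```ruby
# Throughput figures (GB/s as labelled) copied from the IO::Buffer table,
# keyed by task count.
THROUGHPUT = {
  "No BOW" => { 1 => 15, 8 => 27, 16 => 28 },
  "Thread BOW" => { 1 => 14, 8 => 32, 16 => 36 },
  "Work Pool BOW" => { 1 => 14, 8 => 39, 16 => 36 }
}.freeze

# Speed-up of each task count relative to the single-task throughput.
def scaling(throughputs)
  baseline = throughputs.fetch(1)
  throughputs.to_h { |tasks, value| [tasks, (value.to_f / baseline).round(2)] }
end

THROUGHPUT.each do |configuration, throughputs|
  puts "#{configuration}: #{scaling(throughputs)}"
end
```

The best case is roughly 2.8x (Work Pool BOW with 8 tasks), well short of 8x, which fits the suspicion that GVL contention or memory bandwidth is limiting the background threads.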

Types of Changes

Performance improvement.

Contribution

[x] I added tests for my changes.
[x] I tested my changes locally.
[x] I agree to the Developer's Certificate of Origin 1.1.
