Open · jonhoo opened this issue 4 years ago
It looks to me like the lock is held for the minimum and correct span (i.e. not including actually running the blocking-operation closure). In my own testing of an alternative thread pool (in the blocking-permit crate), using a parking_lot mutex and condvar with minimized notifications appears to give the best performance, at least on Linux; a minimal sketch of that pattern follows.

What, if anything, is suggested as an alternative?
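For concreteness, here is a minimal sketch of the mutex-plus-condvar-with-minimized-notifications pattern described above. The `Shared`/`State` names and the idle counter are my own illustration, not blocking-permit's actual internals:

```rust
use parking_lot::{Condvar, Mutex};
use std::collections::VecDeque;

// Hypothetical shared state for a blocking pool. The "minimized
// notifications" idea: only notify when a worker is known to be parked.
struct Shared<T> {
    queue: Mutex<State<T>>,
    condvar: Condvar,
}

struct State<T> {
    jobs: VecDeque<T>,
    idle: usize, // workers currently blocked in `wait`
}

impl<T> Shared<T> {
    fn push(&self, job: T) {
        let mut s = self.queue.lock();
        s.jobs.push_back(job);
        // Skip the (comparatively expensive) notify when no one is waiting.
        if s.idle > 0 {
            self.condvar.notify_one();
        }
    }

    fn pop(&self) -> T {
        let mut s = self.queue.lock();
        loop {
            if let Some(job) = s.jobs.pop_front() {
                return job;
            }
            s.idle += 1;
            self.condvar.wait(&mut s);
            s.idle -= 1;
        }
    }
}
```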
The solution here would be to use a lockless queue, e.g. crossbeam_deque (sketched below). @Darksonn On that note, why was the crossbeam stuff purged from tokio? Every tokio runtime's injector uses locks, which seems like a clear downgrade from a lockless queue.
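For reference, a lock-free injection queue built on crossbeam_deque might look roughly like this; `Task` is a placeholder type, and a real pool would park rather than exit on `Steal::Empty`:

```rust
use crossbeam_deque::{Injector, Steal};

type Task = Box<dyn FnOnce() + Send>;

fn worker_loop(injector: &Injector<Task>) {
    loop {
        match injector.steal() {
            Steal::Success(task) => task(), // got a task without taking a lock
            Steal::Empty => break,          // a real pool would park here instead
            Steal::Retry => continue,       // lost a race with another stealer; retry
        }
    }
}

fn main() {
    let injector = Injector::new();
    let task: Task = Box::new(|| println!("ran off the injector"));
    injector.push(task);
    worker_loop(&injector);
}
```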
I've seen this cause perf issues before in code that does a ton of file I/O.
I've spent a bit of time thinking about what the optimal solution could look like:

- Tasks are pushed to the `SegQueue`. Active threads poll the queue, twice if `thread_cap` has not been reached yet.
- If the second poll returns a value, start a new thread and give it that second task (putting it in the slab).
- If the queue is empty, the thread puts itself in the idle-thread queue and parks itself with a timeout.
- New tasks can take a thread out of the idle queue and notify it (assuming it's not dead; otherwise remove it from the slab and loop).
- On notify or timeout, a thread checks whether there's work and lets itself die if not, marking itself as dead.
- On shutdown, notify all the idle threads and then join everything in the slab.

A rough sketch of the core loop is below.
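Here is roughly what the push/park halves of that design could look like. The `Pool` type and field names are mine; the slab bookkeeping, dead-thread check, thread cap, and shutdown are all elided (crossbeam-queue 0.3's `SegQueue` API assumed):

```rust
use crossbeam_queue::SegQueue;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

type Task = Box<dyn FnOnce() + Send>;

struct Pool {
    tasks: SegQueue<Task>,          // pending blocking tasks
    idle: SegQueue<thread::Thread>, // handles of parked workers
}

impl Pool {
    fn spawn_task(&self, task: Task) {
        self.tasks.push(task);
        // Wake a parked worker if there is one; the full design would
        // otherwise consider starting a new thread (up to thread_cap).
        if let Some(worker) = self.idle.pop() {
            worker.unpark();
        }
    }

    fn worker_loop(self: Arc<Self>) {
        loop {
            // Drain whatever work is available.
            while let Some(task) = self.tasks.pop() {
                task();
            }
            // No work: register as idle, then park with a timeout.
            self.idle.push(thread::current());
            thread::park_timeout(Duration::from_secs(10));
            // On notify or timeout, die if there is still no work; the
            // full design would also mark this thread dead in the slab.
            match self.tasks.pop() {
                Some(task) => task(),
                None => return,
            }
        }
    }
}
```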
We generally want to avoid adding new dependencies if at all possible.
One nice thing about the injector queue is that, unlike run queues, tasks cannot be dropped from outside of the queue while they are in the injector queue. This means that we may be able to use an approach like Vyukov's intrusive singly-linked MPSC queue for injector queues; that's something I've wanted to look into for a while, actually. A sketch of the idea (in its simpler, non-intrusive form) is below.
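To show the shape of that idea, here is a minimal non-intrusive version of Vyukov's MPSC queue. The intrusive variant would embed the `next` pointer in the task header rather than allocating a `Node`, and this sketch omits a `Drop` impl (remaining nodes leak), so it is not production code:

```rust
use std::cell::UnsafeCell;
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    next: AtomicPtr<Node<T>>,
    value: Option<T>, // None only for the stub node
}

pub struct MpscQueue<T> {
    head: AtomicPtr<Node<T>>,       // producers swap here
    tail: UnsafeCell<*mut Node<T>>, // the single consumer advances here
}

unsafe impl<T: Send> Send for MpscQueue<T> {}
unsafe impl<T: Send> Sync for MpscQueue<T> {}

impl<T> MpscQueue<T> {
    pub fn new() -> Self {
        let stub = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: None,
        }));
        MpscQueue {
            head: AtomicPtr::new(stub),
            tail: UnsafeCell::new(stub),
        }
    }

    // Safe to call from any number of producer threads.
    pub fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node {
            next: AtomicPtr::new(ptr::null_mut()),
            value: Some(value),
        }));
        let prev = self.head.swap(node, Ordering::AcqRel);
        // Link the previous node to the new one; until this store lands,
        // the consumer sees a momentarily broken chain and returns None.
        unsafe { (*prev).next.store(node, Ordering::Release) };
    }

    // Must only be called from the single consumer thread.
    pub unsafe fn pop(&self) -> Option<T> {
        unsafe {
            let tail = *self.tail.get();
            let next = (*tail).next.load(Ordering::Acquire);
            if next.is_null() {
                return None;
            }
            *self.tail.get() = next;
            let value = (*next).value.take();
            drop(Box::from_raw(tail)); // free the old stub/consumed node
            value
        }
    }
}
```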
Don't we need MPMC though?
I've experienced this in benchmarks with high RPS doing very small reads from files on a ramdisk. Replacing `spawn_blocking` with `block_in_place` actually improved things in these cases (obviously, with larger reads `block_in_place` was unusable; this only confirmed that queueing is in fact an issue at 20k+ RPS). So I'm very interested in this issue, and when it becomes a blocker I'll try to contribute.
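For context, these are the two call shapes being compared; `read_small` and the ramdisk path are stand-ins for the benchmark's workload, and `block_in_place` requires the multi-threaded runtime:

```rust
use tokio::task;

// Stand-in for the "very small reads from files on a ramdisk" workload.
fn read_small() -> std::io::Result<Vec<u8>> {
    std::fs::read("/dev/shm/data") // hypothetical path
}

async fn via_spawn_blocking() -> std::io::Result<Vec<u8>> {
    // Queues the closure onto the blocking pool (taking its shared lock).
    task::spawn_blocking(read_small).await.unwrap()
}

async fn via_block_in_place() -> std::io::Result<Vec<u8>> {
    // Runs inline on the current worker thread, so the read itself does
    // not go through the spawn_blocking queue. Panics on the
    // current-thread runtime.
    task::block_in_place(read_small)
}
```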
`block_in_place` actually does do basically a `spawn_blocking` call to obtain a thread for a new worker, so this is interesting.
Code that invokes `spawn_blocking` or `block_in_place` (which internally calls `spawn_blocking`) ends up taking a shared lock here:

https://github.com/tokio-rs/tokio/blob/221f421464e6672e9cf25629998477946a8b8e1f/tokio/src/runtime/blocking/pool.rs#L153

This causes a lot of unnecessary contention between unrelated tasks if you frequently need to run blocking code. The problem is exacerbated as load increases, because more calls to `spawn_blocking` cause each individual call to become more expensive, and each call holds up an executor thread.
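A hedged micro-repro of that contention, under the assumption that a burst of tiny `spawn_blocking` calls is enough to make the shared mutex hot:

```rust
#[tokio::main]
async fn main() {
    // Each of these calls takes the blocking pool's shared lock on the
    // spawn path, so the queueing itself dominates the tiny workload.
    let handles: Vec<_> = (0..10_000)
        .map(|_| tokio::task::spawn_blocking(|| std::hint::black_box(1 + 1)))
        .collect();
    for handle in handles {
        handle.await.unwrap();
    }
}
```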