rust-lang / futures-rs

Zero-cost asynchronous programming in Rust
https://rust-lang.github.io/futures-rs/
Apache License 2.0
5.34k stars 616 forks source link

`AtomicWaker::take` performance issue #2760

Closed wyfo closed 1 month ago

wyfo commented 1 year ago

Currently, AtomicWaker::take does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.

I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:

Benchmark code:

let waker = AtomicWaker::new(); // no registered waker
let start = std::Instant::now();
(0..100000).into_par_iter().for_each(|_| waker.wake()); // use rayon for contention
println!("{:?}", start.elapsed());

These results can easily be explained: no waker registered means a single AtomicU8::load with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added #[inline], but even without inlining, it's still around 100µs, way faster than the current implementation.

For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer. AtomicWaker seems to fit for the consumer, but, as stated above, the performance impact is huge. In fact, I can't even use AtomicWaker because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handle Thread::unpark), I thought it would be a good idea to propose it for AtomicWaker optimization.

Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.

EDIT: The original implementation is here

taiki-e commented 1 year ago

Thanks for the proposal.

Is the performance improvement from algorithm changes or the addition of fast-path that doesn't call CAS/RMW or both? Analyzing them would be helpful.

In particular, if the second, it can be adopted into our code without the risk of affecting the correctness of the code.

wyfo commented 1 month ago

My implementation was obviously broken, and not testing it with loom before posting this issue was quite dumb from me.