Closed wyfo closed 4 months ago
Thanks for the proposal.
Is the performance improvement from algorithm changes or the addition of fast-path that doesn't call CAS/RMW or both? Analyzing them would be helpful.
In particular, if the second, it can be adopted into our code without the risk of affecting the correctness of the code.
My implementation was obviously broken, and not testing it with loom before posting this issue was quite dumb from me.
Currently,
AtomicWaker::take
does a lot of work for nothing when there is no registered waker. It causes performance issue in a contended environment where the function is called a lot.I've designed a new algorithm to optimize this function, and compared the performance. A simple micro-benchmarks give a huge x100 improvement:
futures::task::AtomicWaker
6.585227mscrate::AtomicWaker
65.72µsBenchmark code:
These results can easily be explained: no waker registered means a single
AtomicU8::load
with the new algorithm. The function is less than 20 assembly instructions (on my computer), so I've added#[inline]
, but even without inlining, it's still around 100µs, way faster than the current implementation.For the context, I'm developping an MPSC queue, where each enqueuing operation notify the consumer.
AtomicWaker
seems to fit for the consumer, but, as stated above, the performance impact is huge. In fact, I can't even useAtomicWaker
because my code also support synchronous blocking operations, see my proposal https://internals.rust-lang.org/t/thread-park-waker-a-waker-calling-thread-unpark/19114. But, as I've implemented this new algorithm (generic in my case, to handleThread::unpark
), I thought it would be a good idea to propose it forAtomicWaker
optimization.Disclaimer: This algorithm may be bugged, memory order may be improvable, I'm not an expert, and I have not take the time for now to test it with loom. However, I've already tested in "real" situation, in my queue benchmarks, and it seems to work well.
EDIT: The original implementation is here