schets / multiqueue

A fast mpmc queue with broadcast capabilities
MIT License

Latency profile of broadcast vs mpmc #27

Open jonathanstrong opened 6 years ago

jonathanstrong commented 6 years ago

In latency benchmarks for scenarios with many producers and 4-12 consumers, I found the broadcast queue to have much worse tail latency than a set of separate mpmc queues that each producer sends to in turn.

This "parallel mpmc" pattern works like this (rough example):

let mut receivers = Vec::new();
let mut senders = Vec::new();

for _ in 0..N_CONSUMERS {
    let (tx, rx) = mpmc_queue(QUEUE_SIZE);
    senders.push(tx);
    receivers.push(rx);
}

for _ in 0..N_PRODUCERS {
    let my_senders: Vec<_> = senders.iter().cloned().collect(); // one sender per queue for this producer
    // launch thread...
}

// later, in producer thread...
let item = /* ... */;
for tx in &my_senders {
    tx.try_send(item.clone()).unwrap();
}

// in consumer thread
for msg in rx.try_iter() {
    // act on messages...
}

The broadcast setup followed the doc examples, including calling unsubscribe on the unused original BroadcastReceiver; the difference is that the consumers each read their own stream of the same queue rather than their own dedicated queue.
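
For concreteness, here is a minimal sketch of that broadcast setup (hedged: it assumes multiqueue's broadcast_queue, add_stream, and unsubscribe as shown in the crate docs, and uses made-up literals in place of the real queue size and consumer count):

// sketch only: queue size 8 and 6 consumers are placeholder values
let (tx, rx) = multiqueue::broadcast_queue::<u64>(8);

let mut streams = Vec::new();
for _ in 0..6 {
    // each consumer reads its own stream of the same underlying queue
    streams.push(rx.add_stream());
}
// the original receiver is never read from, so remove it from the queue
rx.unsubscribe();

// producer side: one send is seen by every stream
tx.try_send(42).unwrap();

// consumer side, each thread owning one stream:
for stream in &streams {
    while let Ok(msg) = stream.try_recv() {
        // act on msg...
        let _ = msg;
    }
}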

In a 40-producer, 6-consumer test, the two setups have nearly identical latency profiles up to the 99.9th percentile. Beyond that, the broadcast queue fares much worse:

percentile   broadcast   par mpmc
---------------------------------
99.99%       307µs       85µs
99.999%      4.7ms       593µs
99.9999%     8.7ms       600µs*

* one 9ms outlier

[attached: latency distribution plots for the two setups]

Caveats:

Questions:

Thanks!

schets commented 6 years ago

Can you post the entire code? This is somewhat expected on an oversubscribed system, due to the differences in how broadcast and mpmc have to notify consumers.

The broadcast queue is mostly built for systems where the consumers have dedicated cores and are busy waiting.
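
As an illustration, the intended consumer pattern is roughly a spin loop on a dedicated core (a sketch only: stream stands in for one consumer's receiver, and handle() is a placeholder for the application's work):

// hypothetical busy-wait consumer pinned to its own core
loop {
    if let Ok(msg) = stream.try_recv() {
        handle(msg);            // application work, placeholder
    } else {
        // queue empty: keep spinning instead of parking the thread,
        // so there is no wakeup/notification latency on the next item
        std::hint::spin_loop();
    }
}

(A real loop would also check for a disconnected sender and break out.)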

jonathanstrong commented 6 years ago

Yes, will do. It will take a day or so since I need to pull out some proprietary code.

In the meantime, I have done a lot more testing, and it's fair to say results are all over the map.

These are results for the same two approaches, but on a VPS w/ 4 "dedicated" cores (8 w/ hyperthreads), a lowlatency kernel, isolcpus=4,5,6,7, and two consumers pinned to two of the isolated cores:

[attached: latency distribution plots for the two setups on the isolated-core VPS]
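
For anyone reproducing the pinning described above, a rough sketch of what it amounts to (the core_affinity crate is an assumption here; this report doesn't say how the threads were actually pinned):

// hypothetical: pin the current consumer thread to one isolated core
// (with isolcpus=4,5,6,7 the scheduler leaves cores 4-7 alone)
let cores = core_affinity::get_core_ids().expect("could not list cores");
let isolated = cores.into_iter()
    .find(|c| c.id == 4)            // pick one of the isolated cores
    .expect("core 4 not available");
core_affinity::set_for_current(isolated);
// ...then run the consumer's receive loop on this thread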

So, slightly better performance from broadcast.

Both setups had one cataclysmic outlier over the two hours (3ms and 9ms respectively), which aren't included; I selected a calm hour for each (I figure the system is at fault there).

One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use size 4, but it didn't occur to me at the time that I should really set it in that range.

But the biggest takeaway is that these tests are very sensitive to the machine they are run on. It's pretty finicky: more tests = more doubt. I would be interested in any input you can provide on best practices. I'll post the test code when I can untangle it a bit. Thanks!

schets commented 6 years ago

> One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use size 4, but it didn't occur to me at the time that I should really set it in that range.

I think this helps explain it! The thing with the broadcast queue is that no writer can ever write past a fallen-behind subscriber (the existence of destructors makes that trickier). With a small queue size, it is much easier for a reader to fall behind and block writers.
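
A tiny sketch of that effect (assuming the broadcast API used above; the exact count at which the sender gives up depends on how the capacity is rounded internally):

let (tx, rx) = multiqueue::broadcast_queue::<u64>(8);
let slow_stream = rx.add_stream();  // a subscriber that never reads
rx.unsubscribe();

// with nothing consumed from slow_stream, the writer must eventually
// refuse to overwrite unread items rather than write past the subscriber
let mut sent = 0u64;
while tx.try_send(sent).is_ok() {
    sent += 1;
}
println!("writer blocked after {} unread items", sent);
drop(slow_stream);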

jonathanstrong commented 6 years ago

Here is a working version of the benchmark code: https://gist.github.com/jonathanstrong/d88d474f5b0ea147e5bc98d9df0910a2

ProfFan commented 6 years ago

@jonathanstrong Any chance you can run the latency benchmarks on a system with a -lowlatency kernel? I think that would help eliminate the spikes and add a bit more reliability to the tests.

jonathanstrong commented 6 years ago

@ProfFan Apologies, I should have mentioned it: the tests were run on the lowlatency kernel, Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-128-lowlatency x86_64). If there are any other system-level improvements you would suggest, I would definitely be interested; good information is hard to find.

ProfFan commented 6 years ago

@jonathanstrong Debian has a linux-image-rt package, which is the kernel patched with the PREEMPT_RT patches. (See https://packages.debian.org/stretch/linux-image-rt-amd64)

I think this would be pretty interesting especially for robotics and embedded developers (like me 😄 )