schets / multiqueue

A fast mpmc queue with broadcast capabilities
MIT License

Latency profile of broadcast vs mpmc #27

Open jonathanstrong opened 6 years ago

jonathanstrong commented 6 years ago

In latency benchmarks for scenarios with many producers and 4-12 consumers, I found the broadcast queue to have much worse tail latency than a set of separate mpmc queues that each producer sends to in turn.

This "parallel mpmc" pattern works like this (rough example):

let mut receivers = Vec::new();
let mut senders = Vec::new();

for _ in 0..N_CONSUMERS {
    let (tx, rx) = mpmc_queue(QUEUE_SIZE);
    senders.push(tx);
    receivers.push(rx);
}

for _ in 0..N_PRODUCERS {
    let my_senders: Vec<_> = senders.iter().cloned().collect(); // one sender per queue for this producer
    // launch thread...
}

// later, in producer thread...
let item = /* ... */;
for tx in &my_senders {
    tx.try_send(item.clone()).unwrap();
}

// in consumer thread
for msg in rx.try_iter() {
    // act on messages...
}

The broadcast setup followed the doc examples, including calling unsubscribe on the unused original BroadcastReceiver; the difference is that the consumers each read their own stream of the same queue rather than their own dedicated queue.
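
For concreteness, here is a minimal sketch of that broadcast setup (hedged: it assumes multiqueue's broadcast_queue, add_stream, and unsubscribe as shown in the crate docs, and uses made-up literals in place of the real queue size and consumer count):

// sketch only: queue size 8 and 6 consumers are placeholder values
let (tx, rx) = multiqueue::broadcast_queue::<u64>(8);

let mut streams = Vec::new();
for _ in 0..6 {
    // each consumer reads its own stream of the same underlying queue
    streams.push(rx.add_stream());
}
// the original receiver is never read from, so remove it from the queue
rx.unsubscribe();

// producer side: one send is seen by every stream
tx.try_send(42).unwrap();

// consumer side, each thread owning one stream:
for stream in &streams {
    while let Ok(msg) = stream.try_recv() {
        // act on msg...
        let _ = msg;
    }
}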

In a 40-producer, 6-consumer test, the two setups have nearly identical latency profiles up to the 99.9th percentile. Beyond that, the broadcast queue fares much worse:

percentile   broadcast   par mpmc
---------------------------------
99.99%       307µs       85µs
99.999%      4.7ms       593µs
99.9999%     8.7ms       600µs*

* one 9ms outlier

[attached: latency distribution plots for the two setups]

Caveats:

Questions:

Thanks!

schets commented 6 years ago

Can you post the entire code? This is somewhat expected on an oversubscribed system, due to the differences in how broadcast and mpmc have to notify consumers.

The broadcast queue is mostly built for systems where the consumers have dedicated cores and are busy waiting.
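
As an illustration, the intended consumer pattern is roughly a spin loop on a dedicated core (a sketch only: stream stands in for one consumer's receiver, and handle() is a placeholder for the application's work):

// hypothetical busy-wait consumer pinned to its own core
loop {
    if let Ok(msg) = stream.try_recv() {
        handle(msg);            // application work, placeholder
    } else {
        // queue empty: keep spinning instead of parking the thread,
        // so there is no wakeup/notification latency on the next item
        std::hint::spin_loop();
    }
}

(A real loop would also check for a disconnected sender and break out.)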

jonathanstrong commented 6 years ago

Yes, will do. It will take a day or so since I need to pull out some proprietary code.

In the meantime, I have done a lot more testing, and it's fair to say results are all over the map.

These are results for the same two approaches, but on a VPS w/ 4 "dedicated" cores (8 w/ hyperthreads), a lowlatency kernel, isolcpus=4,5,6,7, and two consumers pinned to two of the isolated cores:

[attached: latency distribution plots for the two setups on the isolated-core VPS]
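
For anyone reproducing the pinning described above, a rough sketch of what it amounts to (the core_affinity crate is an assumption here; this report doesn't say how the threads were actually pinned):

// hypothetical: pin the current consumer thread to one isolated core
// (with isolcpus=4,5,6,7 the scheduler leaves cores 4-7 alone)
let cores = core_affinity::get_core_ids().expect("could not list cores");
let isolated = cores.into_iter()
    .find(|c| c.id == 4)            // pick one of the isolated cores
    .expect("core 4 not available");
core_affinity::set_for_current(isolated);
// ...then run the consumer's receive loop on this thread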

So, slightly better performance from broadcast.

Both setups had one cataclysmic outlier over the two hours (3ms and 9ms respectively), which aren't included; I selected a calm hour for each (I figure the system is at fault there).

One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use size 4, but it didn't occur to me at the time that I should really set it in that range.

But the biggest takeaway is that these tests are very sensitive to the machine they are run on. It's pretty finicky: more tests = more doubt. I would be interested in any input you can provide on best practices. I'll post the test code when I can untangle it a bit. Thanks!

schets commented 6 years ago

> One big change: initially I had the queue size at 4096. I read in an LMAX FAQ that a smaller size allows the entire buffer to stay in cache, and the results above are for a queue size of 8. It might be a good idea to include some guidelines on this in the docs. I think your examples use size 4, but it didn't occur to me at the time that I should really set it in that range.

I think this helps explain it! The thing with the broadcast queue is that no writer can ever write past a fallen-behind subscriber (the existence of destructors makes that trickier). With a small queue size, it is much easier for a reader to fall behind and block writers.
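
A tiny sketch of that effect (assuming the broadcast API used above; the exact count at which the sender gives up depends on how the capacity is rounded internally):

let (tx, rx) = multiqueue::broadcast_queue::<u64>(8);
let slow_stream = rx.add_stream();  // a subscriber that never reads
rx.unsubscribe();

// with nothing consumed from slow_stream, the writer must eventually
// refuse to overwrite unread items rather than write past the subscriber
let mut sent = 0u64;
while tx.try_send(sent).is_ok() {
    sent += 1;
}
println!("writer blocked after {} unread items", sent);
drop(slow_stream);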

jonathanstrong commented 6 years ago

Here is a working version of the benchmark code: https://gist.github.com/jonathanstrong/d88d474f5b0ea147e5bc98d9df0910a2

ProfFan commented 6 years ago

@jonathanstrong Any chance you can run the latency benchmarks on a system with a -lowlatency kernel? I think that would help eliminate the spikes and add a bit more reliability to the tests.

jonathanstrong commented 6 years ago

@ProfFan Apologies, I should have mentioned it: the tests were run on the lowlatency kernel, Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-128-lowlatency x86_64). If there are any other system-level improvements you would suggest, I would definitely be interested; good information is hard to find.

ProfFan commented 6 years ago

@jonathanstrong Debian has a linux-image-rt package, which is the kernel patched with the PREEMPT_RT patches. (See https://packages.debian.org/stretch/linux-image-rt-amd64)

I think this would be pretty interesting especially for robotics and embedded developers (like me 😄 )