Closed chkothe closed 2 years ago
@chkothe I know you wanted to do a little more work on this. No rush.
I just wanted to flag a user report of CPU utilization going up 10x; I think it's because the outlet buffer is overflowing. Let's make sure to test performance in the outlet-buffer-full case.
Yes, planning to do some more polish & account for some of the suggested changes once I'm out of my current sprint (hopefully over Christmas).
For the buffer overflow case: we should probably emit a one-time (or periodically repeated) log warning that the buffer is overflowing, since in that case the user is going to see data loss and significant transmit delay. Basically we'd want to recognize that as not a normal (steady-state) operating condition but more like a "you have a big problem with your data transmission" kind of situation.
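One way to sketch the "one-time or periodically repeated" warning idea: a small rate limiter around the log call. This is purely illustrative (the class name, interval, and message are assumptions, not from the codebase):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical sketch: emit an overflow warning at most once per interval,
// so a persistently full buffer doesn't flood the log.
class OverflowWarner {
public:
    // Returns true if a warning was actually emitted at this call.
    bool warn(std::chrono::steady_clock::time_point now) {
        if (!warned_once_ || now - last_warn_ >= interval_) {
            warned_once_ = true;
            last_warn_ = now;
            std::fprintf(stderr,
                         "outlet buffer overflow: samples are being dropped; "
                         "check your transmission bandwidth\n");
            return true;
        }
        return false;
    }

private:
    bool warned_once_ = false;
    std::chrono::steady_clock::time_point last_warn_{};
    std::chrono::seconds interval_{60}; // illustrative repeat period
};
```

The first overflow always logs; afterwards the message repeats at most once per minute while the condition persists.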
FYI, I've rebased the branch and pushed it here. I'll see to it that I finish the review and some benchmarks so we can get this merged for 1.16.
Looks good to me so far, the remaining comments aren't critical and can be addressed after it's merged.
I found one optimization for `add_wrap(x, 1)` that produces more compact assembly even with `-O3`:
```cpp
#include <cstddef> // std::size_t lives here, not in <cstdint>

using T = std::size_t;
extern const T wrap_at_;

inline T _add_wrap(T x, T delta) {
    const T xp = x + delta;
    return xp >= wrap_at_ ? xp - wrap_at_ : xp;
}

T add1_wrap(T x) { return ++x == wrap_at_ ? 0 : x; }

T add_wrap(T x) { return _add_wrap(x, 1); }
```
x86_64:

```asm
add1_wrap(unsigned long):               # @add1_wrap(unsigned long)
        add     rdi, 1
        xor     eax, eax
        cmp     rdi, qword ptr [rip + wrap_at_]
        cmovne  rax, rdi
        ret
add_wrap(unsigned long):                # @add_wrap(unsigned long)
        mov     rax, rdi
        add     rax, 1
        mov     rcx, qword ptr [rip + wrap_at_]
        xor     edx, edx
        cmp     rax, rcx
        cmovae  rdx, rcx
        sub     rax, rdx
        ret
```
ARM:

```asm
add1_wrap(unsigned long):
        adrp    x1, wrap_at_
        add     x0, x0, 1
        ldr     x1, [x1, #:lo12:wrap_at_]
        cmp     x1, x0
        csel    x0, x0, xzr, ne
        ret
add_wrap(unsigned long):
        adrp    x1, wrap_at_
        add     x0, x0, 1
        ldr     x1, [x1, #:lo12:wrap_at_]
        subs    x1, x0, x1              // avoided in add1_wrap
        csel    x0, x1, x0, cs
        ret
```
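For a quick sanity check that the compact variant is semantically identical, both functions can be exercised against each other in a self-contained program (here `wrap_at_` is given an arbitrary illustrative value of 1000, since the real constant is defined elsewhere in the library):

```cpp
#include <cstddef>

using T = std::size_t;

// Illustrative stand-in for the library's extern constant.
const T wrap_at_ = 1000;

inline T _add_wrap(T x, T delta) {
    const T xp = x + delta;
    return xp >= wrap_at_ ? xp - wrap_at_ : xp;
}

// Specialized increment-and-wrap: one compare against the wrap point.
T add1_wrap(T x) { return ++x == wrap_at_ ? 0 : x; }

// General form, instantiated with delta == 1.
T add_wrap(T x) { return _add_wrap(x, 1); }
```

Both return the same result for every in-range index, including the wrap at `wrap_at_ - 1`.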
I have removed the `FORCEINLINE` macro in 2a3cd09aea8b908a78e27dc3a2e7a527e308abfe because 1) it's only used once and will be inlined even without any `inline` attribute, and 2) it causes linker errors with MinGW. `LIKELY` and `UNLIKELY` are also good candidates for trouble, and we already have `BOOST_LIKELY` and `BOOST_UNLIKELY`, so we might as well use those.
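For context, here is a minimal sketch of what such branch-hint macros amount to on GCC/Clang (Boost's `BOOST_LIKELY`/`BOOST_UNLIKELY` in `boost/config.hpp` provide the portable, already-maintained version of exactly this, which is the argument for reusing them):

```cpp
// Sketch of typical LIKELY/UNLIKELY definitions; on compilers without
// __builtin_expect they degrade to a plain expression.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

// Example use: hint that the error path is rare.
int clamp_positive(int v) {
    if (UNLIKELY(v < 0))
        return 0; // rare path
    return v;
}
```

The hints only affect code layout, never semantics, so a fallback to `(x)` is always safe.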
Effectively merged in #135
@chkothe -- Should #71 be closed too?
Prompted by jfrey-xx's recent work on the `consumer_queue`, which revealed a concurrency issue with the SPSC queue at buffer overflow, and which traded lower CPU load at low sampling rates for higher load at high rates, I've spent a bit of time resolving an issue underlying that queue, so we can have the best of both worlds (high throughput and low CPU use). Namely, in our scenario there's an edge case where the producer becomes a second consumer, and thus we technically need an SPMC queue, not an SPSC queue.

In the current implementation this is addressed by some additional locks to protect against that scenario, but unfortunately it means that producer and consumer cannot make progress at the same time (i.e., they stall each other). This is only revealed by 2-thread benchmarks, which I believe we don't have in the benchmark suite yet (I added a benchmark script to my fork that can measure that).
I’ve looked around a bit to find an SPMC queue that would get the job done, and it turns out that there’s a pretty frequently used ring buffer scheme for that (used, among others, in xenium, in a queue by Rigtorp used in Frostbite, and in a few other pieces of software, with an original implementation by Vyukov; all of these are in fact MPMC). It’s actually faster than boost’s `spsc_queue` (because it doesn’t need to synchronize between reader and writer at all, unless they act on the same queue element). So I’ve adapted that for our `consumer_queue` (`try_push`/`try_pop`), in a simplified SPMC version, since we only need one producer (that gives a bit of extra performance).

I’ve done a bit of benchmarking, and on my machine (Ubuntu laptop) it comes out at >60% higher max end-to-end throughput between a `push_sample` and a `pull_sample` thread relative to the current lock-based variant (~4.8 MHz instead of ~2.9 MHz). With these changes, there’s only one (rarely contended and thus quite fast) lock left, around the `cv.notify_one()` for use with a wait timeout, but since it’s now faster than even the old lock-free variant, we should be OK for a while. (One could use optional locking to make the queue fully lock-free & wait-free again, except when it’s full or empty, but I’d need to do a bit more testing on that first.)

Other than that I’ve done a decent amount of testing, but of course we’ll want to let it simmer a bit and maybe write a few more unit tests before we can pass that on to users.