Closed chkothe closed 2 years ago
@chkothe I know you wanted to do a little more work on this. No rush.
I just wanted to flag a user report of CPU utilization going up 10x; I think it's because the outlet buffer is overflowing. Let's make sure to test performance in the outlet-buffer-full case.
Yes, planning to do some more polish & account for some of the suggested changes once I'm out of my current sprint (hopefully over Christmas).
For the buffer overflow case: we should probably emit a one-time (or periodically repeated) log warning that the buffer is overflowing, since in that case the user is going to see data loss and significant transmit delay. Basically we'd want to recognize that as not a normal (steady-state) operating condition but more like a "you have a big problem with your data transmission" kind of situation.
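One way to sketch the "one-time or periodically repeated" warning idea: a small rate limiter around the log call. This is purely illustrative (the class name, interval, and message are assumptions, not from the codebase):

```cpp
#include <chrono>
#include <cstdio>

// Hypothetical sketch: emit an overflow warning at most once per interval,
// so a persistently full buffer doesn't flood the log.
class OverflowWarner {
public:
    // Returns true if a warning was actually emitted at this call.
    bool warn(std::chrono::steady_clock::time_point now) {
        if (!warned_once_ || now - last_warn_ >= interval_) {
            warned_once_ = true;
            last_warn_ = now;
            std::fprintf(stderr,
                         "outlet buffer overflow: samples are being dropped; "
                         "check your transmission bandwidth\n");
            return true;
        }
        return false;
    }

private:
    bool warned_once_ = false;
    std::chrono::steady_clock::time_point last_warn_{};
    std::chrono::seconds interval_{60}; // illustrative repeat period
};
```

The first overflow always logs; afterwards the message repeats at most once per minute while the condition persists.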
FYI, I've rebased the branch and pushed it here. I'll see to it that I finish the review and some benchmarks so we can get this merged for 1.16.
Looks good to me so far, the remaining comments aren't critical and can be addressed after it's merged.
I found one optimization for `add_wrap(x, 1)` that produces more compact assembly even with `-O3`:
```cpp
#include <cstddef> // std::size_t lives here, not in <cstdint>

using T = std::size_t;
extern const T wrap_at_;

inline T _add_wrap(T x, T delta) {
    const T xp = x + delta;
    return xp >= wrap_at_ ? xp - wrap_at_ : xp;
}

T add1_wrap(T x) { return ++x == wrap_at_ ? 0 : x; }

T add_wrap(T x) { return _add_wrap(x, 1); }
```
x86_64:

```asm
add1_wrap(unsigned long):               # @add1_wrap(unsigned long)
        add     rdi, 1
        xor     eax, eax
        cmp     rdi, qword ptr [rip + wrap_at_]
        cmovne  rax, rdi
        ret
add_wrap(unsigned long):                # @add_wrap(unsigned long)
        mov     rax, rdi
        add     rax, 1
        mov     rcx, qword ptr [rip + wrap_at_]
        xor     edx, edx
        cmp     rax, rcx
        cmovae  rdx, rcx
        sub     rax, rdx
        ret
```
ARM:

```asm
add1_wrap(unsigned long):
        adrp    x1, wrap_at_
        add     x0, x0, 1
        ldr     x1, [x1, #:lo12:wrap_at_]
        cmp     x1, x0
        csel    x0, x0, xzr, ne
        ret
add_wrap(unsigned long):
        adrp    x1, wrap_at_
        add     x0, x0, 1
        ldr     x1, [x1, #:lo12:wrap_at_]
        subs    x1, x0, x1              // avoided in add1_wrap
        csel    x0, x1, x0, cs
        ret
```
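For a quick sanity check that the compact variant is semantically identical, both functions can be exercised against each other in a self-contained program (here `wrap_at_` is given an arbitrary illustrative value of 1000, since the real constant is defined elsewhere in the library):

```cpp
#include <cstddef>

using T = std::size_t;

// Illustrative stand-in for the library's extern constant.
const T wrap_at_ = 1000;

inline T _add_wrap(T x, T delta) {
    const T xp = x + delta;
    return xp >= wrap_at_ ? xp - wrap_at_ : xp;
}

// Specialized increment-and-wrap: one compare against the wrap point.
T add1_wrap(T x) { return ++x == wrap_at_ ? 0 : x; }

// General form, instantiated with delta == 1.
T add_wrap(T x) { return _add_wrap(x, 1); }
```

Both return the same result for every in-range index, including the wrap at `wrap_at_ - 1`.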
I have removed the `FORCEINLINE` macro in 2a3cd09aea8b908a78e27dc3a2e7a527e308abfe because 1) it's only used once and will be inlined even without any `inline` attribute, and 2) it causes linker errors with MinGW. `LIKELY` and `UNLIKELY` are also good candidates for trouble, and we already have `BOOST_LIKELY` and `BOOST_UNLIKELY`, so we might as well use those.
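For context, here is a minimal sketch of what such branch-hint macros amount to on GCC/Clang (Boost's `BOOST_LIKELY`/`BOOST_UNLIKELY` in `boost/config.hpp` provide the portable, already-maintained version of exactly this, which is the argument for reusing them):

```cpp
// Sketch of typical LIKELY/UNLIKELY definitions; on compilers without
// __builtin_expect they degrade to a plain expression.
#if defined(__GNUC__) || defined(__clang__)
#  define LIKELY(x)   __builtin_expect(!!(x), 1)
#  define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#  define LIKELY(x)   (x)
#  define UNLIKELY(x) (x)
#endif

// Example use: hint that the error path is rare.
int clamp_positive(int v) {
    if (UNLIKELY(v < 0))
        return 0; // rare path
    return v;
}
```

The hints only affect code layout, never semantics, so a fallback to `(x)` is always safe.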
Effectively merged in #135
@chkothe -- Should #71 be closed too?
Prompted by jfrey-xx's recent work on the `consumer_queue`, which revealed a concurrency issue with the SPSC queue at buffer overflow, and which traded lower CPU load at low sampling rates for higher load at high rates, I've spent a bit of time resolving an issue underlying that queue, so we can have the best of both worlds (high throughput and low CPU use). Namely, in our scenario there's an edge case where the producer becomes a second consumer, and thus we technically need an SPMC queue, not an SPSC queue.

In the current implementation this is addressed by some additional locks to protect against that scenario, but unfortunately it means that producer and consumer cannot make progress at the same time (i.e., they stall each other). This is only revealed by 2-thread benchmarks, which I believe we don't have in the benchmark suite yet (I added a benchmark script to my fork that can measure that).
I’ve looked around a bit to find an SPMC queue that would get the job done, and it turns out that there’s a pretty frequently used ring buffer scheme for that (used, among others, in xenium, in a queue by Rigtorp used in Frostbite, and in a few other pieces of software, with an original implementation by Vyukov; all of these are in fact MPMC). It’s actually faster than boost’s `spsc_queue` (because it doesn’t need to synchronize between reader and writer at all, unless they act on the same queue element). So I’ve adapted that for our `consumer_queue` (`try_push`/`try_pop`), in a simplified SPMC version, since we only need one producer (that gives a bit of extra performance).

I’ve done a bit of benchmarking, and on my machine (Ubuntu laptop) it comes out at >60% higher max end-to-end throughput between a `push_sample` and a `pull_sample` thread relative to the current lock-based variant (~4.8 MHz instead of ~2.9 MHz). With these changes, there’s only one (rarely contended and thus quite fast) lock left, around the `cv.notify_one()` for use with a wait timeout, but since it’s now faster than even the old lock-free variant, we should be OK for a while. (One could use optional locking to make the queue fully lock-free & wait-free again, except when it’s full or empty, but I’d need to do a bit more testing on that first.)

Other than that I’ve done a decent amount of testing, but of course we’ll want to let it simmer a bit and maybe write a few more unit tests before we can pass that on to users.