sccn / liblsl

C++ lsl library for multi-modal time-synched data transmission over the local network

Replace boost::spsc_queue by a fast SPMC queue #135

Closed. cboulay closed this 2 years ago.

cboulay commented 2 years ago

Please refer to #91 by @chkothe and the discussion there. This is simply the rebased version from @tstenner, with the latest master merged in.

I happen to have a need to test this, so I thought I'd set up the PR too.

@chkothe, I would prefer not to have this PR and instead use only the original PR in #91. However, I think anything you do to rewrite the history would invalidate the in-line conversations, and it would be a shame to lose those. I suppose you could stack on a bunch of revert commits and then add back the commits from Tristan's fork, but that would create a messy history in liblsl. I don't know of a perfect solution here; I leave it up to you.

cboulay commented 2 years ago

Using this branch... I'm pushing at a samplerate of 30 kHz, trying to see how many int16 channels I can support. To keep things simple, I'm setting max_buff to 1 second on both the outlet and inlet (I'm not sure which takes precedence when they disagree; I assume the larger).
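For reference, the setup looks roughly like the following minimal sketch against the standard lsl_cpp.h API. The stream name, chunk size, and exact overloads here are illustrative and written from memory, not the actual SendDataInChunks/ReceiveDataInChunks code:

```cpp
#include <lsl_cpp.h>

#include <cstdint>
#include <vector>

int main() {
    const int n_channels = 2500;   // swept in the tests described above
    const double srate = 30000.0;  // 30 kHz nominal rate
    const int max_buffered = 1;    // interpreted as seconds when the stream has a regular rate

    // Outlet side: declare an int16 stream and cap buffering at 1 s.
    lsl::stream_info info("BenchStream", "EEG", n_channels, srate, lsl::cf_int16);
    lsl::stream_outlet outlet(info, /*chunk_size=*/0, max_buffered);

    // Push one 10 ms chunk of multiplexed samples (channels interleaved per sample).
    std::vector<int16_t> chunk(n_channels * 300, 0);
    outlet.push_chunk_multiplexed(chunk);

    // Inlet side: resolve the stream by name and also cap buffering at 1 s.
    std::vector<lsl::stream_info> found = lsl::resolve_stream("name", "BenchStream");
    lsl::stream_inlet inlet(found[0], /*max_buflen=*/max_buffered);

    // Pull whatever has arrived so far into pre-sized flat buffers.
    std::vector<int16_t> recv(n_channels * 300);
    std::vector<double> stamps(300);
    std::size_t n = inlet.pull_chunk_multiplexed(
        recv.data(), stamps.data(), recv.size(), stamps.size());
    (void)n; // number of data elements actually pulled
    return 0;
}
```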

With both the outlet and inlet on localhost on my Windows machine (i5-8400), the maximum number of channels I can support is about 2500 before the inlet can't keep up. At that channel count, SendDataInChunks uses 3-4% CPU and 800 MB of memory. ReceiveDataInChunks uses 45-50% CPU and 900 MB of memory!

My MacBook Pro is quite old (late 2013, i7-4850 2.3 GHz). SendDataInChunks on its own takes 93% CPU with only 800 channels, so there's no point in testing a localhost inlet. If I run the inlet on my Windows PC instead, the connection maintains a solid 30 kHz with no problem.

iperf3 tells me that my WinPC --> Mac connection speed is 945 Mbit/s, so I'm well below network saturation at this point; I'm clearly CPU-bound on my Mac. The PC CPU is only supposed to be about 30% faster than the Mac CPU, so I wonder if there's something else going on.
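(For scale: 800 int16 channels at 30 kHz is 800 × 2 bytes × 30 000 samples/s ≈ 48 MB/s ≈ 384 Mbit/s before protocol overhead, comfortably under that 945 Mbit/s link.)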

I'll fire up a similar test on main.

cboulay commented 2 years ago

On main, similar test to above:

Win-localhost, SendDataInChunks at 2500 channels and a 30 kHz samplerate uses 10% CPU and similar RAM. ReceiveDataInChunks uses about 45% CPU, similar to the spmc-queue branch.

I won't bother testing the other configs.

So the spmc-queue branch brings the outlet's CPU usage from 10% down to about 4% on my Windows PC. That's pretty good. Speeding up the outlet is more relevant than speeding up the inlet, because it's the outlet that's most likely to run on an embedded device. I'll try to find some time this weekend to see what I can get out of my Raspberry Pi.

cboulay commented 2 years ago

(Mostly for my own reference) Raspberry Pi 4 (ARM Cortex-A72, 4 GB RAM) hits 100% CPU at around 400 channels @ 30 kHz. Sadly, iperf reported only 50 Mbit/s (over the same Ethernet cable/switch that I got 945 Mbit/s out of). When pulling on Windows, the transmit side fell behind if the origin channel count was higher than ~350 (still @ 30 kHz).

chkothe commented 2 years ago

Thanks for testing. I also have a pretty neat test script in my repo; let me dig that up tomorrow. This PR lays a nice foundation on top of which one can apply some additional, more aggressive optimisations when the time is right. I have one that replaces the remaining lock with a custom-tailored sync primitive, which I used to reach 10 MHz on a ThinkPad T495; however, I'd be hesitant to drop something like that into production without a convincing correctness proof or painful amounts of testing.

tstenner commented 2 years ago

> I have one that replaces the remaining lock with a custom-tailored sync primitive that I used to reach 10 MHz on a ThinkPad T495

C++20 has atomic_wait in the stdlib, and I've seen backports to C++11. This could replace the whole mutex + wait-on-condition_variable circus with a very fast write_idx_.wait(old_pos).
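For illustration, a minimal sketch of that idea under C++20; the write_idx_ name follows the comment above, and the surrounding functions are hypothetical rather than liblsl code:

```cpp
#include <atomic>
#include <cstdint>

// Shared ring-buffer write position (illustrative; not the actual liblsl member).
std::atomic<uint64_t> write_idx_{0};

// Consumer: block until the producer advances past the position we last saw.
// Replaces the mutex + condition_variable round trip with a single futex-backed wait.
void wait_for_new_data(uint64_t old_pos) {
    write_idx_.wait(old_pos, std::memory_order_acquire);
}

// Producer: publish a new sample and wake any consumers parked in wait().
void publish_sample() {
    write_idx_.fetch_add(1, std::memory_order_release);
    write_idx_.notify_all();
}
```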

chkothe commented 2 years ago

OK, as far as I'm concerned I'm good to go with that. @tstenner, any vetoes about merging this in? By the way, picking up the discussion from the other PR: thanks for the add_wrap optimizations and for checking that force-inlining isn't necessary. I really wish I had more time to do those cleanups!