Closed by cboulay 2 years ago
Using this branch...
I'm pushing at a sample rate of 30 kHz, trying to see how many int16 channels I can support. To keep things simple, I'm setting `max_buffered` to 1 second on both the outlet and inlet (I'm not sure which takes precedence when they disagree; I assume the larger).
With outlet and inlet both on localhost on my Windows machine (i5-8400), the maximum number of channels I can support is about 2500 before the inlet can't keep up. At that channel count, `SendDataInChunks` uses 3-4% CPU and 800 MB of memory; `ReceiveDataInChunks` uses 45-50% CPU and 900 MB of memory!
My MacBook Pro is quite old (late 2013, i7-4850 2.3 GHz). `SendDataInChunks` on its own takes 93% CPU with only 800 channels, so there's no point in testing a localhost inlet there. If I run the inlet on my Windows PC instead, the connection maintains a solid 30 kHz, no problem.
iperf3 tells me that my Windows PC --> Mac connection speed is 945 Mbit/s, so I'm well below network saturation at this point; I'm clearly CPU-bound on my Mac. The PC's CPU is only supposed to be about 30% faster than the Mac's, so I wonder if there's something else going on.
I'll fire up a similar test on `main`.
On `main`, similar test to above:

- Win-localhost: `SendDataInChunks` at 2500 channels, 30 kHz sample rate, uses 10% CPU, similar RAM.
- `ReceiveDataInChunks` uses about 45% CPU, so similar to the `spmc-queue` branch.

I won't bother testing the other configs.
So the `spmc-queue` branch brings outlet CPU usage from 10% down to about 4% on my Windows PC. That's pretty good. Speeding up the outlet is more relevant than speeding up the inlet, because it's the outlet that's most likely to run on an embedded device. I'll try to find some time this weekend to see what I can get out of my Raspberry Pi.
(Mostly for my own reference:) A Raspberry Pi 4 (ARM Cortex-A72, 4 GB RAM) hits 100% CPU at around 400 channels @ 30 kHz. Sadly, iperf reported only 50 Mbit/s (over the same ethernet cable/switch that gave me 945 Mbit/s earlier). When pulling on the Windows side, Tx fell behind if the origin channel count was higher than ~350 (still @ 30 kHz).
Thanks for testing - I also have a pretty neat test script in my repo; let me dig that up tomorrow. This PR lays a nice foundation on top of which one can apply some additional, more aggressive optimisations when the time is right. I have one that replaces the remaining lock with a custom-tailored sync primitive that I used to reach 10 MHz on a ThinkPad T495 -- however, I'd be hesitant to drop something like that into production without a convincing correctness proof or painful amounts of testing.
> I have one that replaces the remaining lock with a custom-tailored sync primitive that I used to reach 10 MHz on a ThinkPad T495
C++20 has `atomic_wait` in the stdlib, and I've seen backports to C++11. This could replace the whole `mutex` + wait-on-`condition_variable` circus with a very fast `write_idx_.wait(old_pos)`.
Ok, as far as I'm concerned I'm good to go with that - @tstenner, any vetoes about merging this in? Btw, picking up the discussion from the other PR: thx for the `add_wrap` optimizations and for checking that force-inlining isn't necessary. Really wish I had more time to do those cleanups!
Please refer to #91 by @chkothe and the discussion there. This is simply @tstenner's rebased version with the latest master merged in.
I happen to have a need to test this, so I thought I'd set up the PR too.
@chkothe, I would prefer not to have this PR and instead use only the original PR in #91. However, I think anything you do to rewrite the history would invalidate the in-line conversations, and it would be a shame to lose those. I suppose you could stack on a bunch of `revert` commits and then add back the commits from Tristan's fork, but that would create a messy history in liblsl. I don't know of a perfect solution here; I leave it up to you.