Streaming improvements - Githubissues

vsonnier commented 5 years ago

Hello Charles, @zuckschwerdt and @guruofquality.

Thanks to suggestions and tests by @zuckschwerdt, I've incremented them by opening this PR to add improvements of my own together with the others. @guruofquality I don't know if @zuckschwerdt can push on this PR ?

Anyway, here are the improvements :

Completed the removal of the fill thread initiated by @zuckschwerdt : there is no need of buffer mutex at all finally, atomic counters on buffer sizes are enough. This is even simpler than SoapySDRPlay, since iio_buffers has their own buffer management.
Enabled "direct copy" for CS8 and CF32 (the format used by CubicSDR) because thanks to having the same endianness, you only have to scale and cast manually the int16_t into the destination format and don't need iio_channel_convert at all.
Removed LUT for CF32 for simplification. Given the SoapySDRPlay and others don't have it, I guess the eventual perfomance gain is not worth the additional code.

Of course, this is a work in progress:

I've not touched the TX part,
It may be entirely wrong,
I have not found the sweet spot for MTU vs. set_buffer_size vs. samplerate, and so far only managed 5Msps on CubicSDR without cracking sound.
Maybe the refill_thread would be needed after all to pre-buffer the samples, if the refill_buffer op turns out to be too slow. I hope not.

vsonnier commented 5 years ago

BTW, no crash by changing sample rate now on my side. It may not be perfect, but definitely a "less code, less bugs" trend here.

zuckschwerdt commented 5 years ago

Let me put in the bug fixes to buf first. Then we have a clean base to do bigger restructuring.

vsonnier commented 5 years ago

@zuckschwerdt No problem, do as you wish, I can always re-merge my changes later.

In the meantime, I've measured the readStream duration on CubicSDR side, to know if the iio_buffer_refill() is the bottleneck or not.

What I measure with the current code having: buffer_size = 8 * MTU with samplerate of 10Msps:

SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times / 8. It means the encoding is fast enough :)
SoapySDR::Device::readStream takes 10/12 ms 1 time / 8, i.e the time it needs to call iio_buffer_refill(). ==> So it takes 10 ms to fetch 65568 samples, i.e 6.5 ms long sample at 10Msps. ==> This is slower than realtime :(. The good news is that another refill_thread wouldn't help in beating realtime.

Another attempt at 5 Msps:

SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times / 8.
SoapySDR::Device::readStream takes 13 ms 1 time / 8. ==> So it takes 12/13 ms to fetch 65568 samples, i.e 13 ms long sample at 5Msps. ==> This explains why there is (almost) no crackling sound at that sample rate.

I've also measured iio_buffer_refill() directly and it matches the SoapySDR::Device::readStream refill-is-called case.

So basically irrespective of our processing power, iio_buffer_refill() is already slower than realtime beyond 5Msps on my machine at least.

@zuckschwerdt How high samplerate you could reach on your side without crackling sound on CubicSDR ?

zuckschwerdt commented 5 years ago

I see three topics worth addressing individualy here:

Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample, this isn't inlined and expensive. Your changes for well known cases will vectorize (should confirm that). Wouldn't call it direct copy though, I'm using that for the case which lends itself to DMA.

Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;) Testing the code gives (note buffer_size is in simple samples, not I/Q samples):

samplerate	n	buffer	latency	FPS
25000	17	4096	81.9 ms	12.2 fps
1000000	12	131072	65.5 ms	15.3 fps
2000000	11	262144	65.5 ms	15.3 fps
4000000	10	524288	65.5 ms	15.3 fps
6000000	9	1048576	87.4 ms	11.4 fps
8000000	9	1048576	65.5 ms	15.3 fps
10000000	8	2097152	104.9 ms	9.5 fps

I might have botched that table, but the code needs a cleanup and documentation anyway. What is the target latency or fps here? Is 60 fps / 17 ms what other drivers offer?

Dropping the refill_thread. As you mention the current code is just synchronized. On entry recv() simply blocks on the refill thread when the buffer is empty. But if we put

if (!items_in_buffer) {
    please_refill_buffer = true;
    cond.notify_all();
}

on the bottom of recv() we get concurrency. recv() will then only block if you call it too fast. This lends itself to a read-process loop in the consumer. Not sure if that is the anticipated model for SoapySDR, or if a tight read loop (maybe even in a thread) is the preferred model?

zuckschwerdt commented 5 years ago

The IIO internals guide has some information on buffers.

iio_buffer_push()/iio_buffer_refill() are just the high level interfaces. In the background IIO will allocate multiple buffers which are used to implement a double/triple/...-buffering scheme that allows to have continuous streaming without loosing samples. iio_buffer_push()/iio_buffer_refill() will only block if all buffers are in use. As soon as one buffer becomes available it will wakeup and let the application read/write the buffer. Communication between the libiio and the DMA core is highly asynchronous and mostly interrupt driven.

Note this function will create by default 3 buffer blocks. This default can be changed by unsing iio_device_set_kernel_buffers_count() prior in creating the buffer.

Buffers are allocated on the kernel side, and when calling iio_buffer_refill() the next buffer will be mapped into user space.

But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?

Using audio stutter in Cubic as test I can go up to 5M without refill_thread, 6M starts to stutter. With 32K buffer size and refill_thread at the tail of recv() I can get 6M without stutter. But nothing above is smooth.

vsonnier commented 5 years ago

Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample, this isn't inlined and expensive. Your changes for well known cases will vectorize (should confirm that).

Unless I'm wrong, the only thing to add compared to my direct_copy version would be swapping the bytes of the int16_t first before convert them to get the right endianness.

Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;) Testing the code gives..

I don't get that part. Which code ? The target latency on CubicSDR is < 60 fps so that the user can get a realtime display.

Dropping the refill_thread....

The code is all wrong there I'm afraid and need to be re-written based on var conds (is_not_full, is_not empty) principles and must not mix atomic variables in the process. And you need an additional bytes buffer to accumulate the samples, because iio_buffer_refill() overwrites the buffer, which is why we could only call it when it is actually empty. I think.

I was just doing that based on https://github.com/cjcliffe/CubicSDR/blob/master/src/util/ThreadBlockingQueue.h when I realized it was usless: no matter how many cascading buffers you got, if your sample generator namely iio_buffer_refill() here is no faster than relatime at producing samples, you will be on famine waiting for samples if your application comsume them in a realtime manner. Like CubicSDR does. You observe that with the 5M/6M limit on your side. 6M works also barely with my version, too.

Basically we have to find a buffer_size which maximize iio_buffer_refill() throughoutput or we may switch on another kind of I/O call that can assure relatime, if any.

For testing buffer sizes, the most practical way would be exposing the buffer length as a SoapySDR setting that can be changed before starting to stream. This would be done implementing SoapySDR::ArgInfoList getSettingInfo, readSetting, writeSetting like other SoapySDR module did.

The MTU size is not related to perfomance at all, but is a compromise between not too big reads that would not assure 60fps, and not too-small ones to prevent too many readStream calls. For instance, we could choose something like MTU = min ( buffer_size / 2, 8196)

But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?

I suspect the DMA optimization is only valid on the device own memory space as well. You'll always have to go through the USB pipe.

That would be all for this Week-end for me. I think we made good progress already 👍

Regards,

Vincent

zuckschwerdt commented 5 years ago

The table is about the set_buffer_size_by_samplerate selected sizes and resulting fps/latency.

zuckschwerdt commented 5 years ago

My comment about "Tightening the copy loop." was that it's a good thing to do and should go in as a separate PR. It's not a simple memcpy "direct copy", but a tighter loop that the compiler might vectorize.

vsonnier commented 5 years ago

It's not a simple memcpy "direct copy"

Call it "can_do_raw_buffer_handling" if you wish.

should go in as a separate PR

Too late ! I'm already using it. And it works ! :)

The table is about the set_buffer_size_by_samplerate selected sizes and resulting fps/latency.

Ah OK. My tests shows that at least on my machine 131072 for set_buffer size is the fastest setting. The time iio_buffer_refill() take to fetch this amount is around 10ms, and practically independent of sample rate. As your table shows, tose are way too big buffers which has indeed to be filled and so offer a big apparent "latency". For instance for 10Msps fetching 2097152 samples, i.e almost 1Ms I/Q, i.e 100 ms worth of time, takes... well 105 ms, close enough. You can't go faster than reality.

Now for more reasonable buffer sizes, a well behaved iio_buffer_refill() would indeed buffer the continuous incoming samples while the application is doing anything else, and would only memcpy them in the user space when called, resulting a fast iio_buffer_refill() call.

Looks like it is not the case, because the amount of time is almost 10 ms for 2Msps to 10 Msps for the same amount of samples, as if the iio_buffer_refill() would indeed only transfer them on-demand.

zuckschwerdt commented 5 years ago

Why did you pick 16384 as RX full scale? The data is 12 bit aligned to LSB. I.e. exclude the sign bit and you get -2048 – +2047. Scale that to -1 – 1 by dividing with 2048.

zuckschwerdt commented 5 years ago

Too late ! I'm already using it. And it works ! :)

I favor putting distinct topics into distinct PRs. Just for the sake of others following this, the changelog, and perhaps bisecting changes if needed. (This assumes we squash merge. If you rebase interactive to distinct topic commits that would work too.) But that's only relevant when you merge. Trying things out in a single PR here is easier for now.

I guess to really nail a perfect buffer size or a formula depending on sample rate we need to test various IIO abstraction: native, USB, network. My hunch is as native already has tripple-buffering and would be best with small buffers. Network might be best aligned around network MTU and USB performs badly with small buffers in my experience.

Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers, BladeRF has 4k. They all return an MTU of the actual buffer length. I think returning MTU < buffers size is safe, but likely not optimal?

vsonnier commented 5 years ago

But that's only relevant when you merge. Trying things out in a single PR here is easier for now.

Yes, and we are in no hurry to merge in master. It is just a way to make an experimental branch so that others can add commits on it. For now the changes are not that big. If the refill_thread is to make a glorious return, it would need to be remade from scratch anyway so better get rid of the code now. Please contribute here instead of your own repo ! Even if you create other intermediate PRs or simply branches it will be easier to compare or merge them in this main canonical repo afterwards.

Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers, BladeRF has 4k. They all return an MTU of the actual buffer length. I think returning MTU < buffers size is safe, but likely not optimal?

That is just fine apparently when I tested. The only difference is that there will be one iio_buffer_refill() per readStream call. For those who follow the getStreamMTU() suggestion anyway, because we can indeed specified any wanted size to readStream. In principle, the MTU is just the "best" size to choose. Whatever, buffer_size = MTU = fixed is fine by me.

zuckschwerdt commented 5 years ago

Using either a refill_thread (with or without the refill at tail of recv) or without the refill_thread or even with triple buffering in the refill thread it always tops out at 6.3 Msps / 25.3 MBps for me (SoapyRateTest). In fact running a stripped down version of ad9361-iiostream.c shows the same limit. I guess that's just what IIO can deliver using USB. With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8 we could perhaps achieve 10 Msps at reduced sample depth – not sure if that is worth to support.

vsonnier commented 5 years ago

Thanks for the extensive testing. I guess we can definitely strip refill_thread ans simply go synchroneous single-threaed in recv. We don't even need atomic variables at all.

With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8 we could perhaps achieve 10 Msps at reduced sample depth – not sure if that is worth to support.

Well technically we could implement SOAPY_SDR_CS12 on the module side, (3 bytes for complex samples) encoded that way, according to SoapyRemote:

//the 3 bytes (in being a uint8_t*)
 uint16_t part0 = uint16_t(*(in++));
 uint16_t part1 = uint16_t(*(in++));
uint16_t part2 = uint16_t(*(in++));
//the 2 resulting I/Q in a int_16 format.
int16_t i = int16_t((part1 << 12) | (part0 << 4));
int16_t q = int16_t((part2 << 8) | (part1 & 0xf0));

in ClientStreamData::convertRecvBuffs

Apparently the SoapySDRServer can stream (packed) CS12 which could be unpacked on the application side by the more normal CS16/CF32. The Pluto module on Pluto side would use on the Pluto itself some super-great DMA method to fetch samples.

6.3Msps x 32 bits is 200Mbit/s alone and I've read somewhere that USB bandwidth is reserved for TX. If it is not the case, I dont' see why the custom solution above could not go higher than 200 Mbit/s since the theoritical limit is 400Mb/s for USB2. Other devices can stream 10Msps (Airspy, SDRPlay) but they are RX only.

Out of curiosity I tried SDRSharp with the PlutoSDR plugin, which apparently use the Net interface. Samplerates up to 10Msps are available, but any setting > 5Msps is marked "(not supported)", go figure.

vsonnier commented 5 years ago

@zuckschwerdt Charles (@cjcliffe) and I are in contact with Robin Getz of Analog Devices by e-mail maybe we can ask him the maximum Msps we could get from the device before starting a lost quest.

zuckschwerdt commented 5 years ago

Great! I briefly though about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8Msps with CS12 and 10Msps at CS8, I'll explore that soon.

Are you ok with me splitting out the fast-copy changes and perparing that for merge, then we can focus on buffer size and (non-)threading here.

vsonnier commented 5 years ago

Are you ok with me splitting out the fast-copy changes and perparing that for merge,

Excuse my French, do you mean just adding the fast-copy enhancements to master for now ? yes, do it, by all means ! The best way is probably opening a new PR containing just that.

zuckschwerdt commented 5 years ago

I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.

vsonnier commented 5 years ago

I just tested at 6.3Msps on ClubicSDR, on a 200Kz FM stereo station. Same as you, with 6.4Msps starts the cracking sound.

I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.

So you really plan to run the module on the Pluto itself, together with a SoapySDRServer ? Wow. !

guruofquality commented 5 years ago

Great! I briefly though about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8Msps with CS12 and 10Msps at CS8, I'll explore that soon.

The client should support the conversion between CS12 over the network to CS16 or CF32: https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.hpp#L11

So if you advertise the native format as CS12, SoapyRemote should be ok. Caveat is that CS12 could be packed differently than expected, so thats something to look out for.

zuckschwerdt commented 5 years ago

Cross compiling and running stuff on the Pluto isn't hard. It's just a pain to setup the compiler. I have SoapySDR with module, GPSd, Chrony, rtl_433 and other stuff on there already. I might offer a binary package someday if there is some interest in this.

Will SoapyRemote always use the native format (CS16) for network? If the SoapyPlutoSDR module offers CS12 and CS8, and the client side requests that, will SoapyRemote on the server-side pick that for transport? I'll investigate in a day or so.

zuckschwerdt commented 5 years ago

Okay, I can answer this now. If SoapyRemote is running on the Pluto and an application requests CS8 or CS12 then SoapyRemote is smart enough to use that over the network:

[INFO] Configured sender endpoint: dgram=1452 bytes, 714 elements @ 2 bytes, window=16 KiB

for CS8, and for CS12:

[INFO] Configured sender endpoint: dgram=1452 bytes, 476 elements @ 3 bytes, window=16 KiB

CS8 works right now and I also got working CS12 code on top of #16. PR soon.

zuckschwerdt commented 5 years ago

Looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83 https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L214

I get the impression that CS12 is ii qI QQ (uppercase MSB). @guruofquality @vsonnier Is this right? Is there any spec on CS12? It seems to work, looking at the waterfall in CubicSDR. But there is too much gain (noise floor is way too high).

vsonnier commented 5 years ago

@zuckschwerdt, looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83 using pen-and-paper lets say types are described in [MSBit ... LSBit], part0, part1, part2 = [7 .. 0] and we want to make int16_t = [15 .. 0] for I and Q.

I get: I = [3 ..0 of part1, 7 .. 0 of part0, Zeroes] Q = [7 ..0 of part2, 7..4 or part1, Zeroes]

Looks like I/Q are MSBit aligned here.

zuckschwerdt commented 5 years ago

You got part0 and part1 swapped in the I line, right? Otherwise that's my take too.

vsonnier commented 5 years ago

@zuckschwerdt Right, corrected now. Looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L214, it looks inconsistent because it would be LSBit aligned ? Update: Nervermind, looks right.

Maybe the easiest way is to test so that I = 0x123 and Q= 0xABC = constants and see how it translates on Cubic side.

guruofquality commented 5 years ago

@guruofquality @vsonnier Is this right? Is there any spec on CS12? It seems to work, looking at the waterfall in CubicSDR. But there is too much gain (noise floor is way too high).

I dont know that there is any spec. I just went with something that matched how you might pack 12 bit in an FPGA. So in verilog wire [23:0] packed24 = {q[15:4], i[15:4]};

If you were packing the individual bytes from a signed 16-bit number, it would look like this (so that there the masks and shifts come from in the C code).

char byte0 = i[11:4];
char byte1 = {q[7:4], i[15:12]};
char byte2 = q[15:8];

So that code is removing the lower 4 bits, the LSB. So yes, its MSB packed with I being in the lower byte and lower nibble of the second byte. Similar to the other formats where I is in the lower half of the number or lower bytes (little endian).

vsonnier commented 5 years ago

@zuckschwerdt Maybe you just forgot to MSB align the RX samples coming from the device to create CS12. The CS12 upper bits would be zeroes in such case, with the original upper bits acting on CS12 lower ones creating a small dynamic. Or something like that.

zuckschwerdt commented 5 years ago

CS12 is ready for testing in #19 now. Fixing the refill_thread to allow testing was painful. It took #18 and two more fixes: b6753f4 and d685dc5 – glad to see the broken threading removed.

vsonnier commented 5 years ago

@zuckschwerdt I have rebased the streaming_improvements branch over the current master, so on top of CS12 support.

zuckschwerdt commented 5 years ago

I'll soon test. First notes: I'd expect the rx/tx streams to be independed. I.e. if I need two setupStream() calls to open streams I would not expect one closeStream() to remove both?

vsonnier commented 5 years ago

@zuckschwerdt You are right. I fixed this, plus replaced shared_ptr by unique_ptr because those streamers don't escape the parent object and should not, so shared_ptr is overkill or merely misleading here.

zuckschwerdt commented 5 years ago

The Pluto is advertised as full-duplex. I don't know how much of that is true, perhaps independent control of both streams would be possible. E.g. running different rx/tx threads, with start/stop at different times.

vsonnier commented 5 years ago

Indeed. So I made a change to return 2 distinct SoapySDR::Stream* for RX and TX.

vsonnier commented 5 years ago

E.g. running different rx/tx threads, with start/stop at different times.

If you follow that path, you'll fall into the rabbit hole and soon realize that everything have to be locked by the same device_mutex that have to lock both streaming and settings.

I've tried to put std::lock_guard<std::mutex> lock(device_mutex); into readStream but it had noticable performance impact.

The coward solution is to let readStream unprotected. What about spin-locks though, because we only need to lock when "changing settings" and "streaming" are actually concurrent, which is not very often in practice.

Unfortunatly they are not standard issue in C++11, but strangely it provides the std::atomic_flag construct: https://en.cppreference.com/w/cpp/atomic/atomic_flag, here complete with a spin-lock example.

So in the last commit, I made my own pluto_spin_mutex to use by device_mutex itself protecting everything. readStream seems OK as the "unprotected" version, changing settings though seem slower. But I think this is not an issue. Lets test and see how fast we can get with this safe version.

zuckschwerdt commented 5 years ago

Part of the headache I had with the threading was concurrent access to stop() and recv(). There might still be a thread draining the buffer when stop() is called, and killing the buffer is a good way to crash while recv() is underway. Also something to look out for. Perhaps removing buf can be defered to the destructor. But that also means once buf is set up it can't safely be changed.

vsonnier commented 5 years ago

There might still be a thread draining the buffer when stop()

Yes that is the mean issue. With the "lock everything" strategy, a xxxStream operation running can no longer be interrupted by a closeStream or similar, so the crash can't happen. In the inverse situation if a closeStream is underway, the xxxxStream op will have to wait to proceed. The only thing we need to assure is that xxxxStream op will cleanly do nothing when the terminating closeStream release the lock.

Basically Settings and Streaming ops get interleaved properly, so I think we may change any buffer sizing at runtime and it would still work.

zuckschwerdt commented 5 years ago

Sorry, I meant that with a lockless or atomic variant. With the mutex lock it guards nicely, but avoiding a mutex would probably yield better performance.

vsonnier commented 5 years ago

Well it is a called a mutex but here is just a CAS loop, which will almost always succeed the first time anyway because the actual contention is rare. I don't think we'll see any runing performance impact on streaming itself compared to no lock at all.

The only visible effect would be that a changing setting will wait for the current streaming operation to complete to take effect. At that particular time the spin-lock is likely to be slower than anything else because of its wasting of CPU cycles.

vsonnier commented 5 years ago

RX and TX are indeed quite independent, so I've made changes to have different RX and TX locks, plus some other cleanups. Now we could have concurrent readStream and writeStream in theory. I think I fixed something in the process: Previously when I was doing Start/Stop device in Cubic, on SoapyRemote connexion on localhost, it would crash both Cubic and the SoapySDRServer. Now it works.

zuckschwerdt commented 5 years ago

Works very well and fixed spurious crash on start/stop for me. Neither Clang-9 nor GCC-8 recognise std::make_unique, strangely enough not even in C++14 mode. I had to change that to

    this->rx_stream = std::unique_ptr<rx_streamer>(new rx_streamer(rx_dev, streamFormat, channels, args));

and

    this->tx_stream = std::unique_ptr<tx_streamer>(new tx_streamer(tx_dev, streamFormat, channels, args));

which I gather are equivalent, just not so nice looking.

vsonnier commented 5 years ago

My bad, std::make_unique is indeed C++14, let's stick on C++11 then.

zuckschwerdt commented 5 years ago

Tested and works well for me. Can be merged (squash merge please) when you see fit, in my opinion.

vsonnier commented 5 years ago

Alea jacta est !

pothosware / SoapyPlutoSDR

Streaming improvements #14