Closed vsonnier closed 5 years ago
BTW, no crash when changing sample rate now on my side. It may not be perfect, but it's definitely a "less code, less bugs" trend here.
Let me put in the bug fixes to buf first. Then we have a clean base to do bigger restructuring.
@zuckschwerdt No problem, do as you wish, I can always re-merge my changes later.
In the meantime, I've measured the readStream duration on the CubicSDR side, to know whether iio_buffer_refill() is the bottleneck or not.
What I measure with the current code, having buffer_size = 8 * MTU at a sample rate of 10 Msps:
SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times out of 8. It means the encoding is fast enough :)
SoapySDR::Device::readStream takes 10-12 ms 1 time out of 8, i.e. the time it needs to call iio_buffer_refill().
==> So it takes 10 ms to fetch 65568 samples, i.e. a 6.5 ms long sample run at 10 Msps.
==> This is slower than realtime :(.
The good news is that another refill_thread wouldn't help in beating realtime.
Another attempt at 5 Msps:
SoapySDR::Device::readStream takes 0 ms (not measurable) 7 times out of 8.
SoapySDR::Device::readStream takes 13 ms 1 time out of 8.
==> So it takes 12-13 ms to fetch 65568 samples, i.e. a 13 ms long sample run at 5 Msps.
==> This explains why there is (almost) no crackling sound at that sample rate. I've also measured iio_buffer_refill() directly and it matches the SoapySDR::Device::readStream refill-is-called case.
So basically, irrespective of our processing power, iio_buffer_refill() is already slower than realtime beyond 5 Msps, on my machine at least.
@zuckschwerdt How high a sample rate could you reach on your side without crackling sound in CubicSDR?
I see three topics worth addressing individually here:
Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample; this isn't inlined and is expensive. Your changes for well-known cases will vectorize (we should confirm that). I wouldn't call it direct copy though; I'm using that term for the case which lends itself to DMA.
Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;) Testing the code gives (note buffer_size is in simple samples, not I/Q samples):
samplerate | n | buffer | latency | FPS |
---|---|---|---|---|
25000 | 17 | 4096 | 81.9 ms | 12.2 fps |
1000000 | 12 | 131072 | 65.5 ms | 15.3 fps |
2000000 | 11 | 262144 | 65.5 ms | 15.3 fps |
4000000 | 10 | 524288 | 65.5 ms | 15.3 fps |
6000000 | 9 | 1048576 | 87.4 ms | 11.4 fps |
8000000 | 9 | 1048576 | 65.5 ms | 15.3 fps |
10000000 | 8 | 2097152 | 104.9 ms | 9.5 fps |
I might have botched that table, but the code needs a cleanup and documentation anyway. What is the target latency or fps here? Is 60 fps / 17 ms what other drivers offer?
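For what it's worth, the table rows fit a simple power-of-two rule; here is a sketch of what set_buffer_size_by_samplerate might be computing (this formula is inferred from the table, not read from the code, so treat it as a guess):

```cpp
#include <cmath>
#include <cstddef>

// Guessed reconstruction of the sizing rule behind the table:
// n = floor(log2(2^32 / samplerate)), buffer = 2^(29 - n) simple samples.
// Matches every row above, but it is an inference, not the module's code.
static size_t buffer_size_for_rate(double samplerate)
{
    int n = (int)std::floor(std::log2(4294967296.0 / samplerate));
    return (size_t)1 << (29 - n);
}
```

Latency then follows as buffer / 2 / samplerate (the /2 turns simple samples into I/Q samples), e.g. 4096 / 2 / 25000 = 81.9 ms, matching the first row.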
Dropping the refill_thread. As you mention, the current code is just synchronized. On entry recv() simply blocks on the refill thread when the buffer is empty. But if we put

```cpp
if (!items_in_buffer) {
    please_refill_buffer = true;
    cond.notify_all();
}
```

at the bottom of recv() we get concurrency. recv() will then only block if you call it too fast. This lends itself to a read-process loop in the consumer. Not sure if that is the anticipated model for SoapySDR, or if a tight read loop (maybe even in a thread) is the preferred model?
The IIO internals guide has some information on buffers.
iio_buffer_push()/iio_buffer_refill() are just the high-level interfaces. In the background IIO will allocate multiple buffers which are used to implement a double/triple/...-buffering scheme that allows continuous streaming without losing samples. iio_buffer_push()/iio_buffer_refill() will only block if all buffers are in use. As soon as one buffer becomes available it will wake up and let the application read/write the buffer. Communication between libiio and the DMA core is highly asynchronous and mostly interrupt driven.
Note this function will by default create 3 buffer blocks. This default can be changed by using iio_device_set_kernel_buffers_count() prior to creating the buffer.
Buffers are allocated on the kernel side, and when calling iio_buffer_refill() the next buffer will be mapped into user space.
But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?
Using audio stutter in Cubic as a test, I can go up to 5M without refill_thread; 6M starts to stutter. With a 32K buffer size and refill_thread at the tail of recv() I can get 6M without stutter. But nothing above that is smooth.
Tightening the copy loop. iio_channel_read() calls iio_channel_convert() on each sample, this isn't inlined and expensive. Your changes for well known cases will vectorize (should confirm that).
Unless I'm wrong, the only thing to add compared to my direct_copy version would be swapping the bytes of the int16_t first, before converting them, to get the right endianness.
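A sketch of that idea (hypothetical names; whether the swap is actually needed depends on the device's wire endianness versus the host, which is just a flag here, and the 2048 full scale is an assumption):

```cpp
#include <cstddef>
#include <cstdint>

// Byte-swap one 16-bit sample.
static inline int16_t bswap16(int16_t v)
{
    uint16_t u = (uint16_t)v;
    return (int16_t)((uint16_t)(u << 8) | (u >> 8));
}

// Tight conversion loop: optional swap, then scale to float, with no
// per-sample iio_channel_convert() call.
static void convert_cs16_to_cf32(const int16_t *src, float *dst,
                                 size_t n_samples, bool swap)
{
    for (size_t i = 0; i < n_samples; ++i) {
        int16_t s = swap ? bswap16(src[i]) : src[i];
        dst[i] = (float)s / 2048.0f; // assumes 12-bit full scale
    }
}
```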
Default buffer size. I suspect the code is doing something like a log. A comment would be helpful ;) Testing the code gives..
I don't get that part. Which code? The target on CubicSDR is 60 fps (so under 17 ms of latency) so that the user can get a realtime display.
Dropping the refill_thread....
The code there is all wrong I'm afraid, and needs to be re-written based on condition variables (is_not_full, is_not_empty) principles, and must not mix in atomic variables in the process. And you need an additional byte buffer to accumulate the samples, because iio_buffer_refill() overwrites the buffer, which is why we could only call it when it is actually empty. I think.
I was just doing that based on
https://github.com/cjcliffe/CubicSDR/blob/master/src/util/ThreadBlockingQueue.h
when I realized it was useless: no matter how many cascading buffers you've got, if your sample generator, namely iio_buffer_refill() here, is no faster than realtime at producing samples, you will be starved waiting for samples if your application consumes them in a realtime manner. Like CubicSDR does.
You observe that with the 5M/6M limit on your side. 6M works also barely with my version, too.
Basically we have to find a buffer_size which maximizes iio_buffer_refill() throughput, or we may switch to another kind of I/O call that can assure realtime, if any.
For testing buffer sizes, the most practical way would be exposing the buffer length as a SoapySDR setting that can be changed before starting to stream. This would be done by implementing SoapySDR::ArgInfoList getSettingInfo, readSetting, writeSetting like other SoapySDR modules do.
The MTU size is not related to performance at all, but is a compromise between not-too-big reads, so that we can still assure 60 fps, and not-too-small ones, to prevent too many readStream calls.
For instance, we could choose something like MTU = min(buffer_size / 2, 8196)
But this might all only apply to code running on the Zynq. IIO abstraction through USB or network might always block as I gather?
I suspect the DMA optimization is only valid on the device own memory space as well. You'll always have to go through the USB pipe.
That would be all for this week-end for me. I think we made good progress already :)
Regards,
Vincent
The table is about the set_buffer_size_by_samplerate selected sizes and resulting fps/latency.
My comment about "Tightening the copy loop." was that it's a good thing to do and should go in as a separate PR. It's not a simple memcpy "direct copy", but a tighter loop that the compiler might vectorize.
It's not a simple memcpy "direct copy"
Call it "can_do_raw_buffer_handling" if you wish.
should go in as a separate PR
Too late! I'm already using it. And it works! :)
The table is about the set_buffer_size_by_samplerate selected sizes and resulting fps/latency.
Ah OK. My tests show that, at least on my machine, 131072 for set_buffer size is the fastest setting. The time iio_buffer_refill() takes to fetch this amount is around 10 ms, and practically independent of sample rate.
As your table shows, those are way-too-big buffers which indeed have to be filled and so present a big apparent "latency".
For instance, at 10 Msps fetching 2097152 samples, i.e. almost 1M I/Q samples, i.e. 100 ms worth of time, takes... well, 105 ms, close enough. You can't go faster than reality.
Now for more reasonable buffer sizes, a well-behaved iio_buffer_refill() would indeed buffer the continuous incoming samples while the application is doing anything else, and would only memcpy them to user space when called, resulting in a fast iio_buffer_refill() call.
Looks like that is not the case, because the time is almost 10 ms from 2 Msps to 10 Msps for the same amount of samples, as if iio_buffer_refill() would indeed only transfer them on demand.
Why did you pick 16384 as RX full scale? The data is 12 bit aligned to LSB. I.e. exclude the sign bit and you get -2048 to +2047. Scale that to -1 to 1 by dividing by 2048.
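In code that scaling is one shift pair plus a divide (a sketch; the arithmetic right shift used for sign extension is technically implementation-defined but universal in practice):

```cpp
#include <cstdint>

// Sign-extend the LSB-aligned 12-bit sample (bits [11:0]) and scale the
// resulting -2048..+2047 range to roughly [-1, 1) by dividing by 2048.
static inline float lsb12_to_float(int16_t raw)
{
    int16_t s = (int16_t)((int16_t)(raw << 4) >> 4); // drop junk in bits [15:12]
    return (float)s / 2048.0f;
}
```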
Too late! I'm already using it. And it works! :)
I favor putting distinct topics into distinct PRs. Just for the sake of others following this, the changelog, and perhaps bisecting changes if needed. (This assumes we squash merge. If you rebase interactive to distinct topic commits that would work too.) But that's only relevant when you merge. Trying things out in a single PR here is easier for now.
I guess to really nail a perfect buffer size or a formula depending on sample rate we need to test the various IIO abstractions: native, USB, network. My hunch is that native already has triple-buffering and would be best with small buffers. Network might be best aligned around the network MTU, and USB performs badly with small buffers in my experience.
Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers, BladeRF has 4k. They all return an MTU of the actual buffer length. I think returning MTU < buffer size is safe, but likely not optimal?
But that's only relevant when you merge. Trying things out in a single PR here is easier for now.
Yes, and we are in no hurry to merge in master. It is just a way to make an experimental branch so that others can add commits on it.
For now the changes are not that big. If the refill_thread is to make a glorious return, it would need to be remade from scratch anyway, so better get rid of the code now.
Please contribute here instead of to your own repo! Even if you create other intermediate PRs or simply branches, it will be easier to compare or merge them in this main canonical repo afterwards.
Btw. SDRPlay, HackRF and AirspyHF have fixed 64k buffers, BladeRF has 4k. They all return an MTU of the actual buffer length. I think returning MTU < buffers size is safe, but likely not optimal?
That is just fine apparently, when I tested. The only difference is that there will be one iio_buffer_refill() per readStream call.
For those following along: the getStreamMTU() suggestion stands anyway, because we can indeed specify any wanted size to readStream. In principle, the MTU is just the "best" size to choose.
Anyway, buffer_size = MTU = fixed is fine by me.
Using either a refill_thread (with or without the refill at the tail of recv), or no refill_thread, or even triple buffering in the refill thread, it always tops out at 6.3 Msps / 25.3 MBps for me (SoapyRateTest). In fact running a stripped-down version of ad9361-iiostream.c shows the same limit. I guess that's just what IIO can deliver over USB. With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8 we could perhaps achieve 10 Msps at reduced sample depth; not sure if that is worth supporting.
Thanks for the extensive testing. I guess we can definitely strip refill_thread and simply go synchronous single-threaded in recv. We don't even need atomic variables at all.
With SoapyPlutoSDR running on the Pluto and using SoapyRemote with CS8 we could perhaps achieve 10 Msps at reduced sample depth; not sure if that is worth to support.
Well, technically we could implement SOAPY_SDR_CS12 on the module side (3 bytes per complex sample), encoded this way according to SoapyRemote:

```cpp
//the 3 bytes (in being a uint8_t*)
uint16_t part0 = uint16_t(*(in++));
uint16_t part1 = uint16_t(*(in++));
uint16_t part2 = uint16_t(*(in++));
//the 2 resulting I/Q in int16_t format
int16_t i = int16_t((part1 << 12) | (part0 << 4));
int16_t q = int16_t((part2 << 8) | (part1 & 0xf0));
```

in ClientStreamData::convertRecvBuffs.
Apparently the SoapySDRServer can stream (packed) CS12, which could be unpacked on the application side to the more normal CS16/CF32. The Pluto module running on the Pluto itself would use some super-great DMA method to fetch samples.
6.3 Msps x 32 bits is 200 Mbit/s alone, and I've read somewhere that USB bandwidth is reserved for TX. If that is not the case, I don't see why the custom solution above could not go higher than 200 Mbit/s, since the theoretical limit is 480 Mbit/s for USB 2. Other devices can stream 10 Msps (Airspy, SDRPlay) but they are RX only.
Out of curiosity I tried SDRSharp with the PlutoSDR plugin, which apparently uses the network interface. Sample rates up to 10 Msps are available, but any setting > 5 Msps is marked "(not supported)", go figure.
@zuckschwerdt Charles (@cjcliffe) and I are in contact with Robin Getz of Analog Devices by e-mail; maybe we can ask him the maximum Msps we could get from the device before embarking on a lost quest.
Great! I briefly thought about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8 Msps with CS12 and 10 Msps at CS8; I'll explore that soon.
Are you ok with me splitting out the fast-copy changes and preparing that for merge? Then we can focus on buffer size and (non-)threading here.
Are you ok with me splitting out the fast-copy changes and preparing that for merge,
Excuse my French, do you mean just adding the fast-copy enhancements to master for now? Yes, do it, by all means! The best way is probably opening a new PR containing just that.
I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.
I just tested at 6.3 Msps in CubicSDR, on a 200 kHz FM stereo station. Same as you: at 6.4 Msps the crackling sound starts.
I already have CS12 working (just a hack for now). I'll add that after the fast-copy for testing.
So you really plan to run the module on the Pluto itself, together with a SoapySDRServer? Wow!
Great! I briefly though about CS12 too, but wasn't sure that's even an option in SoapyRemote. So 8Msps with CS12 and 10Msps at CS8, I'll explore that soon.
The client should support the conversion from CS12 over the network to CS16 or CF32: https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.hpp#L11
So if you advertise the native format as CS12, SoapyRemote should be ok. The caveat is that CS12 could be packed differently than expected, so that's something to look out for.
Cross-compiling and running stuff on the Pluto isn't hard. It's just a pain to set up the compiler. I have SoapySDR with module, GPSd, Chrony, rtl_433 and other stuff on there already. I might offer a binary package someday if there is some interest in this.
Will SoapyRemote always use the native format (CS16) for network? If the SoapyPlutoSDR module offers CS12 and CS8, and the client side requests that, will SoapyRemote on the server-side pick that for transport? I'll investigate in a day or so.
Okay, I can answer this now. If SoapyRemote is running on the Pluto and an application requests CS8 or CS12 then SoapyRemote is smart enough to use that over the network:
[INFO] Configured sender endpoint: dgram=1452 bytes, 714 elements @ 2 bytes, window=16 KiB
for CS8, and for CS12:
[INFO] Configured sender endpoint: dgram=1452 bytes, 476 elements @ 3 bytes, window=16 KiB
CS8 works right now and I also got working CS12 code on top of #16. PR soon.
Looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83 https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L214
I get the impression that CS12 is ii qI QQ (uppercase MSB).
@guruofquality @vsonnier Is this right? Is there any spec on CS12? It seems to work, looking at the waterfall in CubicSDR. But there is too much gain (noise floor is way too high).
@zuckschwerdt, looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L83 using pen-and-paper lets say types are described in [MSBit ... LSBit], part0, part1, part2 = [7 .. 0] and we want to make int16_t = [15 .. 0] for I and Q.
I get: I = [3..0 of part1, 7..0 of part0, zeroes], Q = [7..0 of part2, 7..4 of part1, zeroes]
Looks like I/Q are MSBit aligned here.
You got part0 and part1 swapped in the I line, right? Otherwise that's my take too.
@zuckschwerdt Right, corrected now. Looking at https://github.com/pothosware/SoapyRemote/blob/master/client/ClientStreamData.cpp#L214, it looks inconsistent because it would be LSBit aligned? Update: Never mind, it looks right.
Maybe the easiest way is to test: set I = 0x123 and Q = 0xABC as constants and see how it translates on the Cubic side.
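Such a check can even be done off-device; here is a round trip where the unpack side is the SoapyRemote code quoted earlier in this thread and the pack side is my reconstruction of its inverse (so the pack routine is an assumption, not SoapyRemote code):

```cpp
#include <cstdint>

// Pack one MSB-aligned I/Q pair into 3 bytes (reconstructed inverse of the
// SoapyRemote unpack code; the low 4 bits of each sample are dropped).
static void cs12_pack(int16_t i, int16_t q, uint8_t out[3])
{
    uint16_t ui = (uint16_t)i, uq = (uint16_t)q;
    out[0] = (uint8_t)(ui >> 4);                           // i[11:4]
    out[1] = (uint8_t)(((ui >> 12) & 0x0f) | (uq & 0xf0)); // {q[7:4], i[15:12]}
    out[2] = (uint8_t)(uq >> 8);                           // q[15:8]
}

// Unpack, following the SoapyRemote snippet quoted earlier.
static void cs12_unpack(const uint8_t in[3], int16_t *i, int16_t *q)
{
    uint16_t part0 = in[0], part1 = in[1], part2 = in[2];
    *i = (int16_t)((part1 << 12) | (part0 << 4));
    *q = (int16_t)((part2 << 8) | (part1 & 0xf0));
}
```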
@guruofquality @vsonnier Is this right? Is there any spec on CS12? It seems to work, looking at the waterfall in CubicSDR. But there is too much gain (noise floor is way too high).
I don't know that there is any spec. I just went with something that matched how you might pack 12 bits in an FPGA. So in Verilog: wire [23:0] packed24 = {q[15:4], i[15:4]};
If you were packing the individual bytes from a signed 16-bit number, it would look like this (which is where the masks and shifts in the C code come from):

```
char byte0 = i[11:4];
char byte1 = {q[7:4], i[15:12]};
char byte2 = q[15:8];
```

So that code is removing the lower 4 bits, the LSBs. So yes, it's MSB packed with I in the lower byte and the lower nibble of the second byte. Similar to the other formats where I is in the lower half of the number or the lower bytes (little endian).
@zuckschwerdt Maybe you just forgot to MSB-align the RX samples coming from the device when creating CS12. The CS12 upper bits would be zeroes in that case, with the original upper bits landing on the CS12 lower ones, creating a small dynamic range. Or something like that.
CS12 is ready for testing in #19 now. Fixing the refill_thread to allow testing was painful. It took #18 and two more fixes: b6753f4 and d685dc5. Glad to see the broken threading removed.
@zuckschwerdt I have rebased the streaming_improvements
branch over the current master, so on top of CS12 support.
I'll test soon. First notes: I'd expect the rx/tx streams to be independent. I.e. if I need two setupStream() calls to open streams, I would not expect one closeStream() to remove both?
@zuckschwerdt You are right. I fixed this, plus replaced shared_ptr by unique_ptr, because those streamers don't escape the parent object and should not, so shared_ptr is overkill or merely misleading here.
The Pluto is advertised as full-duplex. I don't know how much of that is true, perhaps independent control of both streams would be possible. E.g. running different rx/tx threads, with start/stop at different times.
Indeed. So I made a change to return 2 distinct SoapySDR::Stream* for RX and TX.
E.g. running different rx/tx threads, with start/stop at different times.
If you follow that path, you'll fall into the rabbit hole and soon realize that everything has to be locked by the same device_mutex, which has to cover both streaming and settings.
I've tried to put std::lock_guard<std::mutex> lock(device_mutex); into readStream but it had a noticeable performance impact.
The coward solution is to leave readStream unprotected. What about spin-locks, though? We only need to lock when "changing settings" and "streaming" are actually concurrent, which is not very often in practice.
Unfortunately they are not standard issue in C++11, but strangely it provides the std::atomic_flag construct:
https://en.cppreference.com/w/cpp/atomic/atomic_flag, complete with a spin-lock example.
So in the last commit, I made my own pluto_spin_mutex, used as the device_mutex itself, protecting everything.
readStream seems OK, as with the "unprotected" version; changing settings, though, seems slower. But I think this is not an issue.
Let's test and see how fast we can get with this safe version.
Part of the headache I had with the threading was concurrent access to stop() and recv(). There might still be a thread draining the buffer when stop() is called, and killing the buffer is a good way to crash while recv() is underway. Also something to look out for. Perhaps removing buf can be deferred to the destructor. But that also means once buf is set up it can't safely be changed.
There might still be a thread draining the buffer when stop()
Yes, that is the main issue. With the "lock everything" strategy, a running xxxStream operation can no longer be interrupted by a closeStream or similar, so the crash can't happen.
In the inverse situation, if a closeStream is underway, the xxxStream op will have to wait to proceed.
The only thing we need to assure is that the xxxStream op will cleanly do nothing when the terminating closeStream releases the lock.
Basically Settings and Streaming ops get interleaved properly, so I think we may change any buffer sizing at runtime and it would still work.
Sorry, I meant that with a lockless or atomic variant. With the mutex lock it guards nicely, but avoiding a mutex would probably yield better performance.
Well, it is called a mutex but it is just a CAS loop here, which will almost always succeed the first time anyway because actual contention is rare. I don't think we'll see any running performance impact on streaming itself compared to no lock at all.
The only visible effect would be that a settings change will wait for the current streaming operation to complete before taking effect. At that particular moment the spin-lock is likely to be slower than anything else because it wastes CPU cycles.
RX and TX are indeed quite independent, so I've made changes to have different RX and TX locks, plus some other cleanups.
Now we could have concurrent readStream and writeStream, in theory.
I think I fixed something in the process:
Previously, when I did Start/Stop device in Cubic over a SoapyRemote connection on localhost, it would crash both Cubic and the SoapySDRServer. Now it works.
Works very well and fixed spurious crash on start/stop for me.
Neither Clang 9 nor GCC 8 recognises std::make_unique, strangely enough not even in C++14 mode. I had to change that to

```cpp
this->rx_stream = std::unique_ptr<rx_streamer>(new rx_streamer(rx_dev, streamFormat, channels, args));
```

and

```cpp
this->tx_stream = std::unique_ptr<tx_streamer>(new tx_streamer(tx_dev, streamFormat, channels, args));
```

which I gather are equivalent, just not as nice looking.
My bad, std::make_unique is indeed C++14; let's stick to C++11 then.
Tested and works well for me. Can be merged (squash merge please) when you see fit, in my opinion.
Alea jacta est !
Hello Charles, @zuckschwerdt and @guruofquality.
Thanks to suggestions and tests by @zuckschwerdt, I've built on them by opening this PR to add improvements of my own together with the others. @guruofquality I don't know if @zuckschwerdt can push on this PR?
Anyway, here are the improvements:
- iio_buffers get their own buffer management.
- int16_t samples are converted directly into the destination format, without needing iio_channel_convert at all.

Of course, this is a work in progress: it could be that the refill_buffer op turns out to be too slow. I hope not.