Hi @kouchy ,
Thanks for getting in touch. Good timing for these questions, as I'm working on benchmarking the project's BCH and LDPC decoders while comparing them against the aff3ct implementation. I have started with the BCH decoder, and my WIP is here: https://github.com/igorauad/gr-dvbs2rx/tree/aff3ct. I might have some time to work a bit more on this task over the weekend.
So far, it seems to me the aff3ct implementation is faster. However, it uses too much memory. I tried the std, fast, and genius implementations, but I think this member variable is overusing memory: https://github.com/aff3ct/aff3ct/blob/master/src/Module/Decoder/BCH/Standard/Decoder_BCH_std.cpp#L18. I haven't looked into the implementation carefully yet, but I'm planning to do that soon. I'd assume elp stands for "error location polynomial"? Is it really necessary to store N vectors of size N?
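For context, if elp is indeed the error location polynomial, the textbook Berlekamp-Massey iteration only keeps the current and previous locator polynomials (O(t) coefficients each), so an N x N table should not be fundamental. Here is a rough Python sketch of that two-polynomial formulation (illustrative only, not the aff3ct code; the GF(2^m) table construction is simplified):

def build_gf_tables(m, prim_poly):
    # Antilog/log tables for GF(2^m), built from a primitive polynomial.
    size = (1 << m) - 1
    antilog, log = [0] * (size + 1), [0] * (size + 1)
    x = 1
    for i in range(size):
        antilog[i], log[x] = x, i
        x <<= 1
        if x & (1 << m):
            x ^= prim_poly
    return antilog, log

def berlekamp_massey(syndromes, m, prim_poly):
    # Returns the error-locator polynomial (lowest-order coefficient first).
    antilog, log = build_gf_tables(m, prim_poly)
    size = (1 << m) - 1

    def gf_mul(a, b):
        return 0 if a == 0 or b == 0 else antilog[(log[a] + log[b]) % size]

    elp = [1]              # current locator polynomial C(x)
    prev = [1]             # B(x): copy of C(x) from the last length change
    L, shift, b = 0, 1, 1  # LFSR length, x^shift factor, last discrepancy
    for n, s in enumerate(syndromes):
        d = s  # discrepancy d = s_n + sum_{i=1..L} c_i * s_{n-i}
        for i in range(1, L + 1):
            if i < len(elp):
                d ^= gf_mul(elp[i], syndromes[n - i])
        if d == 0:
            shift += 1
            continue
        old = elp[:]
        coef = gf_mul(d, antilog[(size - log[b]) % size])  # d / b in GF(2^m)
        for i, pc in enumerate(prev):  # C(x) -= (d/b) * x^shift * B(x)
            idx = i + shift
            if idx >= len(elp):
                elp.extend([0] * (idx - len(elp) + 1))
            elp[idx] ^= gf_mul(coef, pc)
        if 2 * L <= n:
            L, prev, b, shift = n + 1 - L, old, d, 1
        else:
            shift += 1
    return elp

Only two polynomials are alive at any time, which is why the N x N elp member looks suspicious to me.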
I was able to test BCH codes used with short FECFRAMEs, which I suppose are within GF(2^14). However, I couldn't test the BCH configurations corresponding to normal FECFRAMEs.
In the previously published paper, we said that gr-dvbs2rx was not designed for high throughput (is it really true?)
I'll let @drmpeg comment on the original intentions. My goal when working to make gr-dvbs2rx a fully-functional receiver was mainly to test it with the Blockstream Satellite signal. We've been working with the LeanDVB implementation, and one of the goals was to make gr-dvbs2rx more efficient than leandvb at some point. At the moment, the two implementations have comparable CPU usage, but I continue to work on gr-dvbs2rx. The BCH decoder is fairly inefficient at the moment and does not use SIMD (unlike the LDPC decoder). Hence, I'm investigating substituting it with aff3ct's BCH decoder. I think that would be the easiest CPU gain right now.
That being said, I should note the Blockstream Satellite signal is only 1 Mbaud, so it's not a relatively wideband DVB-S2 carrier. Nevertheless, having a fast implementation is absolutely a design goal for us. Ideally, we would run the DVB-S2 receiver with an RTL-SDR on a Raspberry Pi and still run other heavy applications simultaneously. We can sort of do that now, but we need to tune down the max number of LDPC iterations.
how can I bench the throughput and the latency of the Rx? Do you have some advice? I guess I don't need to use a real radio for this comparison; I just want to bench the digital receiver part, so prerecorded samples should be OK.
The simple way I can think of would be measuring the time it takes to decode an IQ recording. You can make an IQ recording using dvbs2-tx if you don't have one (I was also planning to upload some). Then, you can decode the IQ file with dvbs2-rx --source file --in-file xxx.
I have followed GNU Radio for some time, but I'm still a newbie, and I don't really understand whether GNU Radio uses multi-threading (I think so) and, if yes, how it performs multi-threading?
I think @drmpeg will know better. But yes, GNU Radio is based on multi-threading. Each block spawns a thread, so each of the blocks in this project's Rx pipeline runs on an independent thread. See the Rx pipeline at: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L606
I don't know many more details about the multi-threading implementation. I'd assume you can find enough info on the GNU Radio project about the scheduler. Also, I'd imagine there are GRCon talks available on YouTube.
For instance, can the granularity of the GNU Radio multi-threading be inside the DVB-S2 Rx Hierarchical Block or not? (if not, the AFF3CT approach could be complementary to GNU Radio)
Note the main Rx app (dvbs2-rx) does not use the hierarchical block. The hierarchical block is just a convenient wrapper for the example flowgraphs, which are more tailored to experimentation than production usage. In contrast, in the dvbs2-rx app, we instantiate the blocks individually. I decided this is better because it allows for a lot more flexibility. For example, you can choose between two distinct implementations of the symbol synchronizer from the command line (see https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L665). Also, the user may choose to stop the pipeline after the BBFRAME descrambler and output the descrambled BBFRAME stream instead of the MPEG TS stream (see https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L640).
I think the best way to use aff3ct on this project is by replacing the LDPC decoder and/or BCH decoder. That's what I'm planning to do if I find that the aff3ct implementation is faster.
The benchmarking apps are compiled with the option BENCHMARK_FEC, discussed here: https://github.com/igorauad/gr-dvbs2rx/blob/aff3ct/docs/installation.md#build-options
Do you have some advice to compile the best version (in terms of highest possible throughput) of gr-dvbs2rx?
Definitely enable NATIVE_OPTIMIZATIONS. It is off by default because that makes the implementation more portable, especially when building binary packages. The best SIMD implementation for the LDPC decoder is decided at runtime, see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/lib/ldpc_decoder_cb_impl.cc#L651. Hence, if you compile a package on a machine with AVX2 and run the package on another machine without AVX2, it will still work. In contrast, if you enable the NATIVE_OPTIMIZATIONS option, the project will be compiled with -march=native and will only work on your machine. I believe that is what you want.
On which platform do you think it performs better (AMD AVX2 or Intel AVX-512 servers)?
There is no support for AVX-512 at the moment. AVX2 is the best available SIMD instruction set.
In the beginning, AFF3CT was designed to be an ECC toolbox (with fast decoder implementations). After that it grew, and the need to make SDR systems arrived... In the end, we don't want to make a clone of GNU Radio: it would be nonsense.
Thanks for the great work on aff3ct. We've been following and using it for a long time, since our initial version Blockstream Satellite v1.0 (still available at https://github.com/Blockstream/gr-blocksat, but no longer used).
I'd vote for having aff3ct as efficient as possible on ECC implementations (both in CPU and memory) while letting GNU Radio do the rest :) Also, I wonder if aff3ct does runtime detection of SIMD capabilities or defines that at compile time. Could you confirm?
I was also aware of your aff3ct/dvbs2 project. However, I've never had a chance to run it. Would it support QPSK 3/5 with normal FECFRAMEs and pilot symbols? Seems like another nice alternative for Blockstream Satellite.
Cheers
Hi @igorauad,
Thank you very much for this complete answer, it is well appreciated!
I did not work a lot on optimizing the BCH decoder. To be honest, it is also one of the limiting factors in our implementation... So, do not expect a big improvement compared to the BCH decoder you are using in this project. I did not work personally on it, and as far as I know, it is a modified version of the Morelos-Zaragoza decoder (the original version can be found here: http://www.eccpage.com/). We asked Morelos-Zaragoza for permission to integrate it into the AFF3CT toolbox. We did not try to reduce its memory footprint, so I guess you might be right when you say it consumes too much memory...
However, I worked on efficient implementations of a generic demodulator and LDPC decoder; these implementations could be faster than the ones you are using (maybe). You can see some measured throughput performances in Table II of the paper we published (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).
Thank you for the links to the source code, it is helpful :-). From what I understand, GNU Radio spawns a pipeline stage for each block. In the AFF3CT DSEL (~= runtime), we propose a different approach.
What I guess is that there are interesting methods in our approach; if this is confirmed, we could then transpose these methods to the GNU Radio runtime. Another important aspect of the AFF3CT DSEL is that it proposes loop and branch (if, switch) mechanisms. I'm not able to see how to do that in the GNU Radio Companion. Do you think it is possible to model loops in GNU Radio flow graphs?
In our DVB-S2 use case, we are targeting many-core CPUs; this type of machine has a lot of memory, so in our case memory was not a constraint... However, in future work, I would like to focus on low-power systems (with less memory).
To answer your questions:
I'd assume elp stands for "error location polynomial"? Is it really necessary to store N vectors of size N?
I don't know :-/.
Also, I wonder if aff3ct does runtime detection of SIMD capabilities or defines that at compile time. Could you confirm?
In AFF3CT, the SIMD capabilities are defined at compile time. There is no runtime detection at this time.
I was also aware of your aff3ct/dvbs2 project. However, I've never had a chance to run it. Would it support QPSK 3/5 with normal FECFRAMEs and pilot symbols? Seems like another nice alternative for Blockstream Satellite.
At this time, if you use the project directly without any code modification, the AFF3CT DVB-S2 Tx/Rx only supports 3 MODCODs: QPSK 3/5, QPSK 8/9, and 8-PSK 8/9, with short frames (16200-bit LDPC codewords). Yes, it supports pilot symbols (@rtajan, can you confirm?). However, with some minor modifications to the source code, the project should be able to support more MODCODs (there is, from my point of view, no limitation on this).
Best.
Hi @igorauad,
I made some benchmarks on an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz (8 cores), in the error-free zone.
I compiled the code with -DNATIVE_OPTIMIZATIONS=ON.
Here is the command I used to run the Rx:
dvbs2-rx --modcod qpsk8/9 --frame-size short --source file --in-file samples.iq --sink file --out-file out.ts
I obtained an information throughput of 1.64 Mb/s.
Do you think this is near the expected throughput, or did I miss something?
When I add --log-stats to the command line, the following log is repeated:
gr::log 2022-03-18 15:45:02,530 :INFO: {'lock': True, 'snr': 26.21209716796875, 'plsync': {'coarse_freq_corr': True, 'freq_offset_hz': 0.0009667706635241302, 'frame_count': {'processed': 8670, 'rejected': 0, 'dummy': 0}, 'locked_since': '2022-03-18T15:43:51.523712'}, 'fec': {'frames': 8640, 'errors': 0, 'fer': 0.0, 'avg_ldpc_trials': 0}, 'mpeg-ts': {'packets': 81298, 'per': 0.0}}
Thank you in advance.
Hi @kouchy ,
Apologies for the delay in replying to your earlier message.
as far as I know, it is a modified version of the Morelos-Zaragoza decoder (the original version can be found here: http://www.eccpage.com/).
Thanks for the link. I will check it out.
However, I worked on efficient implementations of a generic demodulator and LDPC decoder; these implementations could be faster than the ones you are using (maybe). You can see some measured throughput performances in Table II of the paper we published (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).
Very interesting, thanks for sharing. Yes, there is a lot of room for improvement here in terms of demodulation and decoding.
The blocks that I worked on the most are the PL Sync (low-PHY frame/frequency/phase recovery) and the symbol synchronizer. The former relies heavily on libvolk and is quite fast. The latter is at least a lot faster than the in-tree symbol synchronizer block from GNU Radio. However, the synchronizer is still one of the most expensive blocks, because it processes samples (an oversampled sequence), not symbols, and because it is a bit hard to vectorize (it is a feedback loop). I'm still planning to improve it further by making better use of SIMD (e.g., a direct SIMD implementation instead of calling Volk) and by trading estimation accuracy for lower CPU usage. But I'll only do so after I work on BCH, since, as I said, BCH performance is now the lowest-hanging fruit to achieve better CPU usage.
Thank you for the links to the source code, it is helpful :-). From what I understand, GNU Radio spawns a pipeline stage for each block. In the AFF3CT DSEL (~= runtime), we propose a different approach.
Unfortunately, I don't understand the details of the GNU Radio scheduler. @marcusmueller might be able to help you and indicate where to look for further details in how the GR approach differs from yours. Also, perhaps check out his talk here: https://www.youtube.com/watch?v=cTGxhsSvZ9c. Marcus, is this still a relatively up-to-date talk or is there any more recent material?
--
Now, regarding your second comment:
I obtained an information throughput of 1.64 Mb/s. Do you think this is near the expected throughput, or did I miss something?
The main parameter I think you are missing is --sym-rate (or -s for short). When reading the IQ recording from a file, gr-dvbs2rx simulates the real-time throughput of a regular receiver running at a specific symbol rate, although I recognize it might be better to make this an option that is disabled by default. The throttling happens here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L474. So try with a very high symbol rate and see how it goes.
A throughput of 1.64 Mbps does seem low. For QPSK 8/9, the spectral efficiency is 1.766451 bits/sec/Hz, so 1.64 Mbps corresponds to only roughly 930 kbaud. I've been running gr-dvbs2rx at 1 Mbaud on multiple machines with relatively low CPU usage. Hence, it doesn't seem like 930 kbaud is a reasonable limit.
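For reference, the arithmetic (a quick Python check of the numbers above):

eta = 1.766451           # bits/s/Hz for QPSK 8/9
baud = 0.93e6            # ~930 kbaud
print(eta * baud / 1e6)  # information rate ~1.64 Mbps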
Also, I don't know how fast the file source block is when the IQ source (option --source) is set to file. In my tests at 1 Mbaud, I'm using the RTL-SDR source, so the file source block is not used (see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L450). The other alternative is the file descriptor source, which is used with --source=fd (the default). For example, you can try:
cat samples.iq | dvbs2-rx --modcod qpsk8/9 --frame-size short --sink file --out-file out.ts
However, I suspect this won't make a difference, and I'm hoping the file input is not a bottleneck. But since you are comparing with the aff3ct receiver, just bear this interface in mind.
Also, on the PL Sync block, there is a minor optimization when --pilots is set to on/off instead of auto. The difference is that, in this case, the block does not need to decode the PLSC. It already has the PLSC a priori, so it only needs to search for the frame location. However, the CPU usage difference will be minimal, since this is not a very expensive computation anyway.
Of course, as you surely know, everything depends on the SNR. If the IQ recording you have has low SNR, then there will be more LDPC iterations, and the PL Sync block would possibly do more work if it ever loses frame sync. However, as far as I can tell, the IQ recording you are using is pretty clean, as I see 26 dB SNR on the --log-stats log. So that would not be the reason for the low performance.
For your reference, here is the CPU usage printed by top for the dvbs2-rx threads when I read an IQ file. Note how BCH is currently the bottleneck.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18609 root 20 0 1796444 128256 67424 R 99.7 0.8 0:25.64 bch_decoder_bb1
18608 root 20 0 1796444 128256 67424 R 38.9 0.8 0:15.62 ldpc_decoder_c1
18606 root 20 0 1796444 128256 67424 S 13.0 0.8 0:03.40 symbol_sync_cc1
18595 root 20 0 1796444 128256 67424 S 8.3 0.8 0:02.22 deinterleave2
18604 root 20 0 1796444 128256 67424 S 5.6 0.8 0:01.46 agc_cc16
18607 root 20 0 1796444 128256 67424 S 5.3 0.8 0:01.45 plsync_cc19
18605 root 20 0 1796444 128256 67424 S 3.3 0.8 0:00.97 rotator_cc17
18602 root 20 0 1796444 128256 67424 S 3.0 0.8 0:00.72 float_to_comple
18600 root 20 0 1796444 128256 67424 S 2.3 0.8 0:00.57 add_const_ff5
18601 root 20 0 1796444 128256 67424 S 2.3 0.8 0:00.59 multiply_const_
18603 root 20 0 1796444 128256 67424 S 2.3 0.8 0:00.65 throttle10
18596 root 20 0 1796444 128256 67424 S 2.0 0.8 0:00.48 uchar_to_float4
18597 root 20 0 1796444 128256 67424 S 2.0 0.8 0:00.53 add_const_ff6
18598 root 20 0 1796444 128256 67424 S 1.7 0.8 0:00.55 multiply_const_
18599 root 20 0 1796444 128256 67424 S 1.7 0.8 0:00.49 uchar_to_float3
18594 root 20 0 1796444 128256 67424 S 1.3 0.8 0:00.45 file_descriptor
18610 root 20 0 1796444 128256 67424 S 0.7 0.8 0:00.15 bbdescrambler_1
18611 root 20 0 1796444 128256 67424 S 0.3 0.8 0:00.10 bbdeheader_bb15
At some point, I would like to make a script to automate the process of running dvbs2-rx and measuring both the number of recovered/decoded bytes and the time it took to process the IQ recording. Just a simple script, but I haven't had the time to implement it yet.
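A rough sketch of what such a script could look like (a hypothetical helper, not in the repo, reusing the dvbs2-rx flags from earlier in this thread and assuming the file-based throttling is bypassed, e.g., with a high --sym-rate as discussed above):

import os
import subprocess
import time

def bench(iq_file, modcod="qpsk8/9", frame_size="short", out_ts="out.ts"):
    # Run dvbs2-rx over the recording and time it end to end.
    cmd = [
        "dvbs2-rx", "--modcod", modcod, "--frame-size", frame_size,
        "--source", "file", "--in-file", iq_file,
        "--sink", "file", "--out-file", out_ts,
    ]
    start = time.time()
    subprocess.run(cmd, check=True)
    elapsed = time.time() - start
    ts_bytes = os.path.getsize(out_ts)  # recovered MPEG TS bytes
    print(f"{ts_bytes} bytes in {elapsed:.1f} s -> "
          f"{8 * ts_bytes / elapsed / 1e6:.2f} Mbps")

bench("samples.iq")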
Lastly, if you spend some time profiling and searching where dvbs2-rx is slowing down the most, I would be very interested in the results.
Thanks a lot for sharing your results.
Hi @igorauad,
Thank you very much for your precise and exhaustive answers.
I removed the throttling for the benchmarks, and now I obtain an information throughput of 7.4 Mb/s. It is better :-).
Thx for the link to the talk. I listened to this talk before, and I'm now aware of the GNU Radio newsched project. I found that many of the ideas and conclusions are similar to what we did in the AFF3CT DSEL+runtime. To the best of my knowledge, there are also different approaches, which I will detail in the paper I'm writing.
However, I obtain different block performance. I'm still in the error-free zone, and here is my top output:
top - 12:00:30 up 2 days, 1:10, 3 users, load average: 1,74, 0,67, 0,52
Threads: 473 total, 4 running, 469 sleeping, 0 stopped, 0 zombie
%Cpu(s): 47,7 us, 12,2 sy, 0,0 ni, 39,7 id, 0,0 wa, 0,0 hi, 0,4 si, 0,0 st
MiB Mem : 31893,8 total, 10461,7 free, 644,1 used, 20787,9 buff/cache
MiB Swap: 32768,0 total, 32714,2 free, 53,8 used. 30747,4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
59159 root 20 0 1823240 138576 76736 R 99,0 0,4 0:10.10 deinterleave2
59163 root 20 0 1823240 138576 76736 S 50,2 0,4 0:05.10 uchar_to_float3
59160 root 20 0 1823240 138576 76736 S 49,8 0,4 0:05.07 uchar_to_float4
59172 root 20 0 1823240 138576 76736 R 42,5 0,4 0:04.34 bch_decoder_bb1
59169 root 20 0 1823240 138576 76736 S 38,5 0,4 0:03.91 symbol_sync_cc1
59161 root 20 0 1823240 138576 76736 S 36,5 0,4 0:03.78 add_const_ff6
59164 root 20 0 1823240 138576 76736 S 36,2 0,4 0:03.64 add_const_ff5
59166 root 20 0 1823240 138576 76736 R 34,9 0,4 0:03.54 float_to_comple
59162 root 20 0 1823240 138576 76736 S 29,2 0,4 0:03.01 multiply_const_
59165 root 20 0 1823240 138576 76736 S 28,6 0,4 0:02.87 multiply_const_
59171 root 20 0 1823240 138576 76736 S 20,9 0,4 0:02.12 ldpc_decoder_c1
59167 root 20 0 1823240 138576 76736 S 15,6 0,4 0:01.62 agc_cc15
59170 root 20 0 1823240 138576 76736 S 11,0 0,4 0:01.13 plsync_cc18
59168 root 20 0 1823240 138576 76736 S 8,0 0,4 0:00.81 rotator_cc16
59158 root 20 0 1823240 138576 76736 S 2,3 0,4 0:00.21 file_source1
59173 root 20 0 1823240 138576 76736 S 1,0 0,4 0:00.10 bbdescrambler_1
59174 root 20 0 1823240 138576 76736 S 1,0 0,4 0:00.10 bbdeheader_bb14
59175 root 20 0 1823240 138576 76736 S 0,7 0,4 0:00.06 file_sink10
It is weird to have deinterleave2, uchar_to_float3 and uchar_to_float4 consuming that much, isn't it?
I guess there is something wrong. Normally, the cost of the deinterleave2 block should be almost nothing with the QPSK R=8/9 MODCOD (according to the standard, there is no interleaving). If I could find what is wrong, I guess BCH would be the limiting factor and the throughput would be more than doubled...
Do you have an idea?
Thank you again for your time and your help!
Best.
Hi @kouchy
I removed the throttling for the benchmarks, and now I obtain an information throughput of 7.4 Mb/s. It is better :-).
I'm glad the performance improved a bit :) It is not that high yet, though. I'm hoping you can achieve something a little faster.
To the best of my knowledge, there are also different approaches, which I will detail in the paper I'm writing.
Nice!
However, I obtain different block performance. I'm still in the error-free zone, and here is my top output:
I see. So, because you are operating error-free, LDPC and BCH do not consume as much. In the top output I sent earlier, I have around 8 dB SNR, so the codes are doing some work. Btw, if you are using dvbs2-tx to generate the IQ file, you can simulate noise by running it with the --snr option.
It is weird to have deinterleave2, uchar_to_float3 and uchar_to_float4 consuming that much, isn't it?
To be clear, this deinterleave block has nothing to do with the bit deinterleaver used in DVB-S2 for 8PSK and beyond. Instead, it is used in the pipeline responsible for converting the IQ format when reading an IQ file or receiving via a file descriptor. The blocks you mentioned (deinterleave and uchar_to_float) are not used at all when receiving IQ samples via an RTL-SDR or USRP.
The assumption is that the IQ file (or the analogous input via file descriptor) has I and Q samples represented by interleaved chars. However, the flowgraph processes complex numbers (type gr_complex), not chars, so they must be converted back to a complex stream. See the pipeline here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L466
I guess these blocks were simply not implemented for speed. You could get rid of them and try again.
One way to get rid of these blocks is by saving the IQ file in a different format. It seems interleaved I/Q chars is the typical format, but it is not mandatory. The Tx side implements the opposite conversion, from complex numbers to I/Q chars; see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-tx#L101
In the code, the Tx conversion pipeline is:
complex_to_float_0 = blocks.complex_to_float(1)     # split I/Q into two float streams
multiply_const_0 = blocks.multiply_const_ff(128)    # scale I from [-1, 1] to [-128, 128]
multiply_const_1 = blocks.multiply_const_ff(128)    # scale Q likewise
add_const_0 = blocks.add_const_ff(127)              # shift I to the unsigned uchar range
add_const_1 = blocks.add_const_ff(127)              # shift Q likewise
float_to_uchar_0 = blocks.float_to_uchar()          # quantize I to uint8
float_to_uchar_1 = blocks.float_to_uchar()          # quantize Q to uint8
interleaver = blocks.interleave(gr.sizeof_char, 1)  # interleave the I/Q byte streams
This chain could be entirely bypassed if the file sink (here) took a complex input instead of a char input. That is, with a change like the following on dvbs2-tx:
@@ -98,29 +98,14 @@ class dvbs2_tx(gr.top_block):
# Convert the complex IQ stream into an interleaved uchar stream.
throttle = blocks.throttle(gr.sizeof_gr_complex, self.samp_rate,
True)
- complex_to_float_0 = blocks.complex_to_float(1)
- multiply_const_0 = blocks.multiply_const_ff(128)
- multiply_const_1 = blocks.multiply_const_ff(128)
- add_const_0 = blocks.add_const_ff(127)
- add_const_1 = blocks.add_const_ff(127)
- float_to_uchar_0 = blocks.float_to_uchar()
- float_to_uchar_1 = blocks.float_to_uchar()
- interleaver = blocks.interleave(gr.sizeof_char, 1)
if (self.sink == "fd"):
file_or_fd_sink = blocks.file_descriptor_sink(
- gr.sizeof_char, self.out_fd)
+ gr.sizeof_gr_complex, self.out_fd)
else:
- file_or_fd_sink = blocks.file_sink(gr.sizeof_char,
+ file_or_fd_sink = blocks.file_sink(gr.sizeof_gr_complex,
self.out_file)
- self.connect((throttle, 0), (complex_to_float_0, 0))
- self.connect((complex_to_float_0, 0), (multiply_const_0, 0))
- self.connect((complex_to_float_0, 1), (multiply_const_1, 0))
- self.connect((multiply_const_0, 0), (add_const_0, 0),
- (float_to_uchar_0, 0), (interleaver, 0))
- self.connect((multiply_const_1, 0), (add_const_1, 0),
- (float_to_uchar_1, 0), (interleaver, 1))
- self.connect((interleaver, 0), (file_or_fd_sink, 0))
+ self.connect((throttle, 0), (file_or_fd_sink, 0))
# First block on the pipeline
sink = throttle
elif (self.sink == "usrp"):
Correspondingly, on dvbs2-rx, you can make the following changes:
@@ -457,34 +457,15 @@ class DVBS2RxTopBlock(gr.top_block, Qt.QWidget):
if (self.source == "fd" or self.source == "file"):
if (self.source == "fd"):
blocks_file_or_fd_source = blocks.file_descriptor_source(
- gr.sizeof_char, self.in_fd, False)
+ gr.sizeof_gr_complex, self.in_fd, False)
else:
blocks_file_or_fd_source = blocks.file_source(
- gr.sizeof_char, self.in_file, self.in_repeat)
+ gr.sizeof_gr_complex, self.in_file, self.in_repeat)
# Pipeline to convert the fd/file IQ stream into a complex stream,
# assuming the independent I and Q are uint8_t streams.
- blocks_deinterleave = blocks.deinterleave(gr.sizeof_char, 1)
- blocks_uchar_to_float_0 = blocks.uchar_to_float()
- blocks_uchar_to_float_1 = blocks.uchar_to_float()
- blocks_add_const_ff_0 = blocks.add_const_ff(-127)
- blocks_add_const_ff_1 = blocks.add_const_ff(-127)
- blocks_multiply_const_ff_1 = blocks.multiply_const_ff(1 / 128)
- blocks_multiply_const_ff_0 = blocks.multiply_const_ff(1 / 128)
- blocks_float_to_complex_0 = blocks.float_to_complex(1)
blocks_throttle_0 = blocks.throttle(gr.sizeof_gr_complex,
self.samp_rate, True)
- self.connect((blocks_file_or_fd_source, 0),
- (blocks_deinterleave, 0))
- self.connect(
- (blocks_deinterleave, 0), (blocks_uchar_to_float_0, 0),
- (blocks_add_const_ff_0, 0), (blocks_multiply_const_ff_0, 0),
- (blocks_float_to_complex_0, 0))
- self.connect(
- (blocks_deinterleave, 1), (blocks_uchar_to_float_1, 0),
- (blocks_add_const_ff_1, 0), (blocks_multiply_const_ff_1, 0),
- (blocks_float_to_complex_0, 1))
- self.connect((blocks_float_to_complex_0, 0),
- (blocks_throttle_0, 0))
+ self.connect((blocks_file_or_fd_source, 0), (blocks_throttle_0, 0))
source = blocks_throttle_0
elif (self.source == "rtl"):
See if that helps :)
Hi @igorauad,
Thx for your help again, it is well appreciated :-).
I forked the project and applied the modifications on the thr_benchmark branch (https://github.com/kouchy/gr-dvbs2rx/tree/thr_benchmark).
On the previous CPU (Intel(R) Core(TM) i7-9700 @ 3.00GHz, 8 cores), I obtain a throughput of 17 Mbps!
I also ran the code on the server (2x Xeon Platinum 8168 @ 2.7GHz) we are using for the paper, and here are the results:
top - 15:58:48 up 29 min, 2 users, load average: 1.90, 3.29, 3.80
Threads: 790 total, 3 running, 542 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.5 us, 0.6 sy, 0.0 ni, 91.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13163666+total, 9350700 free, 1312000 used, 12097396+buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 12925460+avail Mem
Throughput in Mbps
RES SHR S %CPU %MEM TIME+ COMMAND `gr_dvbs2rx` AFF3CT
136812 75856 R 99.7 0.1 0:10.59 bch_decoder_bb4 14.9 6.9
136812 75856 S 88.2 0.1 0:09.42 symbol_sync_cc9 16.9 ?.?
136812 75856 R 64.8 0.1 0:06.95 ldpc_decoder_cb 23.0 164.2
136812 75856 S 62.2 0.1 0:06.60 agc_cc7 24.0 367.5
136812 75856 S 32.9 0.1 0:03.50 plsync_cc10 45.3 ?.?
136812 75856 S 15.5 0.1 0:01.64 rotator_cc8 96.1 ?.?
136812 75856 S 14.8 0.1 0:01.59 file_source1 100.7 431.8
136812 75856 S 6.6 0.1 0:00.66 bbdescrambler_b 225.8 ?.?
136812 75856 S 5.9 0.1 0:00.59 bbdeheader_bb6 252.5 91.1
136812 75856 S 4.3 0.1 0:00.46 file_sink2 346.5 1838.3
In the error-free zone, your BCH decoder is faster than the AFF3CT BCH decoder :-). The info. throughput is 14.9 Mbps on this server.
Hi @kouchy
Very nice! That is good progress!
If I'm understanding correctly, the Aff3ct LDPC decoder is roughly 7x faster (164.2 vs. 23.0 Mbps)? That sounds appealing :)
Just curious, are you generating these results with a publicly-available tool? Is it part of aff3ct? Or custom-built for the paper experiments?
In the error-free zone, your BCH decoder is faster than the AFF3CT BCH decoder :-).
Interesting that the gr-dvbs2rx BCH decoder is faster. I wonder what it will look like after an optimization round (including some SIMD). I'll report when I find a chance to work on it.
Thanks again
Hi @igorauad,
You're welcome.
Yes, the AFF3CT LDPC decoder sounds much more efficient (and the throughput could even be almost doubled with fixed-point 16-bit integers).
Just curious, are you generating these results with a publicly-available tool? Is it part of aff3ct? Or custom-built for the paper experiments?
This is custom-built for the paper experiments. For gr_dvbs2rx it is an approximation. I know the throughput of the entire Rx because I know the execution time and the size of the input file (on the Tx side). Then, because it is a pipeline, I know that this throughput is the throughput of the slowest block in gr_dvbs2rx. After that, with the CPU occupancy (the %CPU column in top -H), I can deduce the throughput of the other blocks (there are more than 10 HW cores on this machine, so the approximation should not be far from reality). For AFF3CT, there is an integrated --sim-stats option that performs the per-block measurement.
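Concretely, the deduction looks like this (a quick sketch that reproduces the gr_dvbs2rx column of the table above, up to the rounding of the inputs):

# Every block in the pipeline sees the same data rate, so a block at x %CPU
# could sustain roughly T_bottleneck * (cpu_bottleneck / x) on its own.
t_bot, cpu_bot = 14.9, 99.7  # bch_decoder_bb: Mbps at %CPU
cpu = {"symbol_sync_cc": 88.2, "ldpc_decoder_cb": 64.8, "agc_cc": 62.2,
       "plsync_cc": 32.9, "rotator_cc": 15.5, "file_source": 14.8}
for name, pct in cpu.items():
    print(f"{name}: {t_bot * cpu_bot / pct:.1f} Mbps")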
Now I'm trying to fill in the ?.? values. Could you help me clarify what the following blocks are doing, please?
- symbol_sync_cc: I guess this is the timing synchro (Gardner) and maybe the PSK demodulation?
- rotator_cc: I guess this is the frequency synchro (Coarse + Fine L&R + Fine P/F)?
- plsync_cc: I guess this is the frame synchro?
I'm not sure where you are performing the PSK demodulation. Could you help me find where, please?
Interesting that the gr-dvbs2rx BCH decoder is faster. I wonder what it will look like after an optimization round (including some SIMD). I'll report when I find a chance to work on it.
Clearly I'm interested!
Thank you for your time.
Yes, the AFF3CT LDPC decoder sounds much more efficient (and the throughput could even be almost doubled with fixed-point 16-bit integers).
Nice. However, I should point out that gr-dvbs2rx's LDPC decoder block is not just a decoder. A more accurate name for it would be "XFECFRAME-to-BCH-codeword", which is what it does. It takes the XFECFRAME in, does constellation demapping, then bit deinterleaving, and finally LDPC decoding. The output is the BCH codeword with byte-packed hard decisions, which goes into the BCH decoder block.
So this leads us to further clarifications about the blocks.
I'm not sure where you are performing the PSK demodulation. Could you help me find where, please?
As explained above, it happens in the LDPC decoder block, at least for now. I plan to refactor this and separate the XFECFRAME demapping/deinterleaving into its own block, while leaving the LDPC decoder block a pure LDPC decoder (possibly with the aff3ct decoder).
symbol_sync_cc: I guess this is the timing synchro (Gardner) and maybe the PSK demodulation?
The symbol synchronizer does two things simultaneously: root-raised cosine (RRC) matched filtering and symbol timing recovery using a Gardner timing error detector (TED), as you mentioned. As I said before, I developed this block because the in-tree one is too slow. However, it is still possible to alternate between the two using option --sym-sync-impl on dvbs2-rx. If curious, try with --sym-sync-impl in-tree to see the difference.
The in-tree version is more generic. It supports multiple TEDs, fractional resampling rates, and multiple interpolators. In contrast, my implementation is a bit more focused. It only supports the Gardner TED and integer decimation ratios. My implementation does support multiple interpolators, but the fastest is definitely the polyphase interpolator that does joint RRC filtering and symbol timing recovery. If interested, I spent some time looking at these methods and derived most of the conclusions from the experiments in https://github.com/igorauad/symbol_timing_sync.
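For reference, here is a minimal sketch of the Gardner error term (an illustrative numpy snippet, assuming 2 samples/symbol; the loop filter and interpolator are omitted, sign conventions vary across implementations, and this is not the gr-dvbs2rx code):

import numpy as np

def gardner_ted(prev_sym, mid_sample, curr_sym):
    # Error from two consecutive symbol-spaced samples and the sample halfway
    # in between; it averages to zero when the timing phase is correct.
    return np.real(np.conj(mid_sample) * (prev_sym - curr_sym))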
rotator_cc: I guess this is the frequency synchro (Coarse + Fine L&R + Fine P/F)?
Not really. The rotator is just a really simple multiplication of the input by an exp(j*2*pi*fo) complex exponential. It is responsible for correcting a frequency offset fo, but it doesn't know which frequency offset to correct. The block that estimates the frequency offset is the PL Sync block, which then controls the rotator.
The reason I added the rotator as a separate block is that I wanted to execute the frequency offset correction before the symbol timing recovery. The symbol timing recovery algorithm with the Gardner detector is robust to frequency offsets, but it is better to have it operating with low frequency offsets.
The PL Sync block has a message port to the rotator, over which it continuously updates the frequency offset corrected by the rotator. Note also that the symbol synchronizer lies in between the two.
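For intuition, a standalone numpy sketch of the rotator operation (illustrative only; the actual block applies the correction sample by sample and updates it phase-continuously via the message port):

import numpy as np

def rotate(samples, fo_corr):
    # Multiply by a complex exponential whose normalized frequency fo_corr
    # cancels the carrier offset estimated by the PL Sync block.
    n = np.arange(len(samples))
    return samples * np.exp(2j * np.pi * fo_corr * n)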
plsync_cc: I guess this is the frame synchro?
This block performs frame synchronization, coarse and fine frequency offset estimation, phase correction, PL descrambling, PLSC decoding, frame locking logic, and PL pilot removal. That is, it takes the noisy PLFRAMEs in and outputs the corresponding XFECFRAMEs. In other words, it does all the magic to output XFECFRAMEs reliably to the LDPC decoder block. You can find more information in the docstrings throughout the code. This block and its underlying modules are all reasonably well documented:
https://github.com/igorauad/gr-dvbs2rx/blob/master/include/dvbs2rx/plsync_cc.h#L19
https://github.com/igorauad/gr-dvbs2rx/blob/master/lib/plsync_cc_impl.h#L67
The noteworthy limitation of the current state of the PL Sync block is that it doesn't work well without PL pilots yet, as I haven't had a chance to focus on pilotless operation. In the Blockstream Satellite system, we have pilots enabled, so pilot mode has been the focus.
Ok thx! It is clear now! It enables a fair comparison with the AFF3CT blocks. Here is what I obtain:
symbol_sync_cc = matched filter + synchro timing (Gardner) = t_4 + t_5 + t_6
rotator_cc = complex multiplication = t_7
plsync_cc = synchro frame + synchro freq + symbol descrambler + remove PLH + noise estimation = t_3 + t_8 + t_9 + t_10 + t_11 + t_12 + t_13
ldpc_decoder_cb = demap + deinterleave + decode = t_14 + t_15 + t_16
symbol_sync_cc [t_4+t_5+t_6]: l=6872.57 us - T=(16*14232)/l= 33.13 Mb/s
rotator_cc [t_7]: l= 332.18 us - T=(16*14232)/l=685.51 Mb/s
plsync_cc [t_3+t_8+t_9+t_10+t_11+t_12+t_13]: l=4709.84 us - T=(16*14232)/l= 48.35 Mb/s
ldpc_decoder_cb [t_14+t_15+t_16]: l=7182.10 us - T=(16*14232)/l= 31.71 Mb/s
--------------------------------
top - 15:58:48 up 29 min, 2 users, load average: 1.90, 3.29, 3.80
Threads: 790 total, 3 running, 542 sleeping, 0 stopped, 0 zombie
%Cpu(s): 7.5 us, 0.6 sy, 0.0 ni, 91.9 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13163666+total, 9350700 free, 1312000 used, 12097396+buff/cache
KiB Swap: 8388604 total, 8388604 free, 0 used. 12925460+avail Mem
Throughput in Mbps
RES SHR S %CPU %MEM TIME+ COMMAND `gr_dvbs2rx` AFF3CT
136812 75856 R 99.7 0.1 0:10.59 bch_decoder_bb4 14.9 6.9
136812 75856 S 88.2 0.1 0:09.42 symbol_sync_cc9 16.9 33.1
136812 75856 R 64.8 0.1 0:06.95 ldpc_decoder_cb 23.0 31.7
136812 75856 S 62.2 0.1 0:06.60 agc_cc7 24.0 367.5
136812 75856 S 32.9 0.1 0:03.50 plsync_cc10 45.3 48.4
136812 75856 S 15.5 0.1 0:01.64 rotator_cc8 96.1 685.5
136812 75856 S 14.8 0.1 0:01.59 file_source1 100.7 431.8
136812 75856 S 6.6 0.1 0:00.66 bbdescrambler_b 225.8 91.1
136812 75856 S 5.9 0.1 0:00.59 bbdeheader_bb6 252.5 -
136812 75856 S 4.3 0.1 0:00.46 file_sink2 346.5 1838.3
As you can see, I normalized the throughputs by the number of information bits. The t_i indexing refers to the tasks (~= GR blocks) in our previous paper (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).
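For reference, the computation T = (16 * 14232) / l can be reproduced as follows (a quick sketch; 16 * 14232 is the number of information bits covered by each measured latency l, with 14232 being K_bch for the short R=8/9 FECFRAME and l in microseconds, so bits/us comes out directly in Mb/s):

K_BCH, FRAMES = 14232, 16
for name, l_us in [("symbol_sync_cc", 6872.57), ("rotator_cc", 332.18),
                   ("plsync_cc", 4709.84), ("ldpc_decoder_cb", 7182.10)]:
    print(f"{name}: {FRAMES * K_BCH / l_us:.2f} Mb/s")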
Thanks, @kouchy .
Interesting to see that the Aff3ct LDPC decoder is still faster even though you are now including the other stages (t_14 and t_15).
Also, I'm curious about your symbol synchronizer implementation. I haven't had a chance to read the paper. However, from a quick glance, I was under the impression that some stages (like the symbol synchronizer) are trained in the beginning over some frames (an acquisition phase) and then stop tracking the estimates. Is that how it works? Or do these blocks continue the work for as long as the simulation is running?
JFYI, the PL Sync and Symbol Sync implementations on gr-dvbs2rx track the symbol timing and carrier frequency/phase offsets continuously. They can't really stop doing so since the carrier frequency and sampling clock are changing all the time. The only part of the processing that is significantly reduced after an "acquisition phase" is the frame timing recovery. The latter is initially based on cross-correlation over the entire stream of samples. However, when the right timing is found (a cross-correlation peak), and the PLSC is decoded, the block knows where to expect the next cross-correlation peak, so it no longer needs to calculate the cross-corr over the entire stream. At this point, it only calculates the cross-corr on the next expected peak to ensure the peak is indeed observed. So this part of the processing starts with a relatively high CPU usage and quickly drops to a very low usage when the frame timing is locked.
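Schematically, the two frame-timing modes look like this (a numpy illustration, not the actual plsync code):

import numpy as np

def acquire(samples, sof):
    # Acquisition: cross-correlate the whole stream against the known SOF
    # sequence (np.correlate conjugates its second argument).
    corr = np.abs(np.correlate(samples, sof, mode="valid"))
    return int(np.argmax(corr))  # candidate frame start

def verify_lock(samples, sof, expected_start):
    # Locked: a single correlation at the predicted frame start is enough to
    # confirm the peak is still where it should be.
    window = samples[expected_start:expected_start + len(sof)]
    return np.abs(np.vdot(sof, window))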
Hi @igorauad ,
Also, I'm curious about your symbol synchronizer implementation. I haven't had a chance to read the paper. However, from a quick glance, I was under the impression that some stages (like the symbol synchronizer) are trained in the beginning over some frames (an acquisition phase) and then stop tracking the estimates. Is that how it works? Or do these blocks continue the work for as long as the simulation is running?
These blocks continue the work as long as the simulation is running (same as you).
JFYI, the PL Sync and Symbol Sync implementations on gr-dvbs2rx track the symbol timing and carrier frequency/phase offsets continuously. They can't really stop doing so since the carrier frequency and sampling clock are changing all the time. The only part of the processing that is significantly reduced after an "acquisition phase" is the frame timing recovery. The latter is initially based on cross-correlation over the entire stream of samples. However, when the right timing is found (a cross-correlation peak), and the PLSC is decoded, the block knows where to expect the next cross-correlation peak, so it no longer needs to calculate the cross-corr over the entire stream. At this point, it only calculates the cross-corr on the next expected peak to ensure the peak is indeed observed. So this part of the processing starts with a relatively high CPU usage and quickly drops to a very low usage when the frame timing is locked.
This is the same for us.
Thank you again for your time and your help @igorauad. It is sincerely very well appreciated. I think it enables a fair comparison (this is exactly what I wanted to do). I will keep you updated if the paper is accepted :-).
No worries, @kouchy . Happy to help!
Thanks for reporting the interesting experiments, and good luck with the paper!
Hi @kouchy ,
FYI, I have merged the following changes related to this issue:
23707d5381ca44e4fa87cb0e5109d5108c339d7c Changes the file input/output interface to the std::complex<float> format (called fc32) by default, instead of interleaved u8. With that, the pipeline for IQ format conversion (discussed in https://github.com/igorauad/gr-dvbs2rx/issues/6#issuecomment-1075561205) is now disabled by default. It is only enabled with option --out-iq-format u8 on Tx and --in-iq-format u8 on Rx. For your experiments, the default is what you want.
8d95039acaaaaa3d97da814709e637661a32c47c Changes the behavior of the Rx app when reading IQ samples from a file or file descriptor. Now, by default, it will read as fast as possible. That is, it will not simulate the real-time symbol rate, as discussed in https://github.com/igorauad/gr-dvbs2rx/issues/6#issuecomment-1072663180. To simulate the symbol rate, it is now necessary to run with option --in-real-time.
Let me know if you have more questions. Otherwise, I will close the issue for now.
Thanks again.
Hi @igorauad,
Thx for keeping me updated. I think these are good modifications. I don't have other questions on this topic, at least for now.
I close the issue.
Best.
Hi @igorauad,
First, thank you and @drmpeg for this open source project.
I am currently writing a long paper on a different approach to a DVB-S2 transceiver (https://github.com/aff3ct/dvbs2). This transceiver is based on the AFF3CT DSEL. A short paper was published on this DVB-S2 transceiver last year (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).
To the best of my knowledge, gr-dvbs2rx is the most complete open source DVB-S2 Rx implementation. In our solution, we focus on achieving the highest possible throughput by combining efficient/portable SIMD implementations and a multi-threaded system (pipeline + fork/join parallelism). However, we support fewer DVB-S2 configurations than this project, it is assumed :-). We are focusing on the method.
In the previously published paper, we said that gr-dvbs2rx was not designed for high throughput (is it really true?), but I think that is too easy, and I would like to really compare the efficiency of your implementation, to be as fair as possible :-). So, here are my questions:
- How can I bench the throughput and the latency of the Rx? Do you have some advice? I guess I don't need to use a real radio for this comparison; I just want to bench the digital receiver part, so prerecorded samples should be OK.
- I have followed GNU Radio for some time, but I'm still a newbie, and I don't really understand whether GNU Radio uses multi-threading (I think so) and, if yes, how it performs multi-threading. For instance, can the granularity of the GNU Radio multi-threading be inside the DVB-S2 Rx Hierarchical Block or not? (if not, the AFF3CT approach could be complementary to GNU Radio)
For now, I'm playing with this type of command line (at this time, I only succeeded in running the code on the Docker image you proposed):
Do you have some advice to compile the best version (in terms of highest possible throughput) of gr-dvbs2rx? On which platform do you think it performs better (AMD AVX2 or Intel AVX-512 servers)?
I guess it is a lot of questions, but I think it could be a good time to merge some things between AFF3CT and GNU Radio. But it is still unclear to me what to merge :-). In the beginning, AFF3CT was designed to be an ECC toolbox (with fast decoder implementations). After that it grew, and the need to make SDR systems arrived... In the end, we don't want to make a clone of GNU Radio: it would be nonsense.
Any help would be very appreciated!
Thank you in advance.