igorauad / gr-dvbs2rx

DVB-S2 Receiver Extensions for GNU Radio
https://igorauad.github.io/gr-dvbs2rx/
GNU General Public License v3.0

Benchmarks and comparison #6

Closed kouchy closed 2 years ago

kouchy commented 2 years ago

Hi @igorauad,

First, thank you and @drmpeg for this open source project.

I am currently writing a long paper on a different approach to a DVB-S2 transceiver (https://github.com/aff3ct/dvbs2). This transceiver is based on the AFF3CT DSEL. A short paper on this DVB-S2 transceiver was published last year (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).

To the best of my knowledge, gr-dvbs2rx is the most complete open-source DVB-S2 RX implementation. In our solution, we focus on achieving the highest possible throughput by combining efficient/portable SIMD implementations with a multi-threaded system (pipeline + fork/join parallelism). However, we admittedly support fewer DVB-S2 configurations than this project :-). We are focusing on the method.

In the previously published paper, we said that gr-dvbs2rx was not designed for high throughput (is it really true?), but I think that claim is too easy to make, and I would like to properly compare against the efficiency of your implementation to be as fair as possible :-).

So, here are my questions:

igorauad commented 2 years ago

Hi @kouchy ,

Thanks for getting in touch. Good timing for these questions, as I'm working on benchmarking the project's BCH and LDPC decoders while comparing them against the aff3ct implementation. I have started with the BCH decoder, and my WIP is here: https://github.com/igorauad/gr-dvbs2rx/tree/aff3ct. I might have some time to work a bit more on this task over the weekend.

So far, it seems to me the aff3ct implementation is faster. However, it uses too much memory. I tried the std, fast, and genius implementations, but I think this member variable is overusing memory: https://github.com/aff3ct/aff3ct/blob/master/src/Module/Decoder/BCH/Standard/Decoder_BCH_std.cpp#L18. I haven't looked into the implementation carefully yet, but I plan to do that soon. I'd assume elp stands for "error location polynomial"? Is it really necessary to store N vectors of size N?

I was able to test BCH codes used with short FECFRAMEs, which I suppose are within GF(2^14). However, I couldn't test the BCH configurations corresponding to normal FECFRAMEs.

In the previously published paper, we said that gr-dvbs2rx was not designed for high throughput (is it really true?)

I'll let @drmpeg comment on the original intentions. My goal when working to make gr-dvbs2rx a fully-functional receiver was mainly to test it with the Blockstream Satellite signal. We've been working with the LeanDVB implementation, and one of the goals was to make gr-dvbs2rx more efficient than leandvb at some point. At the moment, the two implementations have comparable CPU usage, but I continue to work on gr-dvbs2rx. The BCH decoder seems fairly inefficient at the moment and does not use SIMD (unlike the LDPC decoder), which is why I'm investigating substituting it with aff3ct's BCH decoder. I think that would be the easiest CPU gain right now.

That being said, I should note the Blockstream Satellite signal is only 1 Mbaud, so it's not a particularly wideband DVB-S2 carrier. Nevertheless, having a fast implementation is absolutely a design goal for us. Ideally, we would run the DVB-S2 receiver with an RTL-SDR on a Raspberry Pi while still running other heavy applications simultaneously. We can sort of do that now, but we need to tune down the maximum number of LDPC iterations.

How can I benchmark the throughput and the latency of the Rx? Do you have some advice? I guess I don't need a real radio for this comparison; I just want to benchmark the digital receiver part, so prerecorded samples should be OK.

The simplest way I can think of would be to measure the time it takes to decode an IQ recording. You can make an IQ recording using dvbs2-tx if you don't have one (I was also planning to upload some). Then, you can decode the IQ file with dvbs2-rx --source file --in-file xxx.
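
For example, assuming an IQ recording named recording.iq (a placeholder name), you could time the decoding with something like:

time dvbs2-rx --modcod qpsk8/9 --frame-size short --source file --in-file recording.iq --sink file --out-file out.ts

and then divide the number of decoded output bits by the elapsed time.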

It has been some time that I have been following GNU Radio, but I'm still a newbie, and I don't really understand whether GNU Radio uses multi-threading (I think so) and, if so, how it performs the multi-threading.

I think @drmpeg will know better. But yes, GNU Radio is based on multi-threading. Each block spawns a thread. So, in this project's implementation, the following blocks have independent threads:

See the Rx pipeline at: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L606

I don't know many more details about the multi-threading implementation. I'd assume you can find enough info on the GNU Radio project's scheduler. Also, I'd imagine there are GRCon talks available on YouTube.

For instance, can the granularity of the GNU Radio multi-threading go inside the DVB-S2 Rx hierarchical block or not? (If not, the AFF3CT approach could be complementary to GNU Radio.)

Note the main Rx app (dvbs2-rx) does not use the hierarchical block. The hierarchical block is just a convenient wrapper for the example flowgraphs, which are more tailored to experimentation than production usage. In contrast, in the dvbs2-rx app, we instantiate the blocks individually. I decided this is better because it allows for a lot more flexibility. For example, you can choose between two distinct implementations of the symbol synchronizer from the command line (see https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L665). Also, the user may choose to stop the pipeline after the BBFRAME descrambler and output the descrambled BBFRAME stream instead of the MPEG TS stream (see https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L640).

I think the best way to use aff3ct on this project is by replacing the LDPC decoder and/or BCH decoder. That's what I'm planning to do if I find that the aff3ct implementation is faster.

The benchmarking apps are compiled with the BENCHMARK_FEC option discussed here: https://github.com/igorauad/gr-dvbs2rx/blob/aff3ct/docs/installation.md#build-options

Do you have some advice on compiling the best version (in terms of highest possible throughput) of gr-dvbs2rx?

Definitely enable NATIVE_OPTIMIZATIONS. It is off by default because that makes the implementation more portable, especially when building binary packages. The best SIMD implementation for the LDPC decoder is decided at runtime, see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/lib/ldpc_decoder_cb_impl.cc#L651. Hence, if you compile a package on a machine with AVX2 and run the package on another machine without AVX2, it will still work. In contrast, if you enable the NATIVE_OPTIMIZATIONS option, the project will be compiled with -march=native and will only work on your machine. I believe that is what you want.
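
For reference, with a standard out-of-tree CMake build, enabling the option would look something like:

cmake .. -DNATIVE_OPTIMIZATIONS=ON
make -j$(nproc)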

On which platform do you think it performs better (AMD AVX2 or Intel AVX-512 servers)?

There is no support for AVX-512 at the moment. AVX2 is the best available SIMD instruction set.

At the beginning, AFF3CT was designed to be an ECC toolbox (with fast decoder implementations). After that, it grew, and the need to build SDR systems arrived... In the end, we don't want to make a clone of GNU Radio: it would be nonsense.

Thanks for the great work on aff3ct. We've been following and using it for a long time, since our initial version, Blockstream Satellite v1.0 (still available at https://github.com/Blockstream/gr-blocksat, but no longer used).

I'd vote for making aff3ct as efficient as possible on the ECC implementations (both in CPU and memory) while letting GNU Radio do the rest :) Also, I wonder whether aff3ct does runtime detection of SIMD capabilities or defines that at compilation time. Could you confirm?

I was also aware of your aff3ct/dvbs2 project. However, I've never had a chance to run it. Would it support QPSK 3/5 with normal FECFRAMEs and pilot symbols? Seems like another nice alternative for Blockstream Satellite.

Cheers

kouchy commented 2 years ago

Hi @igorauad,

Thank you very much for this complete answer, it is well appreciated!

I did not work a lot on optimizing the BCH decoder. To be honest, it is also one of the limiting factors in our implementation... So, do not expect a big improvement compared to the BCH decoder you are using in this project. I did not work on it personally, and as far as I know, it is a modified version of the Morelos-Zaragoza decoder (the original version can be found here: http://www.eccpage.com/). We asked Morelos-Zaragoza for permission to integrate it into the AFF3CT toolbox. We did not try to reduce its memory footprint, so I guess you might be right when you say it consumes too much memory...

However, I worked on efficient implementations of a generic demodulator and LDPC decoder; these implementations could be faster than the ones you are using (maybe). You can see some measured throughput results in Table II of the paper we published (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).

Thank you for the links to the source code, they are helpful :-). From what I understand, GNU Radio spawns a pipeline stage for each block. In the AFF3CT DSEL (~= runtime), we propose a different approach:

  1. we define the flow graph with all the blocks (called tasks in AFF3CT),
  2. we group some blocks together to create a pipeline stage (this operation depends on the hardware architecture),
    • this way, there are potentially fewer stages than blocks (which is good for performance => less memory usage, better memory locality, fewer synchronizations)
      • some pipeline stages do not depend on previous frames => it is possible to instantiate multiple occurrences of the same pipeline stage (= a sequence of blocks) to increase the throughput of the system (this is what I call fork/join parallelism or sequence duplication); I think that GNU Radio is not able to do that (can you confirm?)
      • we can pin threads to specific CPU cores in the DSEL, which is important on many-core NUMA systems (GNU Radio provides this feature).

My guess is that there are interesting methods in our approach; if this is confirmed, we could then transpose these methods to the GNU Radio runtime. Another important aspect of the AFF3CT DSEL is that it offers loop and branch (if, switch) mechanisms. I can't see how to do that in GNU Radio Companion. Do you think it is possible to model loops in GNU Radio flow graphs?

In our DVB-S2 use case, we are targeting many-core CPUs; this type of machine has a lot of memory, so in our case memory was not a constraint... However, in future work, I would like to focus on low-power systems (with less memory).

To answer your questions:

I'd assume elp stands for "error location polynomial"? Is it really necessary to store N vectors of size N?

I don't know :-/.

Also, I wonder whether aff3ct does runtime detection of SIMD capabilities or defines that at compilation time. Could you confirm?

In AFF3CT, the SIMD capabilities are defined at compilation time. There is no runtime detection at this time.

I was also aware of your aff3ct/dvbs2 project. However, I've never had a chance to run it. Would it support QPSK 3/5 with normal FECFRAMEs and pilot symbols? Seems like another nice alternative for Blockstream Satellite.

At this time, if you use the project directly without any code modification, the AFF3CT DVB-S2 Tx/Rx only supports 3 MODCODs (QPSK 3/5, QPSK 8/9, and 8-PSK 8/9, all with short frames (16200-bit LDPC codewords)). Yes, it supports pilot symbols (@rtajan, can you confirm?). However, with some minor modifications to the source code, the project should be able to support more MODCODs (from my point of view, there is no limitation here).

Best.

kouchy commented 2 years ago

Hi @igorauad,

I ran some benchmarks on an Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz (8 cores) in the error-free zone. I compiled the code with -DNATIVE_OPTIMIZATIONS=ON. Here is the command I used to run the Rx:

dvbs2-rx --modcod qpsk8/9 --frame-size short --source file --in-file samples.iq --sink file --out-file out.ts

I obtained an information throughput of 1.64 Mb/s.

Do you think this is near the expected throughput, or did I miss something?

When I add --log-stats to the command line, the following log is repeated:

gr::log 2022-03-18 15:45:02,530 :INFO: {'lock': True, 'snr': 26.21209716796875, 'plsync': {'coarse_freq_corr': True, 'freq_offset_hz': 0.0009667706635241302, 'frame_count': {'processed': 8670, 'rejected': 0, 'dummy': 0}, 'locked_since': '2022-03-18T15:43:51.523712'}, 'fec': {'frames': 8640, 'errors': 0, 'fer': 0.0, 'avg_ldpc_trials': 0}, 'mpeg-ts': {'packets': 81298, 'per': 0.0}}
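
As a rough cross-check from the log itself (a back-of-the-envelope sketch; it assumes the receiver stayed locked over the whole ~71 s between locked_since and the log timestamp):

    >>> round(81298 * 188 * 8 / 71.0 / 1e6, 2)  # TS packets x 188 bytes x 8 bits / seconds -> Mb/s
    1.72

which is consistent with the 1.64 Mb/s figure I measured from the execution time.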

Thank you in advance.

igorauad commented 2 years ago

Hi @kouchy ,

Apologies for the delay in replying to your earlier message.

as far as I know, it is a modified version of the Morelos-Zaragoza decoder (the original version can be found here: http://www.eccpage.com/).

Thanks for the link. I will check it out.

However, I worked on efficient implementations of a generic demodulator and LDPC decoder; these implementations could be faster than the ones you are using (maybe). You can see some measured throughput results in Table II of the paper we published (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).

Very interesting, thanks for sharing. Yes, there is a lot of room for improvement here in terms of demodulation and decoding.

The blocks I worked on the most are the PL Sync block (low-PHY frame/frequency/phase recovery) and the symbol synchronizer. The former relies heavily on libvolk and is quite fast. The latter is at least a lot faster than the in-tree symbol synchronizer block from GNU Radio. However, the synchronizer is still one of the most expensive blocks, because it processes samples (an oversampled sequence), not symbols, and because it is a bit hard to vectorize (it is a feedback loop). I'm still planning to improve it further by making better use of SIMD (e.g., a direct SIMD implementation instead of calling Volk) and by trading estimation accuracy for lower CPU usage. But I'll only do so after I work on BCH, since, as I said, BCH performance is now the lowest-hanging fruit for better CPU usage.

Thank you for the links to the source code, they are helpful :-). From what I understand, GNU Radio spawns a pipeline stage for each block. In the AFF3CT DSEL (~= runtime), we propose a different approach:

Unfortunately, I don't understand the details of the GNU Radio scheduler. @marcusmueller might be able to help you and point out where to look for further details on how the GR approach differs from yours. Also, perhaps check out his talk here: https://www.youtube.com/watch?v=cTGxhsSvZ9c. Marcus, is this still a relatively up-to-date talk, or is there more recent material?

--

Now, regarding your second comment:

I obtained an information throughput of 1.64 Mb/s. Do you think this is near the expected throughput, or did I miss something?

The main parameter I think you are missing is --sym-rate (or -s for short). When reading the IQ recording from a file, gr-dvbs2rx simulates the real-time throughput of a regular receiver running at a specific symbol rate. Although, I recognize it might be better to make this an option that is disabled by default. The throttling happens here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L474. So try a very high symbol rate and see how it goes.
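
For example (same flags as your command, with an arbitrarily high placeholder symbol rate):

dvbs2-rx --modcod qpsk8/9 --frame-size short --source file --in-file samples.iq --sink file --out-file out.ts --sym-rate 100000000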

A throughput of 1.64 Mbps does seem low. For QPSK 8/9, the spectral efficiency is 1.766451 bits/s/Hz, so 1.64 Mbps corresponds to only about 930 kbaud (1.64e6 / 1.766451 ≈ 928 kbaud). I've been running gr-dvbs2rx at 1 Mbaud on multiple machines with relatively low CPU usage, so 930 kbaud doesn't seem like a reasonable limit.

Also, I don't know how fast the file source block is when the IQ source (option --source) is set to file. In my tests at 1 Mbaud, I'm using the RTL-SDR source, so the file source block is not used (see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L450). The other alternative would be the file descriptor source, which is used with --source=fd (the default). For example, you can try:

cat samples.iq | dvbs2-rx --modcod qpsk8/9 --frame-size short --sink file --out-file out.ts

However, I suspect this won't make a difference, and I'm hoping the file input is not a bottleneck. But since you are comparing against the aff3ct receiver, just bear this interface in mind.

Also, on the PL Sync block, there is a minor optimization when --pilots is set to on/off instead of auto. The difference is that, in this case, the block does not need to decode the PLSC. It already knows the PLSC a priori, so it only needs to search for the frame location. However, the CPU usage difference will be minimal, since this is not a very expensive computation anyway.

Of course, as you surely know, everything depends on the SNR. If your IQ recording has low SNR, there will be more LDPC iterations, and the PL Sync block could do more work if it ever loses frame sync. However, as far as I can tell, the IQ recording you are using is pretty clean, as I see 26 dB SNR in the --log-stats output. So that would not be the reason for the low performance.

For your reference, here is the CPU usage printed by top for the dvbs2-rx threads when I read an IQ file. Note how BCH is currently the bottleneck.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND   
18609 root      20   0 1796444 128256  67424 R  99.7   0.8   0:25.64 bch_decoder_bb1                                       
18608 root      20   0 1796444 128256  67424 R  38.9   0.8   0:15.62 ldpc_decoder_c1                                       
18606 root      20   0 1796444 128256  67424 S  13.0   0.8   0:03.40 symbol_sync_cc1                                       
18595 root      20   0 1796444 128256  67424 S   8.3   0.8   0:02.22 deinterleave2                                         
18604 root      20   0 1796444 128256  67424 S   5.6   0.8   0:01.46 agc_cc16                                              
18607 root      20   0 1796444 128256  67424 S   5.3   0.8   0:01.45 plsync_cc19                                           
18605 root      20   0 1796444 128256  67424 S   3.3   0.8   0:00.97 rotator_cc17                                          
18602 root      20   0 1796444 128256  67424 S   3.0   0.8   0:00.72 float_to_comple                                       
18600 root      20   0 1796444 128256  67424 S   2.3   0.8   0:00.57 add_const_ff5                                         
18601 root      20   0 1796444 128256  67424 S   2.3   0.8   0:00.59 multiply_const_                                       
18603 root      20   0 1796444 128256  67424 S   2.3   0.8   0:00.65 throttle10                                                                 
18596 root      20   0 1796444 128256  67424 S   2.0   0.8   0:00.48 uchar_to_float4                                       
18597 root      20   0 1796444 128256  67424 S   2.0   0.8   0:00.53 add_const_ff6                                                                               
18598 root      20   0 1796444 128256  67424 S   1.7   0.8   0:00.55 multiply_const_                                       
18599 root      20   0 1796444 128256  67424 S   1.7   0.8   0:00.49 uchar_to_float3                                       
18594 root      20   0 1796444 128256  67424 S   1.3   0.8   0:00.45 file_descriptor                                                                           
18610 root      20   0 1796444 128256  67424 S   0.7   0.8   0:00.15 bbdescrambler_1                                   
18611 root      20   0 1796444 128256  67424 S   0.3   0.8   0:00.10 bbdeheader_bb15   

At some point, I would like to make a script to automate the process of running dvbs2-rx and measuring both the number of recovered/decoded bytes and the time it takes to process the IQ recording. Just a simple script, but I haven't had the time to implement it yet.
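
Roughly, what I have in mind is something like the following sketch (file names are placeholders, and it assumes dvbs2-rx is on the PATH):

    #!/usr/bin/env python3
    # Benchmark sketch: run dvbs2-rx on an IQ recording and derive the
    # information throughput from the output TS size and the elapsed time.
    import os
    import subprocess
    import time

    cmd = [
        "dvbs2-rx", "--modcod", "qpsk8/9", "--frame-size", "short",
        "--source", "file", "--in-file", "samples.iq",
        "--sink", "file", "--out-file", "out.ts",
    ]
    t0 = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - t0
    bits = 8 * os.path.getsize("out.ts")
    print(f"{bits / 1e6:.1f} Mbit in {elapsed:.1f} s -> {bits / elapsed / 1e6:.2f} Mbps")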

Lastly, if you spend some time profiling and finding where dvbs2-rx slows down the most, I would be very interested in the results.

Thanks a lot for sharing your results.

kouchy commented 2 years ago

Hi @igorauad,

Thank you very much for your precise and exhaustive answers.

I removed the throttling for the benchmarks; now I obtain an information throughput of 7.4 Mb/s, which is better :-).

Thanks for the link to the talk. I had listened to this talk before, and I'm now aware of the GNU Radio newsched project. I found that many of the ideas and conclusions are similar to what we did in the AFF3CT DSEL + runtime. To the best of my knowledge, there are also some different approaches, which I will detail in the paper I'm writing.

However, I obtain different block performance. I'm still in the error-free zone, and here is my top output:

top - 12:00:30 up 2 days,  1:10,  3 users,  load average: 1,74, 0,67, 0,52
Threads: 473 total,   4 running, 469 sleeping,   0 stopped,   0 zombie
%Cpu(s): 47,7 us, 12,2 sy,  0,0 ni, 39,7 id,  0,0 wa,  0,0 hi,  0,4 si,  0,0 st
MiB Mem :  31893,8 total,  10461,7 free,    644,1 used,  20787,9 buff/cache
MiB Swap:  32768,0 total,  32714,2 free,     53,8 used.  30747,4 avail Mem 

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                            
  59159 root      20   0 1823240 138576  76736 R  99,0   0,4   0:10.10 deinterleave2                                      
  59163 root      20   0 1823240 138576  76736 S  50,2   0,4   0:05.10 uchar_to_float3                                    
  59160 root      20   0 1823240 138576  76736 S  49,8   0,4   0:05.07 uchar_to_float4                                    
  59172 root      20   0 1823240 138576  76736 R  42,5   0,4   0:04.34 bch_decoder_bb1                                    
  59169 root      20   0 1823240 138576  76736 S  38,5   0,4   0:03.91 symbol_sync_cc1                                    
  59161 root      20   0 1823240 138576  76736 S  36,5   0,4   0:03.78 add_const_ff6                                      
  59164 root      20   0 1823240 138576  76736 S  36,2   0,4   0:03.64 add_const_ff5                                      
  59166 root      20   0 1823240 138576  76736 R  34,9   0,4   0:03.54 float_to_comple                                    
  59162 root      20   0 1823240 138576  76736 S  29,2   0,4   0:03.01 multiply_const_                                    
  59165 root      20   0 1823240 138576  76736 S  28,6   0,4   0:02.87 multiply_const_                                    
  59171 root      20   0 1823240 138576  76736 S  20,9   0,4   0:02.12 ldpc_decoder_c1                                    
  59167 root      20   0 1823240 138576  76736 S  15,6   0,4   0:01.62 agc_cc15                                           
  59170 root      20   0 1823240 138576  76736 S  11,0   0,4   0:01.13 plsync_cc18                                        
  59168 root      20   0 1823240 138576  76736 S   8,0   0,4   0:00.81 rotator_cc16                                       
  59158 root      20   0 1823240 138576  76736 S   2,3   0,4   0:00.21 file_source1                                       
  59173 root      20   0 1823240 138576  76736 S   1,0   0,4   0:00.10 bbdescrambler_1                                    
  59174 root      20   0 1823240 138576  76736 S   1,0   0,4   0:00.10 bbdeheader_bb14                                    
  59175 root      20   0 1823240 138576  76736 S   0,7   0,4   0:00.06 file_sink10

It is weird to have deinterleave2, uchar_to_float3, and uchar_to_float4 consuming that much, isn't it? I guess there is something wrong. Normally, the cost of the deinterleave2 block should be almost nothing with the QPSK R=8/9 MODCOD (according to the standard, there is no interleaving). If I could find what is wrong, I guess BCH would become the limiting factor and the throughput would be more than doubled...

Do you have an idea?

Thank you again for your time and your help!

Best.

igorauad commented 2 years ago

Hi @kouchy

I removed the throttling for the benchmarks; now I obtain an information throughput of 7.4 Mb/s, which is better :-).

I'm glad the performance improved a bit :) However, it is still not that high. I'm hoping you can achieve something a little faster.

To the best of my knowledge, there are also some different approaches, which I will detail in the paper I'm writing.

Nice!

However, I obtain different block performance. I'm still in the error-free zone, and here is my top output:

I see. So, because you are operating error-free, LDPC and BCH do not consume as much. In the top output I sent earlier, I had around 8 dB SNR, so the codes were doing some work. Btw, if you are using dvbs2-tx to generate the IQ file, you can simulate noise by running it with option --snr.

It is weird to have deinterleave2, uchar_to_float3, and uchar_to_float4 consuming that much, isn't it?

To be clear, this deinterleave block has nothing to do with the bit deinterleaver used in DVB-S2 for 8PSK and beyond. Instead, it is part of the pipeline responsible for converting the IQ format when reading an IQ file or receiving via a file descriptor. The blocks you mentioned (deinterleave and uchar_to_float) are not used at all when receiving IQ samples via an RTL-SDR or USRP.

The assumption is that the IQ file (or the analogous input via file descriptor) carries I and Q samples represented as interleaved chars. However, the flowgraph processes complex numbers (type gr_complex), not chars, so they must be converted back into a complex stream. See the pipeline here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-rx#L466
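
In NumPy terms, the conversion amounts to something like the following sketch (samples.iq is a placeholder name; the -127 offset and 1/128 scaling mirror the blocks in the pipeline linked above):

    import numpy as np

    # Interleaved uint8 I/Q -> complex float stream, equivalent to the
    # deinterleave -> uchar_to_float -> add_const(-127) ->
    # multiply_const(1/128) -> float_to_complex chain.
    raw = np.fromfile("samples.iq", dtype=np.uint8)
    iq = (raw.astype(np.float32) - 127.0) / 128.0
    samples = iq[0::2] + 1j * iq[1::2]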

I guess these blocks were simply not implemented for speed. You could get rid of them and try again.

One way to get rid of these blocks is to save the IQ file in a different format. Interleaved I/Q chars seem to be the typical format, but it is not mandatory. The Tx side implements the opposite conversion, from complex numbers to I/Q chars, see here: https://github.com/igorauad/gr-dvbs2rx/blob/master/apps/dvbs2-tx#L101

The pipeline is like so:

complex_to_float → multiply_const → add_const → float_to_uchar → interleave

In the code, it is:

            complex_to_float_0 = blocks.complex_to_float(1)
            multiply_const_0 = blocks.multiply_const_ff(128)
            multiply_const_1 = blocks.multiply_const_ff(128)
            add_const_0 = blocks.add_const_ff(127)
            add_const_1 = blocks.add_const_ff(127)
            float_to_uchar_0 = blocks.float_to_uchar()
            float_to_uchar_1 = blocks.float_to_uchar()
            interleaver = blocks.interleave(gr.sizeof_char, 1)

This chain could be entirely bypassed if the file sink (here) took a complex input instead of a char input. That is, with a change like the following on dvbs2-tx:

@@ -98,29 +98,14 @@ class dvbs2_tx(gr.top_block):
             # Convert the complex IQ stream into an interleaved uchar stream.
             throttle = blocks.throttle(gr.sizeof_gr_complex, self.samp_rate,
                                        True)
-            complex_to_float_0 = blocks.complex_to_float(1)
-            multiply_const_0 = blocks.multiply_const_ff(128)
-            multiply_const_1 = blocks.multiply_const_ff(128)
-            add_const_0 = blocks.add_const_ff(127)
-            add_const_1 = blocks.add_const_ff(127)
-            float_to_uchar_0 = blocks.float_to_uchar()
-            float_to_uchar_1 = blocks.float_to_uchar()
-            interleaver = blocks.interleave(gr.sizeof_char, 1)

             if (self.sink == "fd"):
                 file_or_fd_sink = blocks.file_descriptor_sink(
-                    gr.sizeof_char, self.out_fd)
+                    gr.sizeof_gr_complex, self.out_fd)
             else:
-                file_or_fd_sink = blocks.file_sink(gr.sizeof_char,
+            file_or_fd_sink = blocks.file_sink(gr.sizeof_gr_complex,
                                                    self.out_file)
-            self.connect((throttle, 0), (complex_to_float_0, 0))
-            self.connect((complex_to_float_0, 0), (multiply_const_0, 0))
-            self.connect((complex_to_float_0, 1), (multiply_const_1, 0))
-            self.connect((multiply_const_0, 0), (add_const_0, 0),
-                         (float_to_uchar_0, 0), (interleaver, 0))
-            self.connect((multiply_const_1, 0), (add_const_1, 0),
-                         (float_to_uchar_1, 0), (interleaver, 1))
-            self.connect((interleaver, 0), (file_or_fd_sink, 0))
+            self.connect((throttle, 0), (file_or_fd_sink, 0))
             # First block on the pipeline
             sink = throttle
         elif (self.sink == "usrp"):

Correspondingly, on dvbs2-rx, you can make the following changes:

@@ -457,34 +457,15 @@ class DVBS2RxTopBlock(gr.top_block, Qt.QWidget):
         if (self.source == "fd" or self.source == "file"):
             if (self.source == "fd"):
                 blocks_file_or_fd_source = blocks.file_descriptor_source(
-                    gr.sizeof_char, self.in_fd, False)
+                    gr.sizeof_gr_complex, self.in_fd, False)
             else:
                 blocks_file_or_fd_source = blocks.file_source(
-                    gr.sizeof_char, self.in_file, self.in_repeat)
+                    gr.sizeof_gr_complex, self.in_file, self.in_repeat)
             # Pipeline to convert the fd/file IQ stream into a complex stream,
             # assuming the independent I and Q are uint8_t streams.
-            blocks_deinterleave = blocks.deinterleave(gr.sizeof_char, 1)
-            blocks_uchar_to_float_0 = blocks.uchar_to_float()
-            blocks_uchar_to_float_1 = blocks.uchar_to_float()
-            blocks_add_const_ff_0 = blocks.add_const_ff(-127)
-            blocks_add_const_ff_1 = blocks.add_const_ff(-127)
-            blocks_multiply_const_ff_1 = blocks.multiply_const_ff(1 / 128)
-            blocks_multiply_const_ff_0 = blocks.multiply_const_ff(1 / 128)
-            blocks_float_to_complex_0 = blocks.float_to_complex(1)
             blocks_throttle_0 = blocks.throttle(gr.sizeof_gr_complex,
                                                 self.samp_rate, True)
-            self.connect((blocks_file_or_fd_source, 0),
-                         (blocks_deinterleave, 0))
-            self.connect(
-                (blocks_deinterleave, 0), (blocks_uchar_to_float_0, 0),
-                (blocks_add_const_ff_0, 0), (blocks_multiply_const_ff_0, 0),
-                (blocks_float_to_complex_0, 0))
-            self.connect(
-                (blocks_deinterleave, 1), (blocks_uchar_to_float_1, 0),
-                (blocks_add_const_ff_1, 0), (blocks_multiply_const_ff_1, 0),
-                (blocks_float_to_complex_0, 1))
-            self.connect((blocks_float_to_complex_0, 0),
-                         (blocks_throttle_0, 0))
+            self.connect((blocks_file_or_fd_source, 0), (blocks_throttle_0, 0))
             source = blocks_throttle_0
         elif (self.source == "rtl"):

See if that helps :)

kouchy commented 2 years ago

Hi @igorauad,

Thanks for your help again; it is well appreciated :-). I forked the project and applied the modifications on the thr_benchmark branch (https://github.com/kouchy/gr-dvbs2rx/tree/thr_benchmark).

On the previous CPU (Intel(R) Core(TM) i7-9700 @ 3.00GHz, 8 cores), I obtain a throughput of 17 Mbps!

I also ran the code on the server (2x Xeon Platinum 8168 @ 2.7GHz) that we are using for the paper, and here are the results:

top - 15:58:48 up 29 min,  2 users,  load average: 1.90, 3.29, 3.80
Threads: 790 total,   3 running, 542 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.5 us,  0.6 sy,  0.0 ni, 91.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13163666+total,  9350700 free,  1312000 used, 12097396+buff/cache
KiB Swap:  8388604 total,  8388604 free,        0 used. 12925460+avail Mem 

                                                      Throughput in Mbps
   RES    SHR S %CPU %MEM     TIME+ COMMAND          `gr_dvbs2rx` AFF3CT
136812  75856 R 99.7  0.1   0:10.59 bch_decoder_bb4         14.9     6.9
136812  75856 S 88.2  0.1   0:09.42 symbol_sync_cc9         16.9     ?.?
136812  75856 R 64.8  0.1   0:06.95 ldpc_decoder_cb         23.0   164.2
136812  75856 S 62.2  0.1   0:06.60 agc_cc7                 24.0   367.5
136812  75856 S 32.9  0.1   0:03.50 plsync_cc10             45.3     ?.?
136812  75856 S 15.5  0.1   0:01.64 rotator_cc8             96.1     ?.?
136812  75856 S 14.8  0.1   0:01.59 file_source1           100.7   431.8
136812  75856 S  6.6  0.1   0:00.66 bbdescrambler_b        225.8     ?.?
136812  75856 S  5.9  0.1   0:00.59 bbdeheader_bb6         252.5    91.1
136812  75856 S  4.3  0.1   0:00.46 file_sink2             346.5  1838.3

In the error-free zone, your BCH decoder is faster than the AFF3CT BCH decoder :-). The information throughput is 14.9 Mbps on this server.

igorauad commented 2 years ago

Hi @kouchy

Very nice! That is good progress!

If I'm understanding correctly, the Aff3ct LDPC decoder is approximately 10x faster? That sounds appealing :)

Just curious, are you generating these results with a publicly-available tool? Is it part of aff3ct? Or custom-built for the paper experiments?

In the error-free zone, your BCH decoder is faster than the AFF3CT BCH decoder :-).

Interesting that the gr-dvbs2rx BCH decoder is faster. I wonder what it will look like after an optimization round (including some SIMD). I'll report when I find a chance to work on it.

Thanks again

kouchy commented 2 years ago

Hi @igorauad,

You're welcome.

Yes, the AFF3CT LDPC decoder seems much more efficient (and the throughput could be almost doubled with fixed 16-bit integers).

Just curious, are you generating these results with a publicly-available tool? Is it part of aff3ct? Or custom-built for the paper experiments?

This is custom-built for the paper experiments. For gr_dvbs2rx, it is an approximation. I know the throughput of the entire Rx because I know the execution time and the size of the input file (on the Tx side). Then, because it is a pipeline, I know that this throughput is the throughput of the slowest block in gr_dvbs2rx. After that, with the CPU occupancy (the %CPU column in top -H), I can deduce the throughput of the other blocks (there are more than 10 HW cores on this machine, so the approximation should not be far from reality). For AFF3CT, there is an integrated --sim-stats option that performs the per-block measurement.
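
In formula form, the approximation is:

    T_block ≈ T_bottleneck × (%CPU_bottleneck / %CPU_block)

For example, with bch_decoder_bb as the bottleneck (99.7% CPU for 14.9 Mbps), symbol_sync_cc at 88.2% CPU gives 14.9 × 99.7 / 88.2 ≈ 16.9 Mbps, as in the table above.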

Now I'm trying to fill in the ?.? entries. Could you help me clarify what the following blocks are doing, please?

Interesting that the gr-dvbs2rx BCH decoder is faster. I wonder what it will look like after an optimization round (including some SIMD). I'll report when I find a chance to work on it.

Clearly I'm interested!

Thank you for your time.

igorauad commented 2 years ago

Yes, the AFF3CT LDPC decoder seems much more efficient (and the throughput could be almost doubled with fixed 16-bit integers).

Nice. However, I should point out that gr-dvbs2rx's LDPC decoder block is not just a decoder. A more accurate name for it would be "XFECFRAME-to-BCH-codeword", which is what it does. It takes the XFECFRAME in, performs constellation demapping and bit deinterleaving, and finally does the LDPC decoding. The output is the BCH codeword with byte-packed hard decisions, which goes into the BCH decoder block.

So this leads us to further clarifications about the blocks.

I'm not sure where you are performing the PSK demodulation. Could you help me find where, please?

As explained above, it happens in the LDPC decoder block, at least for now. I plan to refactor this and separate the demapping/deinterleaving into its own block, leaving the LDPC decoder block as a pure LDPC decoder (possibly based on the aff3ct decoder).

symbol_sync_cc: I guess this is the timing synchro (Gardner) and maybe the PSK demodulation?

The symbol synchronizer does two things simultaneously: root-raised cosine (RRC) matched filtering and symbol timing recovery using a Gardner timing error detector (TED), as you mentioned. As I said before, I developed this block because the in-tree one is too slow. However, it is still possible to alternate between the two using option --sym-sync-impl on dvbs2-rx. If curious, try with --sym-sync-impl in-tree to see the difference.

The in-tree version is more generic. It supports multiple TEDs, fractional resampling ratios, and multiple interpolators. In contrast, my implementation is more focused. It only supports the Gardner TED and integer decimation ratios. My implementation does support multiple interpolators, but the fastest is definitely the polyphase interpolator, which does joint RRC filtering and symbol timing recovery. If interested, I spent some time looking at these methods and derived most conclusions from the experiments in https://github.com/igorauad/symbol_timing_sync.

rotator_cc: I guess this is the frequency synchro (Coarse + Fine L&R + Fine P/F) ?

Not really. The rotator is just a simple multiplication of the input by a complex exponential exp(j*2*pi*fo*t). It is responsible for correcting a frequency offset fo, but it doesn't know which frequency offset to correct. The block that estimates the frequency offset is the PL Sync block, which then controls the rotator.

The reason I added the rotator as a separate block is that I wanted to execute the frequency offset correction before the symbol timing recovery. The symbol timing recovery algorithm with the Gardner detector is robust to frequency offsets, but it is better to have it operate with low frequency offsets.
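
In code terms, the rotator boils down to something like this NumPy sketch (fo is the frequency offset in Hz, fs the sample rate; the actual block keeps a running phase across buffers):

    import numpy as np

    def derotate(x, fo, fs):
        # Multiply by a complex exponential spinning at -fo Hz to undo
        # a +fo Hz carrier frequency offset on the input samples.
        n = np.arange(len(x))
        return x * np.exp(-2j * np.pi * fo * n / fs)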

You can see below how the PL Sync block has a message port to the rotator. This is the port over which it continuously updates the frequency offset corrected by the rotator. Note also that the symbol synchronizer lies in between.

[Figure: dvbs2rx flowgraph]

plsync_cc: I guess this is the frame synchro ?

This block performs frame synchronization, coarse and fine frequency offset estimation, phase correction, PL descrambling, PLSC decoding, frame locking logic, and PL pilot removal. That is, it takes the noisy PLFRAMEs in and outputs the corresponding XFECFRAMEs. In other words, it does all the magic needed to output XFECFRAMEs reliably to the LDPC decoder block. You can find more information in the docstrings throughout the code. This block and its underlying modules are all reasonably well documented:

https://github.com/igorauad/gr-dvbs2rx/blob/master/include/dvbs2rx/plsync_cc.h#L19

https://github.com/igorauad/gr-dvbs2rx/blob/master/lib/plsync_cc_impl.h#L67

The noteworthy limitation in the current state of the PL Sync block is that it doesn't work well without PL pilots yet. I haven't had a chance to focus on pilotless operation. In the Blockstream Satellite system, we have pilots enabled, so pilot mode has been the focus.

kouchy commented 2 years ago

OK, thanks! It is clear now! This enables a fair comparison with the AFF3CT blocks. Here is what I obtain:

symbol_sync_cc = matched filter + synchro timing (Gardner) = t_4 + t_5 + t_6
rotator_cc = complex multiplication = t_7
plsync_cc = synchro frame + synchro freq + symbol descrambler + remove PLH + noise estimation = t_3 + t_8 + t_9 + t_10 + t_11 + t_12 + t_13
ldpc_decoder_cb = demap + deinterleave + decode = t_14 + t_15 + t_16

symbol_sync_cc [t_4+t_5+t_6]:                l=6872.57 us - T=(16*14232)/l= 33.13 Mb/s
rotator_cc [t_7]:                            l= 332.18 us - T=(16*14232)/l=685.51 Mb/s
plsync_cc [t_3+t_8+t_9+t_10+t_11+t_12+t_13]: l=4709.84 us - T=(16*14232)/l= 48.35 Mb/s
ldpc_decoder_cb [t_14+t_15+t_16]:            l=7182.10 us - T=(16*14232)/l= 31.71 Mb/s

--------------------------------

top - 15:58:48 up 29 min,  2 users,  load average: 1.90, 3.29, 3.80
Threads: 790 total,   3 running, 542 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.5 us,  0.6 sy,  0.0 ni, 91.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13163666+total,  9350700 free,  1312000 used, 12097396+buff/cache
KiB Swap:  8388604 total,  8388604 free,        0 used. 12925460+avail Mem 

                                                      Throughput in Mbps
   RES    SHR S %CPU %MEM     TIME+ COMMAND          `gr_dvbs2rx` AFF3CT
136812  75856 R 99.7  0.1   0:10.59 bch_decoder_bb4         14.9     6.9
136812  75856 S 88.2  0.1   0:09.42 symbol_sync_cc9         16.9    33.1 
136812  75856 R 64.8  0.1   0:06.95 ldpc_decoder_cb         23.0    31.7
136812  75856 S 62.2  0.1   0:06.60 agc_cc7                 24.0   367.5
136812  75856 S 32.9  0.1   0:03.50 plsync_cc10             45.3    48.4 
136812  75856 S 15.5  0.1   0:01.64 rotator_cc8             96.1   685.5 
136812  75856 S 14.8  0.1   0:01.59 file_source1           100.7   431.8
136812  75856 S  6.6  0.1   0:00.66 bbdescrambler_b        225.8    91.1
136812  75856 S  5.9  0.1   0:00.59 bbdeheader_bb6         252.5       -
136812  75856 S  4.3  0.1   0:00.46 file_sink2             346.5  1838.3

As you can see, I normalized the throughputs by the number of information bits. The t_i indexing refers to the tasks (~= GR blocks) in our previous paper (https://hal.archives-ouvertes.fr/hal-03336450/file/article.pdf).

igorauad commented 2 years ago

Thanks, @kouchy .

Interesting to see that the Aff3ct LDPC decoder continues to be faster even though you are now including the other stages (t_14 and t_15).

Also, I'm curious about your symbol synchronizer implementation. I haven't had a chance to read the paper. However, from a quick glance, I was under the impression that some stages (like the symbol synchronizer) are trained in the beginning over some frames (an acquisition phase) and then stop tracking the estimates. Is that how it works? Or do these blocks continue the work for as long as the simulation is running?

JFYI, the PL Sync and Symbol Sync implementations on gr-dvbs2rx track the symbol timing and carrier frequency/phase offsets continuously. They can't really stop doing so since the carrier frequency and sampling clock are changing all the time. The only part of the processing that is significantly reduced after an "acquisition phase" is the frame timing recovery. The latter is initially based on cross-correlation over the entire stream of samples. However, when the right timing is found (a cross-correlation peak), and the PLSC is decoded, the block knows where to expect the next cross-correlation peak, so it no longer needs to calculate the cross-corr over the entire stream. At this point, it only calculates the cross-corr on the next expected peak to ensure the peak is indeed observed. So this part of the processing starts with a relatively high CPU usage and quickly drops to a very low usage when the frame timing is locked.
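
Conceptually, the frame timing logic is roughly like this sketch (not the actual implementation; sof is the known SOF symbol sequence, frame_len a hypothetical fixed frame length in samples, and the threshold is arbitrary):

    import numpy as np

    def find_frame(x, sof, locked_at=None, frame_len=None):
        if locked_at is None:
            # Acquisition: cross-correlate the known SOF against the
            # whole stream and take the strongest peak.
            corr = np.abs(np.correlate(x, sof, mode="valid"))
            return int(np.argmax(corr))
        # Tracking: evaluate the correlation only at the next expected
        # frame start and confirm the peak is indeed observed there.
        start = locked_at + frame_len
        peak = np.abs(np.vdot(sof, x[start:start + len(sof)]))
        threshold = 0.5 * np.abs(np.vdot(sof, sof))
        return start if peak > threshold else None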

kouchy commented 2 years ago

Hi @igorauad ,

Also, I'm curious about your symbol synchronizer implementation. I haven't had a chance to read the paper. However, from a quick glance, I was under the impression that some stages (like the symbol synchronizer) are trained in the beginning over some frames (an acquisition phase) and then stop tracking the estimates. Is that how it works? Or do these blocks continue the work for as long as the simulation is running?

These blocks continue working for as long as the simulation is running (same as you).

JFYI, the PL Sync and Symbol Sync implementations on gr-dvbs2rx track the symbol timing and carrier frequency/phase offsets continuously. They can't really stop doing so since the carrier frequency and sampling clock are changing all the time. The only part of the processing that is significantly reduced after an "acquisition phase" is the frame timing recovery. The latter is initially based on cross-correlation over the entire stream of samples. However, when the right timing is found (a cross-correlation peak), and the PLSC is decoded, the block knows where to expect the next cross-correlation peak, so it no longer needs to calculate the cross-corr over the entire stream. At this point, it only calculates the cross-corr on the next expected peak to ensure the peak is indeed observed. So this part of the processing starts with a relatively high CPU usage and quickly drops to a very low usage when the frame timing is locked.

This is the same for us.

Thank you again for your time and your help, @igorauad. It is sincerely very much appreciated. I think this enables a fair comparison (which is exactly what I wanted to do). I will keep you updated if the paper is accepted :-).

igorauad commented 2 years ago

No worries, @kouchy . Happy to help!

Thanks for reporting the interesting experiments, and good luck with the paper!

igorauad commented 2 years ago

Hi @kouchy ,

FYI, I have merged the following changes related to this issue:

Let me know if you have more questions. Otherwise, I will close the issue for now.

Thanks again.

kouchy commented 2 years ago

Hi @igorauad,

Thanks for keeping me updated. I think these are good modifications. I don't have any other questions on this topic, at least for now.

I am closing the issue.

Best.