tapparelj / gr-lora_sdr

This is the fully-functional GNU Radio software-defined radio (SDR) implementation of a LoRa transceiver with all the necessary receiver components to operate correctly even at very low SNRs. This work has been conducted at the Telecommunication Circuits Laboratory, EPFL.
https://www.epfl.ch/labs/tcl/
GNU General Public License v3.0

Performance of cyl_bessel_i() on a low-powered arm64 device #92

Open mskvortsov opened 4 months ago

mskvortsov commented 4 months ago

While running the receiver on a low-powered device like a Raspberry Pi, I'm seeing high CPU load. The signal is sampled at 5 Msps, with SF 11 and BW 250 kHz.

A quick profile of a run-to-completion flowgraph, fed from a File Source with no Throttle block, shows that boost::math::cyl_bessel_i() takes a substantial share of the run time. As it turns out, the default Boost.Math policy promotes double to long double, which this device struggles to compute with.
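To see why the promotion hurts on this class of device, it can help to check how wide long double actually is on the target. The following is a minimal sketch, assuming a C++17 compiler; describe_float is a hypothetical helper and not part of Boost or gr-lora_sdr. On typical arm64 Linux toolchains long double is IEEE binary128, which the CPU does not implement in hardware, while x86 builds usually report the 80-bit x87 format.

```cpp
#include <cassert>
#include <limits>
#include <string>

// Hypothetical helper: name a floating-point type by the width of its
// significand, as reported by std::numeric_limits<T>::digits.
template <typename T>
std::string describe_float() {
    constexpr int digits = std::numeric_limits<T>::digits;
    if (digits == 24)  return "binary32 (24-bit significand)";
    if (digits == 53)  return "binary64 (53-bit significand)";
    if (digits == 64)  return "x87 extended (64-bit significand)";
    if (digits == 113) return "binary128 (113-bit significand)";
    return "other (" + std::to_string(digits) + "-bit significand)";
}
```

Printing describe_float<long double>() on the Raspberry Pi and on a desktop machine shows whether the two platforms are even computing in the same format.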

The promotion can be disabled as described in https://www.boost.org/doc/libs/1_85_0/libs/math/doc/html/math_toolkit/tradoffs.html:

diff --git a/lib/fft_demod_impl.cc b/lib/fft_demod_impl.cc
index 784403a..f622ada 100644
--- a/lib/fft_demod_impl.cc
+++ b/lib/fft_demod_impl.cc
@@ -14,2 +14,5 @@ extern "C" {

+using namespace boost::math::policies;
+auto no_double_promotion_policy = make_policy(promote_double<false>());
+
 namespace gr {
@@ -197,3 +200,4 @@ namespace gr {
                 if (bessel_arg < 713)  // 713 ~ log(std::numeric_limits<LLR>::max())
-                    LLs[n] = boost::math::cyl_bessel_i(0, bessel_arg);  // compute Bessel safely
+                    // TODO? std::cyl_bessel_i() exists since C++17
+                    LLs[n] = boost::math::cyl_bessel_i(0, bessel_arg, no_double_promotion_policy);  // compute Bessel safely
                 else {

The fix gives a whopping ~3x speedup on an RPi 4 with no decoding degradation on my signal. However, I don't know whether long double precision is strictly required or whether it can be dropped just like that.
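One way to get a rough feel for how much precision is at stake is to evaluate I0(x) in both double and long double and compare the results. The sketch below is not Boost's algorithm: it uses the plain power series I0(x) = sum over k of (x/2)^(2k) / (k!)^2, and the names bessel_i0_series and relative_gap are hypothetical. It only shows how the two precisions could be compared, and says nothing about what the receiver actually needs.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Modified Bessel function of the first kind, order 0, by power series,
// evaluated entirely in type T. All terms are positive, so there is no
// cancellation, only accumulated rounding.
template <typename T>
T bessel_i0_series(T x) {
    assert(x >= T(0));
    const T q = (x / T(2)) * (x / T(2));
    T term = T(1);  // k = 0 term
    T sum = T(1);
    for (int k = 1; k < 2000; ++k) {
        term *= q / (T(k) * T(k));  // term_k = term_{k-1} * (x/2)^2 / k^2
        sum += term;
        // Stop once the next terms can no longer change the sum.
        if (term < sum * std::numeric_limits<T>::epsilon()) break;
    }
    return sum;
}

// Relative difference between the double and long double evaluations.
double relative_gap(double x) {
    const long double ref = bessel_i0_series<long double>(x);
    const long double val = bessel_i0_series<double>(x);
    return static_cast<double>(std::fabs((val - ref) / ref));
}
```

Calling relative_gap() over the range of bessel_arg values seen in a capture gives a first indication of how far the double result drifts from the wider one.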

miweber67 commented 4 months ago

> The fix gives a whopping ~3x speedup on an RPi 4 with no decoding degradation on my signal. However, I don't know whether long double precision is strictly required or whether it can be dropped just like that.

You could create a set of test input files of varying 'quality' by adding varying amounts of Gaussian white noise and center frequency shift to see if the precision is an issue for those variables.
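A minimal sketch of how such degraded test inputs could be generated from a clean IQ capture: rotate each sample by a frequency offset and add white Gaussian noise. The function impair and its parameters are hypothetical, the offset is assumed to be normalized (cycles per sample), and the noise scaling is not guaranteed to match any particular GNU Radio block.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using cfloat = std::complex<float>;

// Hypothetical helper: apply a normalized frequency offset (cycles per
// sample) and add complex white Gaussian noise whose real and imaginary
// parts each have standard deviation noise_sigma.
std::vector<cfloat> impair(const std::vector<cfloat>& in, double freq_offset,
                           double noise_sigma, unsigned seed) {
    assert(noise_sigma >= 0.0);
    const double two_pi = 2.0 * std::acos(-1.0);
    std::mt19937 rng(seed);  // fixed seed keeps test files reproducible
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<cfloat> out;
    out.reserve(in.size());
    for (std::size_t n = 0; n < in.size(); ++n) {
        const double phase = two_pi * freq_offset * static_cast<double>(n);
        const cfloat rot(static_cast<float>(std::cos(phase)),
                         static_cast<float>(std::sin(phase)));
        const float noise_re = static_cast<float>(noise_sigma * gauss(rng));
        const float noise_im = static_cast<float>(noise_sigma * gauss(rng));
        out.push_back(in[n] * rot + cfloat(noise_re, noise_im));
    }
    return out;
}
```

Writing the output for a grid of freq_offset and noise_sigma values would give a repeatable set of files to run through both the promoted and non-promoted builds.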

mskvortsov commented 4 months ago

I didn't see any difference in the number of packets decoded with valid CRCs. I used the Channel Model block and varied the noise_voltage and frequency_offset parameters independently, in small steps, until the number of valid CRCs dropped to zero. On the other hand, there are too many other LoRa block configurations to draw a definite conclusion from this limited experiment.

However, a more obvious point is that my 5 Msps sampling rate is somewhat high, and unfortunately it's the lowest usable rate of my receiver. cyl_bessel_i() is called on the order of samp_rate * 2^sf times, so reducing the input sampling rate would probably be a simpler approach for my particular problem.

miweber67 commented 4 months ago

> I didn't see any difference in the number of packets decoded with valid CRCs. I used the Channel Model block and varied the noise_voltage and frequency_offset parameters independently, in small steps, until the number of valid CRCs dropped to zero. On the other hand, there are too many other LoRa block configurations to draw a definite conclusion from this limited experiment.

Nice... a single data point to be sure, but, it's a pleasant single data point. :-)

> However, a more obvious point is that my 5 Msps sampling rate is somewhat high, and unfortunately it's the lowest usable rate of my receiver. cyl_bessel_i() is called on the order of samp_rate * 2^sf times, so reducing the input sampling rate would probably be a simpler approach for my particular problem.

So your frame_sync os_factor is ... 20? In issue #91 it was suggested that 4 should be adequate. If you filter and decimate by 5, do you still get good results?
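The arithmetic behind those numbers, as a small sketch. It assumes the "BW 250" above means 250 kHz; RateInfo and lora_rates are hypothetical names, not part of gr-lora_sdr.

```cpp
#include <cassert>
#include <cmath>

struct RateInfo {
    double os_factor;           // samples per chip after decimation
    double samples_per_symbol;  // os_factor * 2^sf
};

// Hypothetical helper: oversampling factor and samples per LoRa symbol
// for a given input rate, bandwidth, spreading factor and decimation.
RateInfo lora_rates(double samp_rate_hz, double bw_hz, int sf, int decimation) {
    assert(samp_rate_hz > 0.0 && bw_hz > 0.0 && sf > 0 && decimation > 0);
    const double rate = samp_rate_hz / decimation;
    RateInfo r;
    r.os_factor = rate / bw_hz;
    r.samples_per_symbol = std::ldexp(r.os_factor, sf);  // os_factor * 2^sf
    return r;
}
```

With 5 Msps, 250 kHz and SF 11, lora_rates(5e6, 250e3, 11, 1) gives an oversampling factor of 20 and 40960 samples per symbol; decimating by 5 brings that down to 4 and 8192.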

mskvortsov commented 4 months ago

It looks like the Low Pass Filter and Rational Resampler blocks are quite CPU intensive. A receiver flowgraph with additional filtering or resampling blocks puts about 4x more load on the CPU and fully occupies one Cortex-A72 core. I'm going to try a cheap 1 Msps radio next week.