A bug in decoder implementation

jkadbear commented 4 years ago

In decoder_impl.cc, line 454, the implementation of the method max_frequency_gradient_idx has a bug:

            float max_gradient = 0.1f;
            float gradient = 0.0f;
            uint32_t max_index = 0;
            for (uint32_t i = 1u; i < d_number_of_bins; i++) {
                gradient = samples_ifreq_avg[i - 1] - samples_ifreq_avg[i];
                if (gradient > max_gradient) {
                    max_gradient = gradient;
                    max_index = i+1;
                }
            }

            return (d_number_of_bins - max_index) % d_number_of_bins;

If the maximum gradient occurs at i = d_number_of_bins - 1, then max_index = d_number_of_bins and the return value is 0. On the other hand, if the symbol is an unshifted chirp, all calculated gradients are negative (below the initial max_gradient of 0.1f), max_index stays 0, and the return value is also 0. Two different symbols are therefore mapped to the same value. This error occurs whenever an unshifted chirp symbol is received.
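To make the collision concrete, here is a minimal standalone reproduction of the loop above. It assumes samples_ifreq_avg can be modeled as a plain std::vector<float> and takes d_number_of_bins from its size (in gr-lora both belong to the decoder). The inputs are idealized SF=2 frequency ramps, not real captures, and check_collision is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Standalone copy of the loop in max_frequency_gradient_idx (hypothetical free
// function): d_number_of_bins is taken from the size of the input vector
// instead of the decoder member of the same name.
uint32_t max_frequency_gradient_idx(const std::vector<float>& samples_ifreq_avg) {
    const uint32_t d_number_of_bins = static_cast<uint32_t>(samples_ifreq_avg.size());
    float max_gradient = 0.1f;   // threshold a drop must exceed
    float gradient = 0.0f;
    uint32_t max_index = 0;      // stays 0 when no drop exceeds the threshold
    for (uint32_t i = 1u; i < d_number_of_bins; i++) {
        // Positive gradient: the frequency drops between bin i-1 and bin i.
        gradient = samples_ifreq_avg[i - 1] - samples_ifreq_avg[i];
        if (gradient > max_gradient) {
            max_gradient = gradient;
            max_index = i + 1;
        }
    }
    return (d_number_of_bins - max_index) % d_number_of_bins;
}

// Idealized SF=2 ramps (4 bins): the unshifted ramp has no drop, the others
// drop at i = 3, 2 and 1. The first two collide on the value 0.
void check_collision() {
    assert(max_frequency_gradient_idx({0.0f, 1.0f, 2.0f, 3.0f}) == 0);  // no drop
    assert(max_frequency_gradient_idx({1.0f, 2.0f, 3.0f, 0.0f}) == 0);  // drop at i = 3
    assert(max_frequency_gradient_idx({2.0f, 3.0f, 0.0f, 1.0f}) == 1);  // drop at i = 2
    assert(max_frequency_gradient_idx({3.0f, 0.0f, 1.0f, 2.0f}) == 2);  // drop at i = 1
}
```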

For example, let the LoRa node send the 4 bytes "0x11 0x10 0x10 0x01" with SF=7, CR=4/5 and BW=125kHz; the first data symbol is then an unshifted chirp. The correct decoding result should be "04 30 60 11 10 10 1 82 D3". However, the current gr-lora gives "04 30 60 10 10 10 1 82 D3".

Initializing it as uint32_t max_index = 1; fixes this bug.
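The effect of this one-line change can be checked on idealized input. In the sketch below, gradient_idx_with_init and compare_initial_values are hypothetical names: the loop is a standalone copy of the snippet with the initial value of max_index made a parameter, and the 4-bin ramps are toy SF=2 inputs. It only shows what changes; it does not settle which value is correct.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical variant of the same loop in which the initial value of
// max_index is a parameter, so the original (0) and the proposed (1) can be
// compared side by side on idealized inputs.
uint32_t gradient_idx_with_init(const std::vector<float>& samples_ifreq_avg,
                                uint32_t initial_max_index) {
    const uint32_t n = static_cast<uint32_t>(samples_ifreq_avg.size());
    float max_gradient = 0.1f;
    uint32_t max_index = initial_max_index;
    for (uint32_t i = 1u; i < n; i++) {
        const float gradient = samples_ifreq_avg[i - 1] - samples_ifreq_avg[i];
        if (gradient > max_gradient) {
            max_gradient = gradient;
            max_index = i + 1;
        }
    }
    return (n - max_index) % n;
}

void compare_initial_values() {
    const std::vector<float> unshifted = {0.0f, 1.0f, 2.0f, 3.0f};  // no drop
    const std::vector<float> drop_at_3 = {1.0f, 2.0f, 3.0f, 0.0f};  // drop at i = n - 1
    // max_index = 0: both inputs decode to 0.
    assert(gradient_idx_with_init(unshifted, 0) == 0);
    assert(gradient_idx_with_init(drop_at_3, 0) == 0);
    // max_index = 1: the unshifted input now decodes to n - 1 = 3.
    assert(gradient_idx_with_init(unshifted, 1) == 3);
    assert(gradient_idx_with_init(drop_at_3, 1) == 0);
}
```

With the proposed initialization the two inputs no longer collide, but the unshifted one now decodes to d_number_of_bins - 1, which is the point rpp0 raises next.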

rpp0 commented 4 years ago

Hmm, I'm not sure about this. The reason why these map to the same value is because you have 2^SF+1 possible values if you use the gradient to demodulate. The extra value is the case where the symbol is (perfectly) unshifted and there is no gradient at all, as opposed to a gradient in one of the 2^SF bins. In reality, there is always some offset due to alignment issues, and the gradient of unshifted symbols ends up near i=d_number_of_bins-1, in which case the value of the symbol is 0. However, in cases where the symbol is perfectly unshifted, I think it should map to 0 rather than d_number_of_bins - 1.

When using LoRa's original FFT demodulation, you always have a peak but a peak at 0 and at 2^SF would similarly map to the same value of 0.
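For comparison, here is a minimal dechirp-and-DFT sketch of FFT demodulation. It assumes an ideal, noise-free, perfectly synchronized chirp sampled at fs = BW (2^SF samples per symbol), uses a common textbook chirp model rather than gr-lora's code, and replaces the FFT with a naive O(N^2) DFT. make_chirp, fft_demod and check_wraparound are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
const double kPi = std::acos(-1.0);

// Idealized upchirp for symbol value s with n = 2^SF samples at fs = BW:
// phase = pi*k^2/n - pi*k + 2*pi*s*k/n. The frequency wrap needs no special
// handling because a jump of BW equals a phase step of 2*pi per sample.
std::vector<cplx> make_chirp(std::size_t n, double s) {
    std::vector<cplx> x(n);
    for (std::size_t k = 0; k < n; ++k) {
        const double kk = static_cast<double>(k);
        const double phase = kPi * kk * kk / n - kPi * kk + 2.0 * kPi * s * kk / n;
        x[k] = std::polar(1.0, phase);
    }
    return x;
}

// Dechirp with the conjugate base upchirp (s = 0), then pick the DFT bin with
// the largest magnitude. Bins run from 0 to n - 1; there is no bin n.
std::size_t fft_demod(const std::vector<cplx>& rx) {
    const std::size_t n = rx.size();
    const std::vector<cplx> base = make_chirp(n, 0.0);
    std::vector<cplx> dechirped(n);
    for (std::size_t k = 0; k < n; ++k) dechirped[k] = rx[k] * std::conj(base[k]);

    std::size_t best_bin = 0;
    double best_mag = -1.0;
    for (std::size_t bin = 0; bin < n; ++bin) {  // naive DFT, one bin at a time
        cplx acc(0.0, 0.0);
        for (std::size_t k = 0; k < n; ++k)
            acc += dechirped[k] * std::polar(1.0, -2.0 * kPi * bin * k / n);
        if (std::abs(acc) > best_mag) {
            best_mag = std::abs(acc);
            best_bin = bin;
        }
    }
    return best_bin;
}

// Symbol values 0 and n give identical samples, so both demodulate to bin 0.
void check_wraparound(std::size_t n) {
    assert(fft_demod(make_chirp(n, 0.0)) == 0);
    assert(fft_demod(make_chirp(n, static_cast<double>(n))) == 0);
}
```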

Anyway, I could be wrong, but I tested your suggestion using the test suites. The accuracy for the short_usrp and short_hackrf test sets drops from 100% to 97.40% if I use max_index = 1 (see the deadbeef payloads, where error correction is disabled). Perhaps you are seeing an incorrect decoding because of a bit error in the reverse-engineered whitening sequence rather than in the gradient demodulation.

jkadbear commented 4 years ago

Sorry that I did not test my suggestion first. I tried the test suites and found that I could not solve this problem. With SF=7, CR=4/5 and BW=125kHz:

uint32_t max_index = 0; cannot decode 0x11, 0x10, 0x10, 0x01
uint32_t max_index = 1; cannot decode 0xde, 0xad, 0xbe, 0xef

Besides, there are some faults in your description.

The reason why these map to the same value is because you have 2^SF+1 possible values if you use the gradient to demodulate.

Suppose SF=2: we would have only 4 valid symbols, each representing 2 bits, as shown below.

[Figure: the 4 valid LoRa symbols for SF=2]

How could you get 2^SF+1=5 possible values using the gradient to demodulate?

When using LoRa's original FFT demodulation, you always have a peak but a peak at 0 and at 2^SF would similarly map to the same value of 0.

I think FFT demodulation gives bin range 0~2^SF-1 instead of 0~2^SF. The number of bins (when sampling frequency = BW) ought to be a power of 2. Bin 2^SF does not exist.

So could it be that the whitening sequence reverse-engineered with gradient demodulation is wrong?

rpp0 commented 4 years ago

Sorry that I did not test my suggestion first. I tried the test suites and found that I could not solve this problem.

No problem! I'm glad you're thinking along and helping to improve the decoder. Much appreciated. The decoder can certainly be improved.

How could you get 2^SF+1=5 possible values using the gradient to demodulate?

Perhaps my explanation was a bit confusing. I'm not claiming there are 2^SF+1 valid values to decode. Rather, the gradient decoding method can yield 2^SF+1 possible outcomes, of which 2^SF are valid (which is why two of them map to 0). For SF=2 you are correct that there are only 4 valid LoRa symbols, in the range [0, 3]. Let's calculate the gradient:

Case a: no gradient
Case b: gradient at x = 3
Case c: gradient at x = 2
Case d: gradient at x = 1

Suppose now for simplicity that if the gradient falls in bin [3, 4[, we will assign LoRa symbol value 3, in bin [2, 3[ we assign value 2, etc. Then note that there is also the case where a gradient occurs in bin [0, 1[. This can happen in practice if the signal is not perfectly aligned, and it is why this bin is mapped to the same value as case a, where there is no gradient at all. With your suggestion above, the value of case b would be returned instead.

I think FFT demodulation gives bin range 0-2^SF-1 instead of 0-2^SF. The number of bins (when sampling frequency = BW) ought to be a power of 2. Bin 2^SF does not exist.

I didn't say that you have a bin at 2^SF. What I am saying is that if you fully rotate a LoRa symbol, it is equivalent to (maps to the same value as) an unshifted symbol 0, just as a 360 degree rotation on a circle is equivalent to a 0 degree rotation.
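The circle analogy can be checked numerically. The sketch below assumes an idealized chirp with 2^SF samples per symbol at fs = BW (a textbook model, not gr-lora's code); make_chirp, rotate_symbol, normalized_correlation and check_rotation are hypothetical names. It shows that cyclically rotating the base chirp by r samples gives symbol r up to a constant phase, and that a rotation by the full 2^SF samples gives back the unshifted symbol.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;
const double kPi = std::acos(-1.0);

// Idealized upchirp for symbol value s, n = 2^SF samples at fs = BW.
std::vector<cplx> make_chirp(std::size_t n, std::size_t s) {
    std::vector<cplx> x(n);
    for (std::size_t k = 0; k < n; ++k) {
        const double kk = static_cast<double>(k);
        const double phase = kPi * kk * kk / n - kPi * kk + 2.0 * kPi * s * kk / n;
        x[k] = std::polar(1.0, phase);
    }
    return x;
}

// Cyclically rotate a symbol by r samples (r >= n wraps around the "circle").
std::vector<cplx> rotate_symbol(const std::vector<cplx>& x, std::size_t r) {
    const std::size_t n = x.size();
    std::vector<cplx> y(n);
    for (std::size_t k = 0; k < n; ++k) y[k] = x[(k + r) % n];
    return y;
}

// |<a, b>| / n: equals 1 when a and b differ only by a constant phase
// (all samples here have unit magnitude).
double normalized_correlation(const std::vector<cplx>& a, const std::vector<cplx>& b) {
    cplx acc(0.0, 0.0);
    for (std::size_t k = 0; k < a.size(); ++k) acc += a[k] * std::conj(b[k]);
    return std::abs(acc) / static_cast<double>(a.size());
}

void check_rotation(std::size_t n) {
    const std::vector<cplx> base = make_chirp(n, 0);
    for (std::size_t r = 0; r < n; ++r) {
        // Rotating the base chirp by r samples gives symbol r (up to a constant phase).
        assert(std::abs(normalized_correlation(rotate_symbol(base, r), make_chirp(n, r)) - 1.0) < 1e-9);
    }
    // A full rotation by n samples is the same as no rotation: symbol 0.
    assert(rotate_symbol(base, n) == base);
}
```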

jkadbear commented 4 years ago

Suppose now for simplicity that if the gradient falls in bin [3, 4[,

Bin 4 does not exist. If you mean bin 0, you have to calculate the gradient circularly, but the current gr-lora does not do this.

Let us use the gradient-method code in gr-lora to decode cases (a), (b), (c) and (d):

            float max_gradient = 0.1f;
            float gradient = 0.0f;
            uint32_t max_index = 0;
            for (uint32_t i = 1u; i < d_number_of_bins; i++) {
                gradient = samples_ifreq_avg[i - 1] - samples_ifreq_avg[i];
                if (gradient > max_gradient) {
                    max_gradient = gradient;
                    max_index = i+1;
                }
            }

            return (d_number_of_bins - max_index) % d_number_of_bins;

Note that i=1,2,3 and max_index=i+1. The return values are:

Case (a): (4 - 0) % 4 = 0
Case (b): (4 - (3+1)) % 4 = 0
Case (c): (4 - (2+1)) % 4 = 1
Case (d): (4 - (1+1)) % 4 = 2

So case (a) and case (b) have the same decoded value of 0?

rpp0 commented 4 years ago

Bin 4 does not exist.

With the notation [3, 4[ I indeed meant "excluding 4", sorry if this was not clear. This refers to bin 3. [0, 1[ refers to bin 0.

The return values are [snip]

You forgot case (e), where the gradient occurs in bin [0, 1[, as I mentioned in my previous comments. Here, (4 - (0+1)) % 4 = 3. Hopefully you now see why there are 5 possible cases: for the zero symbol, when there is a small offset of the symbol (due to desynchronization), its gradient can end up anywhere between bin 3 (case b) and bin 0 (case e). There will be a gradient, but it will still be a zero symbol. The other case is where the symbol is aligned and there is no gradient at all (case a).

If this is still not clear, I encourage you to plot the same graphs as in your example above, but instead of considering 1 sample per bin, take for example 8 samples per bin and assign the corresponding symbol values to the bins. Then desynchronize the LoRa chips by 2 samples and determine to which bin they belong. You will see that the zero symbol has a gradient.
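The experiment suggested here can be sketched in a toy form. The model is an assumption: the instantaneous frequency of a zero symbol is an integer ramp of n_bins * samples_per_bin values, and the symbol is surrounded by identical zero symbols (as in a preamble), so a timing offset becomes a circular shift. zero_symbol_ifreq, drop_bin and check_desync are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model: the instantaneous frequency of a zero symbol as a
// ramp of n_bins * samples_per_bin values, observed with a timing offset of
// `offset` samples (positive: the window starts late, negative: early).
// Neighbouring symbols are assumed to be zero symbols too, so the offset is
// modeled as a circular shift.
std::vector<int> zero_symbol_ifreq(int n_bins, int samples_per_bin, int offset) {
    const int len = n_bins * samples_per_bin;
    std::vector<int> f(len);
    for (int m = 0; m < len; ++m) f[m] = ((m + offset) % len + len) % len;
    return f;
}

// Bin in which the frequency drops (positive gradient), or -1 if it never
// drops. Like the original loop, this only inspects indices 1..len-1.
int drop_bin(const std::vector<int>& f, int samples_per_bin) {
    for (std::size_t m = 1; m < f.size(); ++m)
        if (f[m - 1] - f[m] > 0) return static_cast<int>(m) / samples_per_bin;
    return -1;
}

void check_desync() {
    // SF = 2 (4 bins), 8 samples per bin as suggested in the thread.
    assert(drop_bin(zero_symbol_ifreq(4, 8, 0), 8) == -1);  // aligned: no gradient (case a)
    assert(drop_bin(zero_symbol_ifreq(4, 8, 2), 8) == 3);   // 2 samples late: bin 3 (case b)
    assert(drop_bin(zero_symbol_ifreq(4, 8, -2), 8) == 0);  // 2 samples early: bin 0 (case e)
}
```

In this toy model the same zero symbol produces no gradient, a gradient in bin 3, or a gradient in bin 0 depending only on the direction of the offset.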

Considering the code, I think you are correct that in order to get the right gradient for case (e), the gradient should be calculated circularly. This would also remove the need to consider 2 separate cases for symbol 0. Right now it seems there will be a bit error if case (e) occurs. I probably didn't notice this bug because the error correction corrects it and because the decoder used to work slightly differently in the past. Anyway, I don't have time to test this right now (I'm a bit busy at work), but after fixing this and regenerating the whitening sequence, the accuracy might improve further. Thanks for reporting this!
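One possible reading of "calculate the gradient circularly" is sketched below, as a hypothetical function max_frequency_gradient_idx_circular with a hypothetical check helper. Bin 0 is compared with bin d_number_of_bins - 1, while the 0.1f threshold and the +1 offset are kept from the original snippet. It has not been run against the gr-lora test suites, and the toy SF=2 ramps are idealized.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical circular variant of max_frequency_gradient_idx: bin 0 is
// compared with bin n - 1, so a drop between the end and the start of the
// window (case e) is also found. Threshold and +1 offset as in the original.
uint32_t max_frequency_gradient_idx_circular(const std::vector<float>& samples_ifreq_avg) {
    const uint32_t n = static_cast<uint32_t>(samples_ifreq_avg.size());
    float max_gradient = 0.1f;
    uint32_t max_index = 0;
    for (uint32_t i = 0u; i < n; i++) {
        const uint32_t prev = (i + n - 1u) % n;  // wraps to n - 1 when i == 0
        const float gradient = samples_ifreq_avg[prev] - samples_ifreq_avg[i];
        if (gradient > max_gradient) {
            max_gradient = gradient;
            max_index = i + 1;
        }
    }
    return (n - max_index) % n;
}

void check_circular() {
    // Idealized SF=2 ramps: each input has exactly one circular drop, and the
    // four inputs decode to four distinct values.
    assert(max_frequency_gradient_idx_circular({0.0f, 1.0f, 2.0f, 3.0f}) == 3);  // drop at i = 0
    assert(max_frequency_gradient_idx_circular({1.0f, 2.0f, 3.0f, 0.0f}) == 0);  // drop at i = 3
    assert(max_frequency_gradient_idx_circular({2.0f, 3.0f, 0.0f, 1.0f}) == 1);  // drop at i = 2
    assert(max_frequency_gradient_idx_circular({3.0f, 0.0f, 1.0f, 2.0f}) == 2);  // drop at i = 1
}
```

The collision disappears because an ideal chirp always has one circular drop. Note, however, that the perfectly unshifted ramp now decodes to d_number_of_bins - 1. Whether that matches the intended symbol mapping depends on the +1 offset convention and would still need to be checked against the test suites and the regenerated whitening sequence.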

On a final note, Bernier et al. recently published a paper describing a synchronization algorithm for the FFT decoding method. I hope I can find the time to implement it, because it seems to be a much more stable method than what is currently implemented in gr-lora.

jkadbear commented 4 years ago

Yes, gradient-based alignment is not usable at low SNR (< 0 dB). Our team has also proposed an up-down alignment method that is likely what the LoRa chip actually implements. This method uses the SFD part (down chirps) of a LoRa packet. Maybe it is the same method as that of Bernier et al.

rpp0 commented 4 years ago

I still need to read their paper in detail, but yes, it might be similar. I tried to implement FFT-based decoding before but had some trouble understanding the fine frequency synchronization algorithm described in the patent. If you would be interested in adding your method to gr-lora, feel free to let me know. It would be nice to be able to decode signals with negative SNR :).

rpp0 / gr-lora

A bug in decoder implementation #99