rpp0 / gr-lora

GNU Radio blocks for receiving LoRa modulated radio messages using SDR
GNU General Public License v3.0

High packet error rate, same error vector almost every time #47

Closed g0hww closed 7 years ago

g0hww commented 7 years ago

Last night I was investigating a high packet error rate in the traffic received by lora_receive_realtime. I had been observing a PER of 30% with code prior to v0.6.2, and had been seeing a very predictable/repeatable error vector occurring in corrupted packets. After updating to v0.6.2 I am still seeing the same thing.

The traffic I am monitoring is beacon traffic between my pair of LoPys. The traffic is sent using sockets configured like so:

        self.lora = LoRa(mode=LoRa.LORA, tx_power = config.tx_dbm,
                         bandwidth=LoRa.BW_125KHZ,
                         sf = 8, preamble=12, coding_rate=LoRa.CODING_4_7,
                         frequency=config.lora_freq)
        self.lora_sock = socket.socket(socket.AF_LORA, socket.SOCK_RAW)

This is in the 868MHz band. TX Power is 2dBm. Peak Rx signal is -30dBFS on HackRF One, using a loft-mounted discone antenna and 20dB mast-head preamp. Both LoPys are in the house below the discone antenna. There is about 6-10dB difference in signal strength of the two LoPys, as seen by lora_receive_realtime. The noise floor is about 35dB below the peaks of the weaker LoPy. The channel is reasonably clear and I would expect a very low PER. In fact the pair of LoPys very, very rarely report reception of corrupted traffic. To summarise, the channel is not challenged.

FYI, I've started to use the "WX fosphor sink" in lora_receive_realtime; it really is nice for a good visualisation of the LoRa waveform. Check out the keyboard shortcuts in the docs pane of the properties dialog.

The traffic is normally HMAC-MD5 + IV + CIPHERTEXT, but for experimental purposes I have switched to HMAC-MD5 + b'\x00'*(16+31). In my code that listens to the output of the message socket sink in lora_receive_realtime, when the HMAC fails, I XOR the known message payload with the received message payload (i.e. everything but the first 3 and last 2 bytes) and print the error vector. It is almost always the same (the rare exceptions being attributable to transient channel state changes).
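The XOR comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual monitoring code: the names `error_vector`, `PREFIX_LEN` and `CRC_LEN` and the sample frame are hypothetical, and only the 3-byte prefix and 2-byte suffix come from the description.

```python
import binascii

PREFIX_LEN = 3  # bytes preceding the payload, per the description above
CRC_LEN = 2     # trailing CRC bytes, per the description above


def error_vector(expected_payload, received_frame):
    """XOR the known payload with the payload portion of a received frame."""
    received_payload = received_frame[PREFIX_LEN:len(received_frame) - CRC_LEN]
    if len(received_payload) != len(expected_payload):
        raise ValueError("payload length mismatch")
    # A zero byte means the byte was received correctly; set bits mark errors.
    return bytes(a ^ b for a, b in zip(expected_payload, received_payload))


# Hypothetical frame: 3-byte prefix, 4-byte payload with one flipped bit, 2-byte CRC.
expected = b"\x00\x00\x00\x00"
frame = b"\x04\x00\x00" + b"\x00\x80\x00\x00" + b"\xaa\xbb"
vec = error_vector(expected, frame)
print(binascii.hexlify(vec))
```

Comparing the hex string of each new vector with the previous one is then enough to count repeats of the same pattern.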

Here is a snippet of what my code reports:

2017-09-15 14:27:21.005451*** TEST VECTOR *** (*** AUTHENTICATED ***)
2017-09-15 14:27:29.132665 - Rx: HMAC failed! len: 66
*** ERROR PATTERN REPEAT *** 2563
Error vector: b'000000000000000000000000000000000000000000808000000000000000000000000000000000000000000000000000000000000000000000000000000000'
Error vector match rate: 84.83945713339953%
Packet Loss Rate:0.2986358244365362

This shows one test message being received and passing HMAC checks. The next packet fails HMAC verification. It has the expected payload length (I ignore the length in the first byte of the 3 byte preamble, and now also the 2 byte CRC suffix). The code then claims that the calculated error vector is the same as the previous 2563 error vectors. The error vector match rate really should show 99.99% as there had only been one error vector that did not match this pattern and that reset a counter used in my naff stats calculation.

Anyway, at byte 22 in the error vector, the 9 bit long error burst 0b100000001 (0x808) occurs. There may be errors in the 3 byte preamble (and now the 2 byte CRC that I'm not seeing).
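A small sketch of how such a burst could be located in an error vector, assuming bits are numbered MSB-first within each byte and bytes are counted from zero. The helper names and the sample vector are hypothetical and are not taken from the log above.

```python
def error_bits(vec):
    """Return the MSB-first bit positions of every set bit in the vector."""
    return [8 * i + b
            for i, byte in enumerate(vec)
            for b in range(8)
            if byte & (0x80 >> b)]


def burst_span(vec):
    """Return (first_bit, last_bit, length_in_bits), or None if there are no errors."""
    bits = error_bits(vec)
    if not bits:
        return None
    return bits[0], bits[-1], bits[-1] - bits[0] + 1


# Hypothetical vector: clean bytes, then 0x80 0x80, then clean bytes again.
sample = bytes(21) + b"\x80\x80" + bytes(40)
first, last, length = burst_span(sample)
print("bytes", first // 8, "to", last // 8, "burst length", length, "bits")
```

Two adjacent 0x80 bytes give set bits exactly eight positions apart, which is why the burst spans 9 bits with only its first and last bit in error.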

I'm not sure if there is anything else I can add, other than perhaps recording some baseband for you to try and reproduce this.

I have no idea if this might be related, but Murphy's law probably means that the same thing will go wrong in all of the gnuradio software that I use ;P http://destevez.net/2017/07/degradation-bug-in-gnu-radio-decode-ccsds-27/

Thanks for your work on this, BTW.

P.S. I put my HackRF One inside a metal chassis in case there was anything funny going on with the LoRa gadgets being closer in proximity to the HackRF than its antenna system. That hasn't made any difference to the PER, but has quietened the spectrum a bit :)

g0hww commented 7 years ago

As an aside, do you know the CRC poly, etc. and scope? I have tried these and don't get a match, for a scope including (which I think is correct) or excluding the 3 byte prefix:

        crc16_ccitt = crcmod.Crc(0x11021, initCrc=0x1D0F, rev=True, xorOut=0x0000)
        crc16_ibm = crcmod.Crc(0x18005, initCrc=0xFFFF, rev=False, xorOut=0x0000)
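A search over candidate CRC-16 parameters and both scopes could be sketched as below, in plain Python so it runs without crcmod. The four parameter sets are common published CRC-16 variants chosen purely for illustration, with no claim that any of them matches the LoRa payload CRC. The function names, the candidate table and the test frame are all hypothetical.

```python
def crc16(data, poly, init, reflect, xor_out=0x0000):
    """Bitwise CRC-16. With reflect=True, input and output are bit-reflected.

    Note: for reflect=True the init value is used as-is, which is only
    correct for bit-symmetric values such as 0x0000 or 0xFFFF.
    """
    crc = init
    if reflect:
        rpoly = int("{:016b}".format(poly)[::-1], 2)  # bit-reversed polynomial
        for byte in data:
            crc ^= byte
            for _ in range(8):
                crc = (crc >> 1) ^ rpoly if crc & 1 else crc >> 1
    else:
        for byte in data:
            crc ^= byte << 8
            for _ in range(8):
                crc = ((crc << 1) ^ poly) if crc & 0x8000 else (crc << 1)
                crc &= 0xFFFF
    return crc ^ xor_out


# Illustrative candidates only (assumption: the CRC is one of the usual suspects).
CANDIDATES = {
    "CCITT-FALSE": dict(poly=0x1021, init=0xFFFF, reflect=False),
    "XMODEM": dict(poly=0x1021, init=0x0000, reflect=False),
    "KERMIT": dict(poly=0x1021, init=0x0000, reflect=True),
    "ARC": dict(poly=0x8005, init=0x0000, reflect=True),
}


def find_matches(frame, prefix_len=3, crc_len=2):
    """Try every candidate over both scopes and both byte orders."""
    body, tail = frame[:-crc_len], frame[-crc_len:]
    scopes = {"with prefix": body, "without prefix": body[prefix_len:]}
    hits = []
    for cname, params in CANDIDATES.items():
        for sname, scope in scopes.items():
            crc = crc16(scope, **params)
            for order in ("big", "little"):
                if crc.to_bytes(2, order) == tail:
                    hits.append((cname, sname, order))
    return hits


# Hypothetical frame whose trailer is CRC-16/CCITT-FALSE over the payload only.
payload = b"123456789"
frame = b"\x01\x02\x03" + payload + crc16(payload, 0x1021, 0xFFFF, False).to_bytes(2, "big")
print(find_matches(frame))
```

An empty result for a real capture would suggest that the trailer is either not a plain CRC-16 of this family or has been transformed (for example by whitening) before transmission.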

rpp0 commented 7 years ago

Hi, thanks so much for your detailed analysis (and cool setup btw ;))! My guess is that this is a clock drift correction problem somewhere in the middle of the packet. There are a number of things that can cause this:

Personally, I would try the first option first and see how much your CPU can handle. If this doesn't solve your problem, it would be great if you could provide me with a .cfile of a signal that contains the error vector. I could take a look at the debug output and perhaps include the .cfile in the developer test suite for future regression tests.

> Anyway, at byte 22 in the error vector, the 9 bit long error burst 0b100000001 (0x808) occurs. There may be errors in the 3 byte preamble (and now the 2 byte CRC that I'm not seeing).

I think an error in the header is unlikely, because this would corrupt the whole packet significantly.

> I have no idea if this might be related, but Murphy's law probably means that the same thing will go wrong in all of the gnuradio software that I use ;P

Haha :), interesting blog post. Seems like that was a hard one to track down!

> As an aside, do you know the CRC poly, etc. and scope?

Yes, I have recently reverse engineered the CRC of the header and payload, but the payload CRC has a different xorout for every position. This most likely means that the CRC is not whitened, while it is currently dewhitened in gr-lora. I will fix this in the next version, but it requires either some architectural changes or a messy fix, so might take some time. I'm also more busy with other things next week, but thanks again for experimenting and reporting these things! This is really helpful.
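The reasoning behind a position-dependent xorout can be illustrated with a toy model. If the transmitter leaves the CRC unwhitened but the receiver XORs the whitening sequence over it anyway, the recovered CRC differs from the true one by whichever whitening bytes happen to sit at the CRC's position, and that position moves with the payload length. The LFSR below is a made-up stand-in and is NOT the real LoRa whitening sequence; all names are hypothetical.

```python
def toy_whitening(n, state=0xFF):
    """Hypothetical 8-bit LFSR byte sequence; not the real LoRa whitening."""
    out = []
    for _ in range(n):
        out.append(state)
        feedback = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | feedback) & 0xFF
    return bytes(out)


def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def crc_after_wrong_dewhitening(payload_len, true_crc):
    """CRC bytes as seen when the receiver dewhitens an unwhitened CRC."""
    seq = toy_whitening(payload_len + len(true_crc))
    return xor_bytes(true_crc, seq[payload_len:])


true_crc = b"\x12\x34"
for payload_len in (10, 11, 12):
    seen = crc_after_wrong_dewhitening(payload_len, true_crc)
    # The apparent xorout is simply the whitening bytes at the CRC position.
    print(payload_len, xor_bytes(seen, true_crc).hex())
```

In this model the apparent xorout changes with every payload length even though the underlying CRC is fixed, which is the symptom described above.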

g0hww commented 7 years ago

Just a quick note to say that I have tried a few things and have experienced a substantial improvement in packet loss rate.

I had already noticed before posting the issue that I hadn't configured a ppm offset for the HackRF One, and found that adding the ppm offset of 5.0 aligned the rx sigs nicely but had no effect on the PLR.
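For a rough sense of scale, a ppm error can be converted to a frequency offset and compared with the width of one LoRa symbol bin (bandwidth divided by 2^SF). This sketch assumes a centre frequency of exactly 868.0 MHz, since the actual value of config.lora_freq is not given; the function names are hypothetical.

```python
def ppm_to_hz(center_hz, ppm):
    """Frequency offset caused by an oscillator error of `ppm` parts per million."""
    return center_hz * ppm * 1e-6


def lora_bin_width_hz(bandwidth_hz, sf):
    """Spacing between adjacent LoRa symbol values after dechirping."""
    return bandwidth_hz / (2 ** sf)


center_hz = 868.0e6  # assumption: exact centre frequency is not stated
offset_hz = ppm_to_hz(center_hz, 5.0)
bin_hz = lora_bin_width_hz(125e3, 8)  # BW 125 kHz, SF 8, as configured above
print(offset_hz, bin_hz, offset_hz / bin_hz)
```

Under these assumptions 5 ppm is about 4.3 kHz, or roughly nine bins of about 488 Hz each, which is consistent with the correction visibly shifting the received signals.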

I tried increasing the internal_sampling_rate to 1M, but it didn't seem to help. I decided to try the USRP b100/wbx again, instead of the HackRF, but found no improvement. I experimented with some higher sample rates, observed some 'O' overflows for the first time and set the flowgraph for Realtime Scheduling, with 1M internal sample rate. I noticed 0 PLR for about an hour and then switched back to the HackRF. I never saw a single HMAC failure in a couple of hours of observation, so I left it running overnight. The results I saw this morning were:

Pkts recvd: 7506
Pkts HMAC pass: 7505
Pkts HMAC fail: 1
Pkt loss rate: 0.00013
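The reported loss rate is consistent with HMAC failures divided by packets received. A quick check, assuming that is how the figure is defined (the function name is hypothetical):

```python
def packet_loss_rate(received, failed):
    """Fraction of received packets that failed HMAC verification."""
    if received <= 0:
        raise ValueError("no packets received")
    return failed / received


received, passed, failed = 7506, 7505, 1
plr = packet_loss_rate(received, failed)
print(round(plr, 5))
```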

Also note that this is with a test payload of b'\x5A'*(230) preceded by the HMAC-MD5. This is getting on for 4 times the payload size that I originally reported excessive loss with.

I'll continue to experiment to see if I can determine exactly what solved the issue. I have also ordered some RP-SMA to SMA adapters, so I can try hooking a LoPy up directly to the SDR via some attenuators.

rpp0 commented 7 years ago

> I experimented with some higher sample rates, observed some 'O' overflows for the first time and set the flowgraph for Realtime Scheduling, with 1M internal sample rate.

I forgot to mention that the internal sample rate should be >= the sample rate of the SDR, but perhaps this was already the case? Anyway, glad you managed to get the PLR to 0.00013! That's a great result.

For the CRC I need to take back what I said. I noticed a very low entropy in the header CRC and it turns out that, contrary to what the datasheets state, the CRC is in fact a simple checksum that seemingly XORs random bits together. There was something strange going on with the payload CRC as well, and I couldn't figure it out. But it seems the guys at LimeSDR managed to find it. What strikes me is that the comment says "CRC reverse engineered from Sx1272 data stream". I wonder how they achieved this, as the CRC is XORed with another random LFSR and the payload, which makes it a difficult task. Perhaps one of the devs has insider information / access to hardware schematics? Anyway, I will ask whether I can use their code for the payload CRC.

rpp0 commented 7 years ago

I suppose this issue can be closed. Feel free to reopen should this not be the case.