mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License

LMIC asserts after too long trying to join #1

Closed: terrillmoore closed this issue 7 years ago

terrillmoore commented 7 years ago

If you run the lmic code for long enough while configured for US915 without being able to join, it will assert: FAILURE {somepath}/src/lmic/radio.c:452

Analysis of the code shows that it's failing because radio.c thinks it's been asked to transmit on FSK, and things are not set up for FSK (since there's no FSK here).

The join code continually lowers the data rate on every join failure.

I speculate that lowerDR() (from lorabase.h) is getting fooled by the data-rate lowering code into choosing an invalid/untested path, due to something from the conditional compiles.

terrillmoore commented 7 years ago

The problem appears to be that the lmic code increments LMIC.txCnt, and doesn't clear it until there's a success. This is OK, except for the following line in the US915 code:

s1_t dr = DR_SF7 - ++LMIC.txCnt;
if( dr < DR_SF10 ) {
    dr = DR_SF10;
    failed = 1; // All DR exhausted - signal failed
}

Since LMIC.txCnt is not reset during a join and therefore grows indefinitely, sooner or later the signed subtraction overflows; indeed, the failure happens after 128 prints of "EV_JOIN_FAILED". This also explains the fairly random choice of CRs and the sporadic prints of EV_JOIN_FAILED after the first one.
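To make the arithmetic concrete, here is a minimal standalone sketch that replays the quoted computation with a counter that is never reset. It is not code from the library: the enum values (DR_SF10 = 0, DR_SF7 = 3), the 8-bit typedefs, and the helper first_escaping_attempt() are assumptions for illustration, and the narrowing conversion to a signed 8-bit value is implementation-defined in C (it wraps on typical two's-complement compilers).

```c
#include <assert.h>
#include <stdint.h>

/* Assumed US915 values, not copied from the library. */
enum { DR_SF10 = 0, DR_SF7 = 3 };

typedef int8_t  s1_t;  /* assumption: signed 8-bit, as the name suggests */
typedef uint8_t u1_t;  /* assumption: unsigned 8-bit counter type */

/* Hypothetical helper: repeats the quoted subtraction without ever
 * resetting txCnt. Returns the attempt number at which dr first escapes
 * the clamp, and reports that dr and the number of failures signalled
 * before it. Returns -1 if dr never escapes. */
static int first_escaping_attempt(int *bad_dr, int *failures_before)
{
    assert(bad_dr != 0 && failures_before != 0);

    u1_t txCnt = 0;
    int failures = 0;

    for (int attempt = 1; attempt <= 255; ++attempt) {
        /* Same expression as the quoted line; the cast makes the
         * narrowing to 8 bits explicit. */
        s1_t dr = (s1_t)(DR_SF7 - ++txCnt);

        if (dr < DR_SF10) {
            /* Clamp path: "All DR exhausted - signal failed". */
            failures++;
            continue;
        }
        if (dr > DR_SF7) {
            /* The difference wrapped to a large positive value,
             * so the clamp no longer catches it. */
            *bad_dr = dr;
            *failures_before = failures;
            return attempt;
        }
    }
    return -1;
}
```

Under those assumptions, attempts 4 through 131 take the clamp path (128 failures), and attempt 132 yields 3 - 132 = -129, which wraps to 127, a data rate that does not exist. That matches the observed 128 prints of "EV_JOIN_FAILED" before the assert.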

The solution appears to be to duplicate some of the logic from initJoinLoop() in onJoinFailed(), at least in the US915 version: 1) set LMIC.adrTxPow to 20; 2) call setDrJoin(DRCHG_SET, DR_SF7).

And also reset LMIC.txCnt to zero.
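A rough sketch of that reset, using stand-ins rather than the real library state: struct join_state, set_dr_join(), and reset_after_join_failure() are hypothetical names introduced here, and the enum values are assumed. In the library itself these would be the LMIC fields and setDrJoin() mentioned above.

```c
#include <assert.h>
#include <stdint.h>

enum { DR_SF10 = 0, DR_SF7 = 3 };  /* assumed US915 values */

/* Hypothetical stand-in for the few LMIC fields named in this comment. */
struct join_state {
    uint8_t txCnt;     /* join attempt counter */
    int8_t  adrTxPow;  /* transmit power used for joining */
    uint8_t datarate;  /* current join data rate */
};

/* Hypothetical stand-in for setDrJoin(DRCHG_SET, dr):
 * here it only records the requested data rate. */
static void set_dr_join(struct join_state *s, uint8_t dr)
{
    s->datarate = dr;
}

/* Restore the starting conditions after a join failure so the
 * next round of attempts begins from a known state. */
static void reset_after_join_failure(struct join_state *s)
{
    assert(s != 0);
    s->txCnt = 0;            /* stop the counter from growing without bound */
    s->adrTxPow = 20;        /* same power the comment proposes */
    set_dr_join(s, DR_SF7);  /* start again from the highest join data rate */
}
```

Resetting the counter is the part that removes the overflow; the other two assignments only restore the initial join settings.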

terrillmoore commented 7 years ago

According to LoRaWAN Regional Specs for US915 (page 12, section 2.2.2, line 28):

If using the over-the-air activation procedure, the end-device should broadcast the JoinReq message alternatively on a random 125 kHz channel amongst the 64 channels defined using DR0 and a random 500 kHz channel amongst the 8 channels defined using DR4. The end device should change channel for every transmission.

The LMIC code doesn't do this. It lowers the data rate from SF7/125 (DR3) to SF10/125 (DR0) on each channel change, then declares a failure after getting to DR0 without a successful join. The "Join failed" indication is an artifact of this code; there's no requirement in the spec to report a join failure to the application.

So in fact, we should change the code to always use DR_SF10 for joining, and that will take out the logic that is decrementing the data rate and causing the overflow. Since join doesn't change adrTxPow, there's no need to reset it.
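For comparison, the channel rule quoted from the regional spec could be sketched roughly as follows. This is an illustration of the quoted text only, not the change made in the fix commits: struct join_tx and pick_join_tx() are hypothetical, the channel indices 0..63 (125 kHz) and 64..71 (500 kHz) are an assumed numbering, and rand() is a placeholder for whatever random source the firmware has.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { DR0 = 0, DR4 = 4 };  /* data-rate indices named in the quoted spec text */

/* Hypothetical result type: which channel and data rate to use. */
struct join_tx {
    uint8_t channel;  /* assumed numbering: 0..63 = 125 kHz, 64..71 = 500 kHz */
    uint8_t dr;
};

/* Hypothetical helper: even attempts pick a random 125 kHz channel at DR0,
 * odd attempts pick a random 500 kHz channel at DR4. The data rate depends
 * only on the parity of the attempt, so no counter is subtracted from it. */
static struct join_tx pick_join_tx(unsigned attempt)
{
    struct join_tx tx;

    if ((attempt & 1u) == 0) {
        tx.channel = (uint8_t)(rand() % 64);      /* one of 64 channels */
        tx.dr = DR0;
    } else {
        tx.channel = (uint8_t)(64 + rand() % 8);  /* one of 8 channels */
        tx.dr = DR4;
    }

    assert(tx.channel < 72);
    return tx;
}
```

Because the data rate is chosen from a fixed pair rather than computed from an ever-growing counter, the attempt number can wrap freely without producing an invalid data rate.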

terrillmoore commented 7 years ago

Fixed on master by https://github.com/mcci-catena/arduino-lmic/commit/fc9494c4f3693de5e27faa94c52ce82e37122156. Fixed originally by https://github.com/mcci-catena/arduino-lmic/commit/18a05f1acc931e22ba55def5ac573be3d786229e