mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License
650 stars 212 forks source link

How to set non standard RX2 setting for TTN with LMIC #455

Closed cyberman54 closed 5 years ago

cyberman54 commented 5 years ago

With one of latest commits after Sep 9th 2019 joining on EU868 (tested with TTNv2) is broken. Event "Join Wait" is endlessly reported. It seems the join accept packet by the gateway is somehow not properly processed, could be a timing or a logic problem.

Set back to commit #76f7bd5 clearly solves this issue for me.

So, i assume this issue came in with one of the commits later Sep 9th.

terrillmoore commented 5 years ago

Sorry you're having problems.

Are you by any chance using LMIC_setClockError()?

The change right after this (42da75b560c7f3fc4ec75b35da6864906dd1981c) radically changed the timing for SF12, which is critical for EU868 join. It turns out that the reason we have needed LMIC_setClockErorr is that it works around the error in the code. If you check #442, you'll see a description.

RedwoodComm and I have tested this extensively (on different hardware of course) and confirmed that the change greatly improves performance, but required that I remove the LMIC_setClockError hack. If this doesn't work for you, I may have to get one of your boards to see what is going on.

cyberman54 commented 5 years ago

I am not using LMIC_setClockError() in my code any more, after i switched to timestamping by interrupts (this was my PR). So, this issue probably is not triggered by LMIC_setClockError(). To check this out, i disabled interrupts and enabled LMIC_setClockError(), same result, no join.

I am using different ESP32 boards for my application. I don't think this is hardware dependent. But using ESP32 means running LMIC in an RTOS environment. I assume that this is causing subtile timing effects. So this likely is an application problem. But question is, what after #76f7bd5 did trigger this issue, because MCCI LMIC #76f7bd5 is working stable and smoothlessly in the ESP32 environment.

terrillmoore commented 5 years ago

My guess is that 42da75b broke things for you. You could also use git bisect to try to narrow down the problem commit. It could be something else I did. Ideally we'd get me a repro case. I'm traveling and so it will be hard for hardware to catch up with me; but if you can find something suitable available on Amazon, send me a link. Something with a U.FL or SMA is best for testing with the analyzer. I can get it and try to test. I've got my lab with me.

cyberman54 commented 5 years ago

I will do some timing experiments by changing task priorities in my application. Will keep you posted here.

terrillmoore commented 5 years ago

If you edit config.h, very end, turn on the logging option, and then run the compliance sketch, I will get detailed information about what's going wrong. You'll have to edit your keys for the compliance sketch, but I know from experience a few months ago that deveui=1/appeui=2/appkey=1 will prompt a JoinAccept from TTN. (That's why I use deveui=1/appeui=1/appkey=2 by default). The two receives will fail, but I'll get a lot more info.

If the compliance sketch works, but your sketch doesn't, the next step would be to port the logging code to your app and get the same printouts.

cyberman54 commented 5 years ago

Thanks for your hints. I reversed commit https://github.com/mcci-catena/arduino-lmic/commit/42da75b560c7f3fc4ec75b35da6864906dd1981c but this did not affect the join behaviour.

Meanwhile i can't reproduce the issue with current https://github.com/mcci-catena/arduino-lmic/commit/852f3489b6c9d016a1c9821833a595a9ca2aef71 . Maybe the latest commits made this happen, or it was no LMIC issue, but a TTN issue. Currently i don't see any JOIN ACCEPT messages in TTN console, thus i assume the software version on TTN site is not constant.

DeveloppeurPascal commented 5 years ago

Hi. I have the same issue on a big program. It worked two days ago, I don't remember what I changed, but now EV_JOIN_TXCOMPLETE and no EV_JOINED event. But the LoRaWan sample program works fine, so I think it's a delay problem somewhere in our code (or something LMIC don't intercept). For me it's not with TTN but LiveObjects (from Orange).

terrillmoore commented 5 years ago

Nothing has changed in the LMIC code in the last two days. Which LoRaWAN sample program do you mean?

cyberman54 commented 5 years ago

@developpeurpascal which platform? ESP32, ... ?

cyberman54 commented 5 years ago

Problem arised here again. It looks erratic, sometimes join works, sometimes join wait. I need to setup a reference environment to track this down finally.

cyberman54 commented 5 years ago

@terrillmoore i got the compliance script running stand alone on an ESP32 system with arduino-esp32 core (latest version 1.0.4).

Settings of lmic_project_config.h used:

// project-specific definitions

define CFG_eu868 1

//#define CFG_us915 1 //#define CFG_au921 1 //#define CFG_as923 1 // #define LMIC_COUNTRY_CODE LMIC_COUNTRY_CODE_JP / for as923-JP / //#define CFG_kr920 1 //#define CFG_in866 1

define CFG_sx1276_radio 1

//#define LMIC_USE_INTERRUPTS

Here is the resulting log, showing a failed join. I used the default keys you noted above. Can you see what's going wrong?

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3fff0018,len:4
load:0x3fff001c,len:1044
load:0x40078000,len:8896
load:0x40080400,len:5828
entry 0x400806ac

------------------------------------------------------------------------------
main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select 'Line Ending: Newline' at the bottom of the monitor window.
------------------------------------------------------------------------------

calibration not supported
Packet queued
405018 (6480 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0
405033 (6480 ms): EV_JOINING
539562 (8632 ms): EV_TXSTART: ch=2 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=539660, avail=0
543460 (8695 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
855194 (13683 ms): EV_RXSTART: freq=868.5 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=543455, avail=0, delta ms 4987
856876 (13710 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
920254 (14724 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=543455, avail=0, delta ms 6028
932690 (14923 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
932695 (14923 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80
4835448 (77367 ms): EV_TXSTART: ch=0 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=4835553, avail=0
4839340 (77429 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
5151074 (82417 ms): EV_RXSTART: freq=868.1 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=4839336, avail=0, delta ms 4987
5152757 (82444 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
5216134 (83458 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=4839336, avail=0, delta ms 6028
5228569 (83657 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
5228573 (83657 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80
9279294 (148468 ms): EV_TXSTART: ch=1 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=9279398, avail=0
9283185 (148530 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
9594919 (153518 ms): EV_RXSTART: freq=868.3 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=9283181, avail=0, delta ms 4987
9596602 (153545 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
9659979 (154559 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=9283181, avail=0, delta ms 6028
9672414 (154758 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
9672418 (154758 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80
14154216 (226467 ms): EV_TXSTART: ch=2 rps=0x02 (SF8 BW125 CR 4/5 Crc IH=0), datarate=4, opmode=88C, txend=14154321, avail=0
14161324 (226581 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
14473186 (231570 ms): EV_RXSTART: freq=868.5 rps=0x82 (SF8 BW125 CR 4/5 NoCrc IH=0), datarate=4, opmode=88C, txend=14161320, avail=0, delta ms 4989
14475125 (231602 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
14538118 (232609 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=4, opmode=88C, txend=14161320,
avail=0, delta ms 6028
14550554 (232808 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
14550557 (232808 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80
cyberman54 commented 5 years ago

@terrillmoore I repeated the test with valid TTNv2 keys. Here are the results for a successful join, as well an (yet) unsuccessful "JOIN WAIT" join:

successful Join:

------------------------------------------------------------------------------
main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select 'Line Ending: Newline' at the bottom of the monitor window.
------------------------------------------------------------------------------

calibration not supported
Packet queued
405129 (6482 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0
405145 (6482 ms): EV_JOINING
716466 (11463 ms): EV_TXSTART: ch=2 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=716565, avail=0
720363 (11525 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
1032098 (16513 ms): EV_RXSTART: freq=868.5 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=720359, avail=0, delta ms 4987
1037401 (16598 ms): radio_irq_handler_v2: LoRa, datum=0x40. opmode=88C
1037464 (16599 ms): EV_JOINED: ch=2
netid: 19
devaddr: 2601xxxx
artKey: <erased>
nwkKey: <erased>
1037497 (16599 ms): EV_TXSTART: ch=0 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=888, txend=1037445, avail=0
1040747 (16651 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=888
1102482 (17639 ms): EV_RXSTART: freq=868.1 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=888, txend=1040744, avail=0, delta ms 987
1104166 (17666 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=888
1165367 (18645 ms): EV_RXSTART: freq=869.5 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=888, txend=1040744, avail=0, delta ms 1993
1167818 (18685 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=888
1167826 (18685 ms): EV_TXCOMPLETE: ch=0 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), txrxFlags=0x20, FcntUp=0001, FcntDn=0000, txend=1040744, avail=1199075
Packet queued
1480326 (23685 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=900

unsuccesful Join / "JOIN WAIT":

------------------------------------------------------------------------------
main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select 'Line Ending: Newline' at the bottom of the monitor window.
------------------------------------------------------------------------------

calibration not supported
Packet queued
405149 (6482 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0
405166 (6482 ms): EV_JOINING
757355 (12117 ms): EV_TXSTART: ch=1 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=757454, avail=0
761253 (12180 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
1072987 (17167 ms): EV_RXSTART: freq=868.3 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=761248, avail=0, delta ms 4987
1074670 (17194 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
1138047 (18208 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=761248, avail=0, delta ms 6028
1150482 (18407 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
1150488 (18407 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80
cyberman54 commented 5 years ago

but if you can find something suitable available on Amazon, send me a link. Something with a U.FL > or SMA is best for testing with the analyzer. I can get it and try to test. I've got my lab with me.

@terrillmoore this one would be suitable and available from amazon stock in USA.

terrillmoore commented 5 years ago

I see a problem. Pre-join, the LMIC is using the spec-compliant RX2==SF12 configuration. But TTN uses a non-standard RX2==SF9 configuration (see here).

I wonder (since this comes and goes) whether TTN is sometimes using RX1 for join accept, and sometimes using RX2. This would be based on congestion and other metrics, and it would not be surprising if this decision is made in advance for the next hour or so. When network is using RX1, join would work; when network is using RX2, join would not work.

(The reason for TTN choosing to run the network this way was to add capacity, and take better advantage of the channel for RX2; it doesn't have as strict a duty cycle limit as the RX1 channel.)

If so, we need to find a place to add LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW); LMIC.dn2dr = EU868_DR_SF9); right after initialization (i.e. after the call to LMIC_reset() but before triggering the join).

terrillmoore commented 5 years ago

@cyberman54

this one would be suitable and available from amazon stock in USA.

Yes,. I have something very similar (the one that isn't stable, mentioned in #463). Maybe you can post your pin-map, BSP, and board-ID in the Arduino IDE, and I can compare to what I'm using? Then, if it all looks the same, I'll order a board.

cyberman54 commented 5 years ago

@terrillmoore You nailed it! :-)

I added LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW) right after LMIC_reset() to the LMIC stack initialization of my application code. The annoying JOIN WAITs immediately disappeared.

Now the question is, how to properly deal with this non standard TTN setting for EU868 in the LMIC library?

cyberman54 commented 5 years ago

Hmm, looks like i was too optimistic. After a series of trials with two different ESP32 boards i still see some JOIN WAITs, but it seems that probability of successful joins is now increased.

terrillmoore commented 5 years ago

Please send logs as before. Also, look at the TTN console gateway log to see if they really sent the join accept through your gateway. If the network is busy, they might not send them right away. When I've had many devices on the same gateway joins took a while even with the old code. EU should be worse than the US, because of the lower downlink capacity.

cyberman54 commented 5 years ago

@terrillmoore I still see the same issue, joins sometimes fail. I tested with two different gateways (MatchX 1701 + TTIG) and two different ESP32 based boards. Results are the same. After power cycling the gateway, the next joins are immediately successful. But after a while the "JOIN WAIT" problem recurs. See logs enclosed.

To enforce SF9 for TTNv2 (EU868) i modified setupForNetwork() in the compliance script:

void setupForNetwork(bool preJoin)
{
#if CFG_LMIC_US_like
    LMIC_selectSubBand(0);
#endif

#if CFG_LMIC_EU_like
    LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW);
#endif
}

=== Log 1: unsuccessful join ===

1. Gateway Join Request

{ "gw_id": "eui-40d63cfffe0xxxxx", "payload": "APOFANB+1bNw//7Nie4t5rQ3AX/yefk=", "dev_eui": "B4E62DEE89Cxxxxx", "lora": { "spreading_factor": 9, "bandwidth": 125, "air_time": 205824000 }, "coding_rate": "4/5", "timestamp": "2019-10-06T10:36:56.763Z", "rssi": -46, "snr": 12.8, "app_eui": "70B3D57ED00xxxxx", "frequency": 868100000 }

2. Gateway Join Accept

{ "gw_id": "eui-40d63cfffe0xxxxx", "payload": "INwb982r+yQ6003Su3FbzSKfAHdaU+oI0CYozRZz+RNC", "lora": { "spreading_factor": 9, "bandwidth": 125, "air_time": 246784000 }, "coding_rate": "4/5", "timestamp": "2019-10-06T10:37:00.765Z", "frequency": 868100000 }

3. LMIC compliance script log

------------------------------------------------------------------------------
main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select 'Line Ending: Newline' at the bottom of the monitor window.
------------------------------------------------------------------------------

calibration not supported
Packet queued
405206 (6483 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=800
405223 (6483 ms): EV_JOINING
405253 (6484 ms): EV_TXSTART: ch=0 rps=0x03 (SF9 BW125 CR 4/5 Crc IH=0), datarate=3, opmode=88C, txend=405213, avail=0
418253 (6692 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
730377 (11686 ms): EV_RXSTART: freq=868.1 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=88C, txend=418154, avail=0,
delta ms 4995
732752 (11724 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
795002 (12720 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=88C, txend=418154, avail=0, delta ms 6029
807439 (12919 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C
807565 (12921 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80

=== Log 2: immediately successful join after gateway reset ===

------------------------------------------------------------------------------
main.cpp
LMIC version 3.0.99.5 configured for region 1.
Remember to select 'Line Ending: Newline' at the bottom of the monitor window.
------------------------------------------------------------------------------

calibration not supported
Packet queued
405190 (6483 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=800
405207 (6483 ms): EV_JOINING
405237 (6483 ms): EV_TXSTART: ch=0 rps=0x03 (SF9 BW125 CR 4/5 Crc IH=0), datarate=3, opmode=88C, txend=405198, avail=0
418253 (6692 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C
730377 (11686 ms): EV_RXSTART: freq=868.1 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=88C, txend=418139, avail=0,
delta ms 4995
746252 (11940 ms): radio_irq_handler_v2: LoRa, datum=0x40. opmode=88C
746424 (11942 ms): EV_JOINED: ch=0
netid: 19
devaddr: 260xxxxx
artKey: <erased>
nwkKey: <erased>
746456 (11943 ms): EV_TXSTART: ch=1 rps=0x03 (SF9 BW125 CR 4/5 Crc IH=0), datarate=3, opmode=888, txend=746406, avail=0
756877 (12110 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=888
819001 (13104 ms): EV_RXSTART: freq=868.3 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=888, txend=756791, avail=0,
delta ms 995
821439 (13143 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=888
881439 (14103 ms): EV_RXSTART: freq=869.5 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=888, txend=756791, avail=0,
delta ms 1994
883939 (14143 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=888
884067 (14145 ms): EV_TXCOMPLETE: ch=1 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), txrxFlags=0x20, FcntUp=0001, FcntDn=0000, txend=756791, avail=915316
Packet queued
1196626 (19146 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=900
cyberman54 commented 5 years ago

Meanwhile we tested at a different location, with different board and gateway (Lorix one). Same effect: "JOIN WAIT", and after a gateway reset immediately join. This is clearly reproducable. I wonder what kind of effect can a reset of gateway have? Perhaps duty cycle budget refill?

terrillmoore commented 5 years ago

Yep, I’m sure resetting the gateway resets any local idea of the duty-cycle limitation.

So can we conclude that the LMIC is working properly (with the patch for RX2 matching on TTN?

cyberman54 commented 5 years ago

@terrillmoore I'm not sure yet. Since this effect came in suddenly after Sep 9th 2019, something somewhere seem to have changed, and now impacts the downlink performance.

Is there a way to get reports by LMIC stack if and when the radio handler has seen a downlink? I want to compare these timestamps with the log of the packet forwarder of my gateway, to ensure, that no packets are dropped or missed one the device's side.

terrillmoore commented 5 years ago

There are two cases, JoinAccept and other messages.

JoinAccept results in an EV_JOIN.

Other messages cause a call to the rreceive-message handler registered using LMIC_registerRxMessageCb(). This call happens even for port 0 messages (although they end up as null messages to t he receiver, you can figure out the MAC message if needed).

Hope this helps...

cyberman54 commented 5 years ago

@terrillmoore I tracked this down further. It seems in my app LMIC does not switch to SF9 pre-join, for some unknown reason. This forces RX2 joins in TTN with SF12, what would explain that the issue does not recur immediately after a gateway reset, because the reset "refills" the duty cycle limit of the gateway.

I'm using this code for switching to SF9 pre-join, and asserted that LMIC_setDrTxpow is executed.

// do the network-specific setup prior to join.
#if CFG_LMIC_EU_like
  LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW);
#elif CFG_LMIC_US_like
  LMIC_selectSubBand(1);
#endif

But LMIC does not switch to SF9 pre-join:

852532: engineUpdate, opmode=0x4 [I][lorawan.cpp:313] onEvent(): JOINING 743687: engineUpdate, opmode=0x4 852708: TXMODE, freq=868300000, len=23, SF=7, BW=125, CR=4/5, IH=0 1161084: setupRx1 txrxFlags 00 --> 01 start single rx: now-rxtime: 6 1161189: RXMODE_SINGLE, freq=868300000, SF=7, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1177539 rxtime: 1161179 entry-rxtime: 16360 now-entry: 9 rxtime-txend: 304596 1216152: setupRx2 txrxFlags 0x1 --> 02 start single rx: now-rxtime: 26 1216253: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1263469 rxtime: 1216223 entry-rxtime: 47246 now-entry: 8 rxtime-txend: 359640 1263587: processRx2Jacc txrxFlags 0x2 --> 00 [I][lorawan.cpp:313] onEvent(): JOIN WAIT 1263914: engineUpdate, opmode=0x4

I am stuck here. Do you have any hint how i can proceed to get this fixed?

terrillmoore commented 5 years ago

I'm an idiot; I told you to do the wrong thing.

The code you want is:

LMIC.dn2dr = EU868_DR_SF9;

Sorry for the confusion.

cyberman54 commented 5 years ago

Thanks for the hint! I put LMIC.dn2Dr = EU868_DR_SF9 in my code prior join. But the device keeps trying to join with SF7:

926270: engineUpdate, opmode=0x4 926449: TXMODE, freq=868100000, len=23, SF=7, BW=125, CR=4/5, IH=0´

I assured that value of LMIC.dn2Dr is 3 (= EU868_DR_SF9) before calling os_runloop_once first time to kick off the join.

Sure, that it should work this way with OTAA (not ABP)?

cyberman54 commented 5 years ago

Something is still weird here. I tested different settings with the OTA compliance script. Modified setupForNetwork() as follows:

void setupForNetwork(bool preJoin) {

if CFG_LMIC_US_like

LMIC_selectSubBand(0);

endif

if CFG_LMIC_EU_like

//LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW);
LMIC.dn2Dr = EU868_DR_SF9;

endif

}

If i use LMIC_setDrTxpow(EU868_DR_SF9, KEEP_TXPOW) i get a quick join (~ 5 sec.) with SF9:

404928 (6478 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=800 404944 (6479 ms): EV_JOINING 404975 (6479 ms): EV_TXSTART: ch=0 rps=0x03 (SF9 BW125 CR 4/5 Crc IH=0), datarate=3, opmode=88C, txend=404935, avail=0 418002 (6688 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C 730126 (11682 ms): EV_RXSTART: freq=868.1 rps=0x83 (SF9 BW125 CR 4/5 NoCrc IH=0), datarate=3, opmode=88C, txend=417998, avail=0, delta ms 4994 746002 (11936 ms): radio_irq_handler_v2: LoRa, datum=0x40. opmode=88C 746173 (11938 ms): EV_JOINED: ch=0

If i use LMIC.dn2Dr = EU868_DR_SF9 i get a slightly delayed (~ 6 sec) join with SF7 and sometimes "JOIN WAIT":

404916 (6478 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0 404933 (6478 ms): EV_JOINING 469966 (7519 ms): EV_TXSTART: ch=0 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=469945, avail=0 473879 (7582 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C 785628 (12570 ms): EV_RXSTART: freq=868.1 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=473875, avail=0, delta ms 4988 791003 (12656 ms): radio_irq_handler_v2: LoRa, datum=0x40. opmode=88C 791175 (12658 ms): EV_JOINED: ch=0

terrillmoore commented 5 years ago

So here's the way it should work, according to my understanding of the spec.

  1. the uplink starts at SF7 (DR5). The device then listens during RX1 at SF7, and during RX2 at LMIC.dn2dr (SF9 for TTN, SF12 for standard).

  2. The network sends either during RX1 or during RX2; it doesn't ever do both.

  3. The device has no control over whether the network uses RX1 or RX2.

  4. If the device doesn't see a JoinAccept, it tries again, occasionally decrementing the data rate (increasing the spreading factor). (The LMIC tries each "default" channel at DR5 (SF7), tries a second time on each channel, and then goes to DR4 (SF8), and repeats the cycle until you get to DR0(SF12). Compliance tests like this behavior. The LMIC, after trying twice on each channel at SF12, declares a "join failure" and restarts the cycle.)

  5. there are three default channels in EU.

  6. To figure out which window was used, check the number of EV_RXSTART after the EV_TXSTART. Both of your most recent logs show that the JoinAccept happened during RX1.

Without using something like a network tester, it's hard to be sure what's going on, because the network is a big variable. It can legally:

  1. send a JoinAccept during RX1 on the same channel with same SF as the Join message uplink
  2. send a JoinAccept during RX2 on the designated RX2 channel with the network-specific RX2 data rate (SF9 for TTN, SF12 for most other networks).
  3. not send a JoinAccept at all.

We have evidence that case 3 happens at least some of the time.

terrillmoore commented 5 years ago

We have evidence that case 3 happens at least some of the time.

One way to tease out some of the network behavior (on TTN) is to watch several console pages in a web browser.

  1. The device status page. The device keys will all change immediately when the JoinRequest is received by the network.

  2. The device traffic page. This (depending on how well TTN is working that day) shows you the JoinRequest (uplink) and scheduled JoinAccept (downlink). This doesn't tell you that that the JoinAccept actually got to the gateway! I've seen many times that this seemed to happen, but the gateway never got the message.

  3. The gateway traffic page. This will show you what the network thinks it asked the gateway to do. This is logically after step 2 (as encryption has been applied). But it only shows the network's intention. The packet can still get dropped, or delayed, on its way to the gateway, resulting in packet drops.

  4. the logs on the gateway itself. These can tell you (a) what came from upstream, and (b) what the gateway decided to do. However, how useful the logs are depend on which packet forwarder you have and what debug level has been enabled.

My understanding is that the gateway may not even know if a packet has been dropped from the TX queue, if it's too late or something else is going on (e.g., an uplink is happening from another device on the same channel -- since EU shares uplink and downlink channels, the gateway has to decide whether to discard TXs that happen to be schedule dwhile a packet is being received from another device.)

cyberman54 commented 5 years ago

Moved a small step forward here. Found out why in my code the setting RX2 == SF9 for ttn did not work. I called LMIC.dn2Dr = EU868_DR_SF9 right before LMIC_startJoining() but this can't work*. After i moved LMIC.dn2Dr = EU868_DR_SF9 to the event handler (put it to case JOINING, as found in the compliance script) i now see correct setting SF9 for RX2 in the log while joining. Perhaps this should be documented somewhere, otherwise probably most of TTN & LMIC users will use wrong RX2 setting.

So far, so good.

But the actual JOIN WAIT problem persists. Probably something caused by the TTN backend. But i still wonder what this triggered, because i didn't see this problem before 9th Sep 2019.

*) Because LMIC_startJoining is calling resetJoinParams() which overwrites

enum { DR_DNW2 = EU868_DR_SF12 }; [...] LMIC.dn2Dr = DR_DNW2;

cyberman54 commented 5 years ago

I think rest of the JOIN WAIT problem is depending on external factors, like TTN backend, network congestion and duty cycle. It's not a LMIC related issue. So we can keep this issue closed and detach the label "bug" from it.

terrillmoore commented 5 years ago

Thanks for the good work on RX2, I had overlooked that. Filed #474, will try to address this in a general way.

cyberman54 commented 5 years ago

Sorry for hitchhiking this thread. But any ideas who to analyze / solve the original "JOIN WAIT" problem? It still persists, and i am completely stuck. No more ideas where to look at. Any hints would be appreciated.

cyberman54 commented 5 years ago

@terrillmoore @manuelbl is it perhaps necessary to adjust timing constants in oslmic.h to run LMIC on the ardunio-esp32 platform?

e.g.

OSTICKS_PER_SEC or RX_RAMPUP

manuelbl commented 5 years ago

I rarely use the Arduino platform. But if I do, I don't change any timings. In my ESP-IDF based code, an entirely different hardware abstraction layer (HAL) is used, which cannot be blocked by user code. But again, I use the same timing constants incl. OSTICKS_PER_SEC, which could be easily changed.

I don't experience any problems with joining. However, my tests are limited to communication via my own gateway as I can't get a connection to other gateways (the closest ones are all indoors and with hills in-between).

manuelbl commented 5 years ago

@cyberman54 I've tried again with my gateway turned off and it seems I can reach another gateway after all. In the first test, I joined via my own gateway before turning it. That worked without problem.

In the second test, I turned off my gateway before starting. It then took quite some time until the join succeeded. It succeeded on the second attempt with data rate 1 (SF 11, BW 125kHz). In the TTN Console (Applications / ... / Data), I could see activation requests much earlier.

In both tests, I've used the ESP-IDF based library.

With my limited understanding of the protocol, I think it's working correctly. None of the tests reached a JOIN WAIT.

cyberman54 commented 5 years ago

@manuelbl Thanks for your notes. I will take a look at your code. Perhaps i can change my arduino-esp32 based code to native ESP-IDF.

@terrillmoore The OTAA compliance script logs millisecond deltas. What ranges of these values should be expected for stable downlink operation?

calibration not supported Packet queued 316918 (5070 ms): LMIC_setTxData_strict, datum=0x1000004. opmode=0 316932 (5070 ms): EV_JOINING 563966 (9023 ms): EV_TXSTART: ch=2 rps=0x01 (SF7 BW125 CR 4/5 Crc IH=0), datarate=5, opmode=88C, txend=563971, avail=0 567885 (9086 ms): radio_irq_handler_v2: LoRa, datum=0x8. opmode=88C 878134 (14050 ms): EV_RXSTART: freq=868.5 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=567881, avail=0, delta ms 4964 882822 (14125 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C 942822 (15085 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=567881, avail=0, delta ms 5999 957322 (15317 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C 957448 (15319 ms): EV_JOIN_TXCOMPLETE, saveIrqFlags 0x80

manuelbl commented 5 years ago

@terrillmoore @cyberman54 BTW: I don't set RX2 to SF9 because of this sentence (see here):

If your devices use OTAA, this will be configured automatically when they join.

I think this implies that for joins either RX2 is not used for joins or that SF12 is used. After the join, it automatically goes to SF9.

Regarding the timing: Would it make sense to issue a warning if a callback with a sensitive timing is executed too late? It would be a valuable feedback is timing is an issue to improve or not.

cyberman54 commented 5 years ago

@manuelbl I think there are (at least) two typical root causes which can cause timing issues:

  1. tick timer is precise, but lmic misses ticks due to interrupts or multithreading (e.g. when running on RTOS) -> software platform dependent (in my case: arduino-esp32) and maybe app dependent (in my case: lots of multitasking besides LMIC)

  2. tick timer is not precise, what means micros() does not give real microseconds -> hardware platform dependent

I don't overlook yet the grade of impact 1) and/or 2) can have on LMIC.

Perhaps it is better to have LMIC run on dedicated CPU and wrap it in a highlevel "AT-command" API.

terrillmoore commented 5 years ago

@manuelbl -- TTN uses RX2 == SF9 even for JoinAccept. So dn2dr should be set when a join is attempted, otherwise the device won't hear RX2 JoinAccepts. But you're right, it should not be changed after a JoinAccept. RX2 will be set by the JoinAccept message.

Would it make sense to issue a warning if a callback is too late?

Yes. Increment a counter (and log "how late"). I would not print anything out, because that further distorts timing; I'd capture this via the log mecanism.

@cyberman54 In fact, after we release 3.1, I would like to look at using interrupts to trigger the following things:

  1. capture the tick timer for the end-of-transmit time
  2. launch opening the receive window.

(1) is easier than (2), because (2) requires that we be able to get access to the SPI bus at any time, interrupting any pending operation. This therefore involves customizing the Arduino SPI driver so that we can reserve the SPI bus when we know a real-time operation is coming. This is not hard to do but...

Here's the headache, how to i fund the effort for ESP32 platforms. I suppose I could start reselling an ESP32-based thing, but that's kind of a a backwards approach to this. It's a week or so up front, but then has to be supported forever. I didn't have a lot of luck getting anyone to support The Things Network NY based on my support of Windows for the Raspberry Pi; I suspect that people will mostly buy the cheapest thing they can find. Maybe we can find a volunteer to port my initial work done on MCCI's BSP to the ESP32 BSP? I can try to figure out a way to subclass the SPI driver, perhaps, so that if everybody uses the subclass it will work.

terrillmoore commented 5 years ago

@cyberman54 the timings for starting the receive window vary based on the spreading factor.

878134 (14050 ms): EV_RXSTART: freq=868.5 rps=0x81 (SF7 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=567881, avail=0, delta ms 4964 882822 (14125 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C 942822 (15085 ms): EV_RXSTART: freq=869.5 rps=0x86 (SF12 BW125 CR 4/5 NoCrc IH=0), datarate=5, opmode=88C, txend=567881, avail=0, delta ms 5999 957322 (15317 ms): radio_irq_handler_v2: LoRa, datum=0x80. opmode=88C

So: you're receiving on JoinAccept on SF7 during RX1, and SF12 during RX2. This won't work well for The Things Network.

As for timings: I have a fairly unintelligible spreadsheet that works out the delays for each spreading factor. Semtech uses the following at 125kHz (all timings in ms):

SF7 SF8 SF9 SF10 SF11 SF12
-8 -6 -2 +5 +17 +33

So if the window should nominally open at 5000 or 6000 ms, the actual values are.

SF7 SF8 SF9 SF10 SF11 SF12
4992 4994 4998 5005 5017 5033
5992 5994 5998 6005 6017 6033

The LMIC also applies the clock tolerance to this, making the window earlier. Since you see 4964 for SF7, I infer that your clock tolerance set to around 0.5% for this test.

However, we need to get your RX2 JoinAccept running at SF9....

manuelbl commented 5 years ago

@cyberman54 I have a hard time believing that the RTOS or the interrupt handling causes any problems on the ESP32. The LMIC code and the compliance script are a trivial load for the ESP32. If it works for an ATmega, it should easily work for the ESP32 too. For those people sticking with the Arduino framework, I don't see any reason for creating separate code for the ESP32.

Do you have JOIN_WAIT problems with the unmodified compliance script? If so, which branch of LMIC are you using? I'd like to reproduce it. And what boards are you using? (I have quite a collection of ESP32 board with integrated or external LoRa modules; I'm more restricted regarding gateways.)

cyberman54 commented 5 years ago

@manuelbl i can clearly reproduce the JOIN_WAIT problem with the unmodified compliance script, running with LMIC from master branch, current commit 5322dd1 . That's why i'm posting this here, otherwise i would assume an application related issue.

The problem seems to be board hardware independent, but it could depend on a specific local duty cycle / radio / gateway / TTN situation. I did not yet spent time on a reasonable measurement.

My testing environments, all and only TTN on EU868, so far: boards: Heltec Lora32 v1+v2, TTGO T-Beam v7+v10, other TTGOs, home brewn board gateways: TTIG, MatchX1707, Kerlink Wirnet station, Tektelics indoor (a and RAK coming this week)

cyberman54 commented 5 years ago

@terrillmoore thanks for your - again - very helpful and clear notes.

Join in RX2 on SF9 works by applying the LMIC.dn2Dr = EU868_DR_SF9 patch in the event handler. I removed this patch after i saw replies that say it is non standard and join should work without. Now i will apply it again. Anyway, this does unfortunately not solve the JOIN_WAIT problem.

I totally agree with you on your thoughs how to fund support for an ESP32 native LMIC port. In our commercial application project we decided to not progress with ESP32, but to use ARM/Cortex SoCs dedicated to the communication stack, like MCCI does with Catena boards.