Downlink message fails: class A has no more alternate TX time

dave-settle commented 2 years ago

If a basicstation process is (re)started, but receives no downlink message within the first hour or so, it will permanently refuse to transmit downlink messages (e.g. Join Accept).

The situation is described clearly in https://github.com/TheThingsNetwork/lorawan-stack/issues/2097, although the problem of course may be different.

To reproduce: restart basicstation and ensure that it receives no traffic for at least 90 minutes. Then start an end node - the join request is received, sent to the back-end (TTN in my case), and the back-end responds with a join accept. However, basicstation will refuse to transmit this to the end node with the error "class A has no more alternate TX time". The error appears to be permanent and no further downlink messages are ever transmitted.

Workaround: restart basicstation and get the end node to send a join request within 30 minutes. Everything will be processed normally and the join accept downlink message will be transmitted to the end node. Note: once a single transmission has been received, basicstation seems to be quite happy with long delays between transmissions - it looks like an issue only for the first request.

beitler commented 2 years ago

Hi @dave-settle,

Thanks for submitting the issue.

Which network server do you use? Why do you think the issue is related to Basics Station code rather than the network server implementation?

dave-settle commented 2 years ago

I'm using The Things Network, and I can see the join request arriving at TTN and a join accept being sent back to my basicstation process.I see the join accept message arrive at my basicstation process from TTN, but basicstation refuses to transmit it. My end node then times out waiting for the response. This happens endlessly.

If I restart basicstation and start up my end node again promptly, everything works as expected.It appears to be a problem if the first message is not received within a certain time when basicstation starts up.

beitler commented 2 years ago

Thanks for the clarification. I suggest you submit this to the TTN issue tracker. Logs would definitely help (especially the exact xtime values of the uplink and the corresponding downlink, as well as the Basics Station log messages with -l XDEBUG.

dave-settle commented 2 years ago

Hi Anton,

I don’t want to submit an issue to TTN at this stage, as it does not appear to be their problem. The downlink message is sent by TTN and arrives at Basic Station, but is then discarded:

Extracts from the Basic Station log below which will hopefully demonstrate the problem.

Start Basic Station, no RX for over 3 hours – downlink messages received from TTN but not transmitted

2021-10-24 06:55:02.357 [RAL:VERB] Concentrator started (3s62ms) … (a long time with no LoRa messages received) 2021-10-24 10:38:06.192 [S2E:VERB] RX 868.5MHz DR5 SF7/BW125 snr=-7.0 rssi=-121 xtime=0x1500031DE3A3D3 - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=13572 MIC=-746395846 2021-10-24 10:38:08.019 [S2E:VERB] ::1 diid=9587 [ant#0] - class A has no more alternate TX time 2021-10-24 10:38:20.296 [S2E:VERB] RX 868.1MHz DR5 SF7/BW125 snr=-5.8 rssi=-123 xtime=0x1500031EBA9A3B - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=31913 MIC=1421060471 2021-10-24 10:38:22.124 [S2E:VERB] ::1 diid=9622 [ant#0] - class A has no more alternate TX time

The TTN logs at around this time show repeated join request followed by acceptance (exact time not available, note one-hour difference due to GMT / UTC):

Restart the Basic Station and this time get an end node transmitting ASAP – everything works fine:

2021-10-24 11:12:24.012 [RAL:VERB] Concentrator started (3s63ms) … 2021-10-24 11:13:11.426 [S2E:VERB] RX 868.3MHz DR5 SF7/BW125 snr=9.8 rssi=-73 xtime=0xD1000002F9EFF3 - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=30501 MIC=-673706514 2021-10-24 11:13:13.253 [S2E:DEBU] ::1 diid=13698 [ant#0] - next TX start ahead by 3s164ms 2021-10-24 11:13:16.397 [S2E:VERB] ::1 diid=13698 [ant#0] - starting TX in 19ms892us 2021-10-24 11:13:16.422 [S2E:INFO] TX ::1 diid=13698 [ant#0] - dntxed: 868.3MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=20A70497EAFB8D3E5186E16F..97B1C4D0 2021-10-24 11:13:16.489 [S2E:DEBU] Tx done diid=13698

Corresponding TTN logs showing data uplink message received from end node (again, note one-hour time difference):

I can send you more detailed Basic Station logs if you wish, but the one that I have is attached.

beitler commented 2 years ago

Thanks for the logs. However, the log level is not detailed enough to extract the required information. Please run station with -l XDEBUG and -p parameters. Also, please have a look at these issues: https://github.com/lorabasics/basicstation/issues/103 https://github.com/lorabasics/basicstation/issues/87 and try to rule out any timing related issues by increasing the TX_AIM_GAP value for example to 50ms. You can do this via an environment variable:

TX_AIM_GAP='"50ms"' ./station -l XDEBUG -L - -p

dave-settle commented 2 years ago

Unfortunately I do not think that increasing the value of TX_AIM_GAP will help, for the following reasons:

The error message must be generated from the call to altTxTime at s2e.c line 535, as this is the only call to the function which is not followed by an additional error message
If so, then txtime has already been found to be LESS than earliest (i.e. now + TX_AIM_GAP), otherwise the altTxTime funtion would not have been called (left-to-right evaluation rules)
Increasing the value of earliest will not help if this is the case - although decreasing the value might

I'm not an expert on the protocol requirements, but it is not clear to me why the approach of "if too late, try alternative TX times, if out of alternatives drop it" has been selected. It might be better to choose the approach "if too late, try alternative TX times, if out of alternatives send it immediately anyway".

I will add a further trace statement at s2e.c line 536 to show the values of txtime and earliest, and try to reproduce the problem. This will give some idea of whether the problem could be solved by reducing TX_AIM_GAP or whether there is something going on with time synchronisation.

yannickgagne commented 1 year ago

Hi. I just wanted to know if there is anything new on this issue since I'm experiencing the same thing on my gateway.

RPi3+RAK hat+RAK2287.

Thanks

dave-settle commented 1 year ago

Nobody seems interested in fixing the problem, and I haven't had time to trace the root cause myself.

However, there is a workaround: make sure that your gateway receives an incoming message shortly after it starts up. I keep a node in the house so that the gateway receives regular messages, and this seems to keep it happy. Edit: Shortly means within about 20 minutes, in my case.

beitler commented 1 year ago

Hi @yannickgagne, are you observing this behavior on TTN only? Or do you observe this on another LNS?

@dave-settle, thank you for your comment. The issue https://github.com/TheThingsNetwork/lorawan-stack/issues/2097 describes in detail the underlying root cause why this issue appears. Given that you observe the same issue, it seems that TTN did not fix this issue yet. Although, to confirm this assumption we would need visibility into the exact protocol messages which you get with the XDEBUG log level. You wrote

I don’t want to submit an issue to TTN at this stage, as it does not appear to be their problem. The downlink message is sent by TTN and arrives at Basic Station, but is then discarded [...]

As described in the issue above, the reason for the downlink to be discarded is because the downlink timestamp populated by the LNS is not correct. Actually, we improved the logging information in version 2.0.6 (released January 2022), so the error message now should be more informative.

I will try to reproduce the issue but I would appreciate if you could also run the check on your side using station version 2.0.6 and XDEBUG log level. Thanks a lot for your support!

yannickgagne commented 1 year ago

@beitler TTN only yes. Never tested elsewhere.

lorabasics / basicstation

Downlink message fails: class A has no more alternate TX time #128