Open dave-settle opened 2 years ago
Hi @dave-settle,
Thanks for submitting the issue.
Which network server do you use? Why do you think the issue is related to Basics Station code rather than the network server implementation?
I'm using The Things Network, and I can see the join request arriving at TTN and a join accept being sent back to my basicstation process. I see the join accept message arrive at my basicstation process from TTN, but basicstation refuses to transmit it. My end node then times out waiting for the response. This happens endlessly.
If I restart basicstation and start up my end node again promptly, everything works as expected. It appears to be a problem if the first message is not received within a certain time after basicstation starts up.
Thanks for the clarification. I suggest you submit this to the TTN issue tracker. Logs would definitely help (especially the exact xtime values of the uplink and the corresponding downlink, as well as the Basics Station log messages with -l XDEBUG).
Hi Anton,
I don’t want to submit an issue to TTN at this stage, as it does not appear to be their problem. The downlink message is sent by TTN and arrives at Basic Station, but is then discarded:
Extracts from the Basic Station log below which will hopefully demonstrate the problem.
Start Basic Station, no RX for over 3 hours – downlink messages received from TTN but not transmitted
2021-10-24 06:55:02.357 [RAL:VERB] Concentrator started (3s62ms)
… (a long time with no LoRa messages received)
2021-10-24 10:38:06.192 [S2E:VERB] RX 868.5MHz DR5 SF7/BW125 snr=-7.0 rssi=-121 xtime=0x1500031DE3A3D3 - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=13572 MIC=-746395846
2021-10-24 10:38:08.019 [S2E:VERB] ::1 diid=9587 [ant#0] - class A has no more alternate TX time
2021-10-24 10:38:20.296 [S2E:VERB] RX 868.1MHz DR5 SF7/BW125 snr=-5.8 rssi=-123 xtime=0x1500031EBA9A3B - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=31913 MIC=1421060471
2021-10-24 10:38:22.124 [S2E:VERB] ::1 diid=9622 [ant#0] - class A has no more alternate TX time
The TTN logs at around this time show repeated join request followed by acceptance (exact time not available, note one-hour difference due to GMT / UTC):
Restart the Basic Station and this time get an end node transmitting ASAP – everything works fine:
2021-10-24 11:12:24.012 [RAL:VERB] Concentrator started (3s63ms)
…
2021-10-24 11:13:11.426 [S2E:VERB] RX 868.3MHz DR5 SF7/BW125 snr=9.8 rssi=-73 xtime=0xD1000002F9EFF3 - jreq MHdr=00 JoinEui=::0 DevEui=a861:a32:3743:8b01 DevNonce=30501 MIC=-673706514
2021-10-24 11:13:13.253 [S2E:DEBU] ::1 diid=13698 [ant#0] - next TX start ahead by 3s164ms
2021-10-24 11:13:16.397 [S2E:VERB] ::1 diid=13698 [ant#0] - starting TX in 19ms892us
2021-10-24 11:13:16.422 [S2E:INFO] TX ::1 diid=13698 [ant#0] - dntxed: 868.3MHz 16.0dBm ant#0(0) DR5 SF7/BW125 frame=20A70497EAFB8D3E5186E16F..97B1C4D0
2021-10-24 11:13:16.489 [S2E:DEBU] Tx done diid=13698
Corresponding TTN logs showing data uplink message received from end node (again, note one-hour time difference):
I can send you more detailed Basic Station logs if you wish, but the one that I have is attached.
Thanks for the logs. However, the log level is not detailed enough to extract the required information. Please run station with the -l XDEBUG and -p parameters.
Also, please have a look at these issues:
https://github.com/lorabasics/basicstation/issues/103
https://github.com/lorabasics/basicstation/issues/87
and try to rule out any timing-related issues by increasing the TX_AIM_GAP value, for example to 50ms. You can do this via an environment variable:
TX_AIM_GAP='"50ms"' ./station -l XDEBUG -L - -p
Unfortunately I do not think that increasing the value of TX_AIM_GAP will help, for the following reasons:
I'm not an expert on the protocol requirements, but it is not clear to me why the approach of "if too late, try alternative TX times, if out of alternatives drop it" has been selected. It might be better to choose the approach "if too late, try alternative TX times, if out of alternatives send it immediately anyway".
I will add a further trace statement at s2e.c line 536 to show the values of txtime and earliest, and try to reproduce the problem. This will give some idea of whether the problem could be solved by reducing TX_AIM_GAP or whether there is something going on with time synchronisation.
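The two policies contrasted above can be sketched as follows. This is a minimal illustration, not the actual code in s2e.c: the function name `pick_tx_time` and the `send_late` flag are hypothetical, and only the names `txtime` and `earliest` are borrowed from the comment above.

```python
from typing import Iterable, Optional


def pick_tx_time(candidates: Iterable[int], earliest: int,
                 send_late: bool = False) -> Optional[int]:
    """Choose a transmit time (microseconds) from the candidate TX times.

    candidates: the preferred TX time followed by any alternate TX times.
    earliest:   the soonest time at which a transmission could still start.
    send_late:  hypothetical switch for the "send it immediately anyway" policy.
    """
    saw_candidate = False
    for txtime in candidates:
        saw_candidate = True
        if txtime >= earliest:
            # This candidate can still be met, so use it.
            return txtime
    if send_late and saw_candidate:
        # Out of alternatives: transmit as soon as possible instead of dropping.
        return earliest
    # Out of alternatives: drop the downlink (the behaviour reported here).
    return None


# Both candidate times are already in the past relative to `earliest`.
print(pick_tx_time([10, 20], earliest=25))                  # dropped
print(pick_tx_time([10, 20], earliest=25, send_late=True))  # sent late
```

One caveat on the second policy: a class A end node only listens during its receive windows, so a downlink transmitted late may not be heard even if the gateway does send it.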
Hi. I just wanted to know if there is anything new on this issue since I'm experiencing the same thing on my gateway.
RPi3+RAK hat+RAK2287.
Thanks
Nobody seems interested in fixing the problem, and I haven't had time to trace the root cause myself.
However, there is a workaround: make sure that your gateway receives an incoming message shortly after it starts up. I keep a node in the house so that the gateway receives regular messages, and this seems to keep it happy. Edit: Shortly means within about 20 minutes, in my case.
Hi @yannickgagne, are you observing this behavior on TTN only? Or do you observe this on another LNS?
@dave-settle, thank you for your comment. The issue https://github.com/TheThingsNetwork/lorawan-stack/issues/2097 describes in detail the underlying root cause of this issue. Given that you observe the same behavior, it seems that TTN has not fixed it yet. However, to confirm this assumption we would need visibility into the exact protocol messages, which you get with the XDEBUG log level.
You wrote
I don’t want to submit an issue to TTN at this stage, as it does not appear to be their problem. The downlink message is sent by TTN and arrives at Basic Station, but is then discarded [...]
As described in the issue above, the downlink is discarded because the downlink timestamp populated by the LNS is not correct. We also improved the logging in version 2.0.6 (released January 2022), so the error message should now be more informative.
I will try to reproduce the issue, but I would appreciate it if you could also run the check on your side using station version 2.0.6 and the XDEBUG log level. Thanks a lot for your support!
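When comparing the xtime values of an uplink and its downlink, it can help to split them into their fields. The sketch below assumes the layout recalled from the Basics Station sources (bits 62-56 radio unit, bits 55-48 session value, bits 47-0 microseconds since concentrator start); please verify that against the version you run. The helper name `decode_xtime` is made up for illustration, and the two inputs are the xtime values from the join requests in the logs earlier in this thread.

```python
def decode_xtime(xtime: int) -> dict:
    """Split a 64-bit xtime into its fields (assumed layout, see lead-in)."""
    return {
        "txunit": (xtime >> 56) & 0x7F,       # radio unit that produced the timestamp
        "session": (xtime >> 48) & 0xFF,      # value identifying the concentrator session
        "micros": xtime & ((1 << 48) - 1),    # microseconds since concentrator start
    }


failing = decode_xtime(0x1500031DE3A3D3)  # join request whose join accept was dropped
working = decode_xtime(0xD1000002F9EFF3)  # join request whose join accept was transmitted

for name, fields in (("failing", failing), ("working", working)):
    seconds = fields["micros"] / 1e6
    fits_32_bits = fields["micros"] < 2**32
    print(f"{name}: session=0x{fields['session']:02X} "
          f"uptime={seconds:.1f}s fits_in_32_bits={fits_32_bits}")
```

Under that assumed layout, the failing uplink was received roughly 13386 s (about 3.7 hours) after the concentrator started, which matches the log timestamps, and its microsecond count no longer fits in 32 bits. The working uplink arrived about 50 s after start. Whether the 32-bit boundary has anything to do with the failure is not established in this thread; the XDEBUG output requested above would be needed to see which xtime the LNS actually sends back.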
@beitler TTN only yes. Never tested elsewhere.
If a basicstation process is (re)started, but receives no downlink message within the first hour or so, it will permanently refuse to transmit downlink messages (e.g. Join Accept).
The situation is described clearly in https://github.com/TheThingsNetwork/lorawan-stack/issues/2097, although the problem of course may be different.
To reproduce: restart basicstation and ensure that it receives no traffic for at least 90 minutes. Then start an end node - the join request is received, sent to the back-end (TTN in my case), and the back-end responds with a join accept. However, basicstation will refuse to transmit this to the end node with the error "class A has no more alternate TX time". The error appears to be permanent and no further downlink messages are ever transmitted.
Workaround: restart basicstation and get the end node to send a join request within 30 minutes. Everything will be processed normally and the join accept downlink message will be transmitted to the end node. Note: once a single transmission has been received, basicstation seems to be quite happy with long delays between transmissions - it looks like an issue only for the first request.