Closed cblack-f3 closed 3 years ago
Thanks for the well structured description. I am sure we will find out what is going on in your setup.
Judging by the logs, your frame is definitely placed into the radio TX buffer, but the TX status verification is failing. These are time critical operations. from the high level description is sounds to me that the TX timing parameters are not properly tuned to the expected SPI delays for your gateway:
To find out, please do the following:
station -p
and post them here.XDEBUG
loglevel. Please post the logs here.Thanks for the quick reply: station -p
str RADIODEV = "/dev/spidev?.0" builtin default radio device
u4 LOGFILE_SIZE = "10MB" builtin default size of a logfile
u4 LOGFILE_ROTATE = 3 builtin besides current log file keep *.1..N (none if 0)
u4 TCP_KEEPALIVE_EN = 1 builtin TCP keepalive enabled
u4 TCP_KEEPALIVE_IDLE = 60 builtin TCP keepalive TCP_KEEPIDLE [s]
u4 TCP_KEEPALIVE_INTVL = 15 builtin TCP keepalive TCP_KEEPINTVL [s]
u4 TCP_KEEPALIVE_CNT = 4 builtin TCP keepalive TCP_KEEPCNT
u4 MAX_JOINEUI_RANGES = 10 builtin max ranges to suppress unwanted join requests
ustime CUPS_CONN_TIMEOUT = "60s" builtin connection timeout
ustime CUPS_OKSYNC_INTV = "24h" builtin regular check-in with CUPS for updates
ustime CUPS_RESYNC_INTV = "1m" builtin check-in with CUPS for updates after a failure
u4 CUPS_BUFSZ = "8KB" builtin read from CUPS in chunks of this size
ustime GPS_REPORT_DELAY = "120s" builtin delay GPS reports and consolidate
ustime GPS_REOPEN_TTY_INTV = "1s" builtin recheck TTY open if it failed
ustime GPS_REOPEN_FIFO_INTV = "1s" builtin recheck if FIFO writer fake GPS
ustime CMD_REOPEN_FIFO_INTV = "1s" builtin recheck if FIFO writer
ustime RX_POLL_INTV = "20ms" builtin interval to poll SX1301 RX FIFO
ustime TC_TIMEOUT = "2s" station.conf reconnected to muxs
ustime CLASS_C_BACKOFF_BY = "100ms" builtin retry interval for class C TX attempts
u4 CLASS_C_BACKOFF_MAX = 10 builtin max number of class C TX attempts
ustime RADIO_INIT_WAIT = "200ms" builtin max wait for radio init command to finish
ustime PPS_VALID_INTV = "10m" builtin max age of last PPS sync for GPS time conversions
ustime TIMESYNC_RADIO_INTV = "2100ms" builtin interval to resync MCU/SX1301
ustime TIMESYNC_LNS_RETRY = "71ms" builtin resend timesync message to server
ustime TIMESYNC_LNS_PAUSE = "5s" builtin pause after unsuccessful volley of timesync messages
u4 TIMESYNC_LNS_BURST = 10 builtin volley of timesync messages before pausing
ustime TIMESYNC_REPORTS = "5m" builtin report interval for current timesync status
ustime TX_MIN_GAP = "10ms" builtin min distance between two frames being TXed
ustime TX_AIM_GAP = "20ms" builtin aim for this TX lead time, if delayed should not fall under min
ustime TX_MAX_AHEAD = "600s" builtin maximum time message can be scheduled into the future
ustime TXCHECK_FUDGE = "5ms" builtin check radio state this time into ongoing TX
ustime BEACON_INTVL = "128s" builtin beaconing interval
u4 TLS_SNI = true builtin Set and verify server name of TLS connections
I will work on libloragw in DEBUG and using a spec an to see if the packet actually sent. I would say it is not, since our end device sees a frame on 100% of the success cases, and does not see the frame on 100% of the 'frame not emitted' cases
@beitler Here is a logfile again, with XDEBUG. I've been working with Multitech on the debug libloragw. They sent me a new ipk for station, let me know if this has the appropriate level. This one failed on the 9th frame I sent down. I added a space between each frame for readability
Also working on that spec an - I need to borrow one from the office so might be a few days
I've hooked up an envelope follower to my scope
When it succeeds, I get this:
When I produce the error condition, there is NO RF activity.
Thank you for the details.
Let me walk through each line of the relevant part of the log:
INFO: tx_start_delay=1497 (1497.000000) - (1497, bw_delay=0.000000, notch_delay=0.000000)
The HAL prints this this line as part of the lgw_send setup sequence:
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1414
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L407
Although this line is printed first, it does not mean that it happened chronologically before the next log lines. The reason is that the HAL uses printf
directly for logging. Station caches the log lines in a buffer and emits the buffer once full or timeout reached in order to reduce IO overhead. That's why station logs are slightly delayed.
2021-02-08 18:32:51.516 [AIO:XDEB] [3|WS] < {"msgtype":"dnmsg","DevEui":"01-01-01-01-01-01-01-01","dC":2,"diid":13086,"pdu":"606d23dc000a0a0003000000700331ffff01c7fff68e2aafa3","priority":1,"RX2DR":10,"RX2Freq":923900000,"xtime":49258120969134059,"rctx":0}
This is the Websocket reporting that a downlink message came in. Notice, that a class C unsolicited downlink does not require xtime
. It generates a warning later on.
2021-02-08 18:32:51.516 [S2E:WARN] RxDelay mapped to 1 as it was not present!
RXDelay parameter was not passed, so 1 second is assumed.
2021-02-08 18:32:51.516 [SYN:ERRO] Cannot convert xtime=0xAF000002BA622B - obsolete session: 175 (current 50)
This is the xtime related warning. Basically the timestamp cannot be properly mapped to any time reference and cannot be converted to radio time. However, since class C unsolicited downlinks are scheduled 'asap' the xtime value is ignored anyway and this warning does not have an effect on the downlink processing.
2021-02-08 18:32:51.516 [S2E:VERB] 101:101:101:101 diid=13086 [ant#0] - starting TX in 19ms694us
The received dnmsg
generated a TX job which is scheduled to happen in about 20ms
which corresponds to TX_AIM_GAP
. This basically means that station schedules this downlink as soon as possible.
2021-02-08 18:32:51.541 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative
5ms
into the ongoing transmission (TXCHECK_FUDGE
) station retrieves the radio tx status and expects it to be in TXSTATUS_EMITTING
.
https://github.com/lorabasics/basicstation/blob/bd17e53ab1137de6abb5ae48d6f3d52f6c268299/src/s2e.c#L635
The error message indicates that the radio was found not to be in TXSTATUS_EMITTING
state.
2021-02-08 18:32:51.641 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative
The retry happens 100ms
later and failing for the same reason.
Unfortunately, I was not able to reproduce the problem on the hardware that is available to me.
Given that in your case this issue occurs sporadically, it is probably related so a timing problem varying with a jitter.
I propose you adjust the relevant TX timing parameters a bit and see if the problem occurs more or less often.
Specifically, please try the following values in various combinations:
TX_AIM_GAP
--> 20ms, 30ms, 50ms
TXCHECK_FUDGE
--> 5ms, 10ms, 20ms
These values can be set inside station.conf.
The following flow diagram may help to understand how these values work together:
Also, it would be good to print out the actual tx_status of the radio at the point where EMITTING
is expected. This would require adding a corresponding printf statement either in the HAL:
https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1636
or in station:
https://github.com/lorabasics/basicstation/blob/bd17e53ab1137de6abb5ae48d6f3d52f6c268299/src/ral_lgw.c#L236
Thanks for the very descriptive reply! I will start on this right away and ping MT about the rebuild with some extra debug statements.
Seeing that no frame is seen on the RF detector, would you expect "TXCHECK_FUDGE" to have an impact? I know I am getting beyond my understanding of the internals, but if this error is that we are checking at the wrong time, it would seem like I would still see the packet at the RF out.
Would this patch file work to output the status, or would you need it to be more specific?
`--- ral_lgw.c 2021-02-11 11:20:53.683043806 -0600 +++ ral_lgw_new.c 2021-02-11 11:29:49.287054602 -0600 @@ -240,6 +240,7 @@
int err = lgw_status(TX_STATUS, &status);
@pferland-multitech Indeed, this patch would help to visualize the radio status when station expects the EMITTING state.
Also, what struck me in the log is that the retries are scheduled only 100ms
apart while it should be 120ms
(CLASS_C_BACKOFF_BY
+ TX_AIM_GAP
). Which would mean that TX_AIM_GAP
is not honoured. It turns out the s2e tx state machine has a bug which causes this behavior. The patch below fixes that:
diff --git a/src/s2e.c b/src/s2e.c
index f8fd7f5..8d74d18 100644
--- a/src/s2e.c
+++ b/src/s2e.c
@@ -637,6 +645,7 @@ ustime_t s2e_nextTxAction (s2ctx_t* s2ctx, u1_t txunit) {
// Something went wrong - should be emitting
LOG(MOD_S2E|ERROR, "%J - radio is not emitting frame - abandoning TX, trying alternative", curr);
ral_txabort(txunit);
+ curr->txflags &= ~TXFLAG_TXING;
goto check_alt;
}
// Looks like it's on air
This patch will not fix the issue you are observing. It will just fix the retries after the initial TX failure. While it is still puzzling why the TX is failing in the first place, this patch will at least prevent the retries from failing. Please apply this patch for your further tests. It will flow into the next release as well.
I thought I would post the first of the 6 combinations of GAP/FUDGE, since it has the patch and TXSTATUS debug print, currently working on the rest of the combinations
Great, thanks for posting the data, which is already quite revealing. Have a look at these lines:
2021-02-15 19:08:27.644 [S2E:VERB] 101:101:101:101 diid=31845 [ant#0] - starting TX in 19ms685us
2021-02-15 19:08:27.670 [any:XDEB] TX STATUS: 0x00000003
2021-02-15 19:08:27.670 [S2E:ERRO] 101:101:101:101 diid=31845 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative
2021-02-15 19:08:27.777 [any:XDEB] TX STATUS: 0x00000002
At time 08:27.644 The downlink command is received from the LNS and scheduled for transmission ~20ms later.
At time 08:27.665 (+20) The downlink is put into the TX buffer of the sx1301 (no log)
At time 08:27.670 (+5) The radio is polled for its status. The gateway reports 0x00000003
(TX_SCHEDULED). The TX operation is aborted. The status reported later by the sx1301 is TX_FREE.
This indicates that at the time when the status is checked, the frame did not start emitting yet. This points at severe drift between MCU clock and radio clock. The clock sync reports also show large worst case drifts:
2021-02-15 19:08:28.325 [SYN:INFO] MCU/SX130X drift stats: min: -3.3ppm q50: -15.7ppm q80: -63.8ppm max: -102.1ppm - threshold q90: -100.4ppm
2021-02-15 19:08:28.325 [SYN:INFO] Mean MCU drift vs SX130X#0: -28.8ppm
but 100 ppm is not enough to drift away more than 5ms between sync events (occurring every TIMESYNC_RADIO_INTV - 2100ms by default).
Still, it would be interesting to see the behavior with larger values for TXCHECK_FUDGE.
If you could, please incorporate the patch from my previous comment into your tests going forward. Thanks!
OK, these tests took awhile today, but here is the 9 combinations of TXFUDGE & GAP, and their results. The test consists of a script that sends down 200 frames per test, spaced out by 5 seconds.
CONFIG ERRORS TXDONE RECEIVED
-------------------------------------------------
Gap20 Fudge 5 12 200 200
Gap20 Fudge 10 17 200 199
Gap20 Fudge 20 6 200 200
Gap30 Fudge 5 1 200 200
Gap30 Fudge 10 1 200 200
Gap30 Fudge 20 2 200 200
Gap50 Fudge 5 0 200 200
Gap50 Fudge 10 0 200 199
Gap50 Fudge 20 0 200 200
Logs for reference: gap20_fudge5.txt gap20_fudge10.txt gap20_fudge20.txt gap30_fudge5.txt gap30_fudge10.txt gap30_fudge20.txt gap50_fudge5.txt gap50_fudge10.txt gap50_fudge20.txt
Thank you for your effort. These are good results. Apparently the default timing parameters are too tight for your hardware platform. I would suggest you configure your installation with TX_AIM_GAP=50ms
and TXCHECK_FUDGE=5ms
.
@pferland-multitech Maybe multitech could evaluate the findings in this thread and check if this is an isolated issue or whether the default (compile time) timing parameters should be adjusted for all multitech installations. For example, if the SPI speed is clocked down relative to the default, or if the SPI driver is introducing delays, this will require adjustments on the timing parameters.
I'm going to close this issue as it seems the fix on the retry bug has been tested enough on our end.
Further timing research on the conduit can occur outside this issue
We've been banging our heads for a bit on this issue, and I'd like to get some insight into what might be going on in the s2e_nextTxAction state machine to cause this "radio is not emitting frame" error. Firstly, the issue is intermittent (one out of every 10-30 or so downlinks). A bit about the setup
Environment
The Test
We are manually issuing an unsolicited downlink frame in class C mode (so no RX1/RX2 timing windows come into play here). The device is muted after joining the LNS, so as to not interfere with the test. We manually send a 2 byte dummy frame to this device every 5-10 seconds, though we have seen the error occur up to an hour after the previous packet, so I can safely say this is not a "back to back" frame issue. The attached logfile shows 7 successful downlinks, then the 8th fails. (primary error logfile: successes_then_failure.txt) When this occurs, it will never succeed in the "retry" (it always goes through 10 retries and eventually fails); maybe that nuance of behavior is important.
At this point I am not convinced it is a server issue, so I'd like to get some insight on this as a potential issue in basic station.
Also, I am very curious why the retries don't seem to function - for one test, I set CLASS_C_BACKOFF_BY setting in the station conf, to allow more time between retries. It still failed all 10 attempts across 5 seconds: emitting_frame_error_500ms.txt. Maybe this information is also useful