cblack-f3 commented 3 years ago

We've been banging our heads for a bit on this issue, and I'd like to get some insight into what might be going on in the s2e_nextTxAction state machine to cause this "radio is not emitting frame" error. Firstly, the issue is intermittent (one out of every 10-30 or so downlinks). A bit about the setup

Environment

GW: Multitech Conduit fw version 5.3.0 : Station Ver : 2.0.5(mlinux/std) 2020-10-19 19:44:13
LNS: Chirpstack network server 3.12
End-device: Custom device - Class C
Station conf: station.conf.txt
Station bootlog for reference: basicstation_bootlog.txt
The Test

We are manually issuing an unsolicited downlink frame in class C mode (so no RX1/RX2 timing windows come into play here). The device is muted after joining the LNS, so as to not interfere with the test. We manually send a 2 byte dummy frame to this device every 5-10 seconds, though we have seen the error occur up to an hour after the previous packet, so I can safely say this is not a "back to back" frame issue. The attached logfile shows 7 successful downlinks, then the 8th fails. (primary error logfile: successes_then_failure.txt) When this occurs, it will never succeed in the "retry" (it always goes through 10 retries and eventually fails); maybe that nuance of behavior is important.

At this point I am not convinced it is a server issue, so I'd like to get some insight on this as a potential issue in basic station.

Also, I am very curious why the retries don't seem to function - for one test, I set CLASS_C_BACKOFF_BY setting in the station conf, to allow more time between retries. It still failed all 10 attempts across 5 seconds: emitting_frame_error_500ms.txt. Maybe this information is also useful

beitler commented 3 years ago

Thanks for the well structured description. I am sure we will find out what is going on in your setup.

Judging by the logs, your frame is definitely placed into the radio TX buffer, but the TX status verification is failing. These are time critical operations. from the high level description is sounds to me that the TX timing parameters are not properly tuned to the expected SPI delays for your gateway:

https://github.com/lorabasics/basicstation/blob/bd17e53ab1137de6abb5ae48d6f3d52f6c268299/src/s2conf.h#L190-L193

To find out, please do the following:

Print the low-level tuning parameters with station -p and post them here.
Run your tests again with libloragw compiled in DEBUG mode (explained here) and station run with XDEBUG loglevel. Please post the logs here.
When conducting your test it would help to have a spectrum analyser handy to check if the frame for which station is unable to verify the TX status is actually emitted by the radio or not.

cblack-f3 commented 3 years ago

Thanks for the quick reply: station -p

   str RADIODEV             = "/dev/spidev?.0" builtin      default radio device
    u4 LOGFILE_SIZE         = "10MB"     builtin      default size of a logfile
    u4 LOGFILE_ROTATE       = 3          builtin      besides current log file keep *.1..N (none if 0)
    u4 TCP_KEEPALIVE_EN     = 1          builtin      TCP keepalive enabled
    u4 TCP_KEEPALIVE_IDLE   = 60         builtin      TCP keepalive TCP_KEEPIDLE [s]
    u4 TCP_KEEPALIVE_INTVL  = 15         builtin      TCP keepalive TCP_KEEPINTVL [s]
    u4 TCP_KEEPALIVE_CNT    = 4          builtin      TCP keepalive TCP_KEEPCNT
    u4 MAX_JOINEUI_RANGES   = 10         builtin      max ranges to suppress unwanted join requests
ustime CUPS_CONN_TIMEOUT    = "60s"      builtin      connection timeout
ustime CUPS_OKSYNC_INTV     = "24h"      builtin      regular check-in with CUPS for updates
ustime CUPS_RESYNC_INTV     = "1m"       builtin      check-in with CUPS for updates after a failure
    u4 CUPS_BUFSZ           = "8KB"      builtin      read from CUPS in chunks of this size
ustime GPS_REPORT_DELAY     = "120s"     builtin      delay GPS reports and consolidate
ustime GPS_REOPEN_TTY_INTV  = "1s"       builtin      recheck TTY open if it failed
ustime GPS_REOPEN_FIFO_INTV = "1s"       builtin      recheck if FIFO writer fake GPS
ustime CMD_REOPEN_FIFO_INTV = "1s"       builtin      recheck if FIFO writer
ustime RX_POLL_INTV         = "20ms"     builtin      interval to poll SX1301 RX FIFO
ustime TC_TIMEOUT           = "2s"       station.conf reconnected to muxs
ustime CLASS_C_BACKOFF_BY   = "100ms"    builtin      retry interval for class C TX attempts
    u4 CLASS_C_BACKOFF_MAX  = 10         builtin      max number of class C TX attempts
ustime RADIO_INIT_WAIT      = "200ms"    builtin      max wait for radio init command to finish
ustime PPS_VALID_INTV       = "10m"      builtin      max age of last PPS sync for GPS time conversions
ustime TIMESYNC_RADIO_INTV  = "2100ms"   builtin      interval to resync MCU/SX1301
ustime TIMESYNC_LNS_RETRY   = "71ms"     builtin      resend timesync message to server
ustime TIMESYNC_LNS_PAUSE   = "5s"       builtin      pause after unsuccessful volley of timesync messages
    u4 TIMESYNC_LNS_BURST   = 10         builtin      volley of timesync messages before pausing
ustime TIMESYNC_REPORTS     = "5m"       builtin      report interval for current timesync status
ustime TX_MIN_GAP           = "10ms"     builtin      min distance between two frames being TXed
ustime TX_AIM_GAP           = "20ms"     builtin      aim for this TX lead time, if delayed should not fall under min
ustime TX_MAX_AHEAD         = "600s"     builtin      maximum time message can be scheduled into the future
ustime TXCHECK_FUDGE        = "5ms"      builtin      check radio state this time into ongoing TX
ustime BEACON_INTVL         = "128s"     builtin      beaconing interval
    u4 TLS_SNI              = true       builtin      Set and verify server name of TLS connections

I will work on libloragw in DEBUG and using a spec an to see if the packet actually sent. I would say it is not, since our end device sees a frame on 100% of the success cases, and does not see the frame on 100% of the 'frame not emitted' cases

cblack-f3 commented 3 years ago

@beitler Here is a logfile again, with XDEBUG. I've been working with Multitech on the debug libloragw. They sent me a new ipk for station, let me know if this has the appropriate level. This one failed on the 9th frame I sent down. I added a space between each frame for readability

station_log_xdebug.txt

Also working on that spec an - I need to borrow one from the office so might be a few days

cblack-f3 commented 3 years ago

I've hooked up an envelope follower to my scope

When it succeeds, I get this: DS4_QuickPrint1

When I produce the error condition, there is NO RF activity.

beitler commented 3 years ago

Thank you for the details.

Let me walk through each line of the relevant part of the log:

INFO: tx_start_delay=1497 (1497.000000) - (1497, bw_delay=0.000000, notch_delay=0.000000)

The HAL prints this this line as part of the lgw_send setup sequence: https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1414 https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L407 Although this line is printed first, it does not mean that it happened chronologically before the next log lines. The reason is that the HAL uses printf directly for logging. Station caches the log lines in a buffer and emits the buffer once full or timeout reached in order to reduce IO overhead. That's why station logs are slightly delayed.

2021-02-08 18:32:51.516 [AIO:XDEB] [3|WS] < {"msgtype":"dnmsg","DevEui":"01-01-01-01-01-01-01-01","dC":2,"diid":13086,"pdu":"606d23dc000a0a0003000000700331ffff01c7fff68e2aafa3","priority":1,"RX2DR":10,"RX2Freq":923900000,"xtime":49258120969134059,"rctx":0}

This is the Websocket reporting that a downlink message came in. Notice, that a class C unsolicited downlink does not require xtime. It generates a warning later on.

2021-02-08 18:32:51.516 [S2E:WARN] RxDelay mapped to 1 as it was not present!

RXDelay parameter was not passed, so 1 second is assumed.

2021-02-08 18:32:51.516 [SYN:ERRO] Cannot convert xtime=0xAF000002BA622B - obsolete session: 175 (current 50)

This is the xtime related warning. Basically the timestamp cannot be properly mapped to any time reference and cannot be converted to radio time. However, since class C unsolicited downlinks are scheduled 'asap' the xtime value is ignored anyway and this warning does not have an effect on the downlink processing.

2021-02-08 18:32:51.516 [S2E:VERB] 101:101:101:101 diid=13086 [ant#0] - starting TX in 19ms694us

The received dnmsg generated a TX job which is scheduled to happen in about 20ms which corresponds to TX_AIM_GAP. This basically means that station schedules this downlink as soon as possible.

2021-02-08 18:32:51.541 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative

5ms into the ongoing transmission (TXCHECK_FUDGE) station retrieves the radio tx status and expects it to be in TXSTATUS_EMITTING. https://github.com/lorabasics/basicstation/blob/bd17e53ab1137de6abb5ae48d6f3d52f6c268299/src/s2e.c#L635 The error message indicates that the radio was found not to be in TXSTATUS_EMITTING state.

2021-02-08 18:32:51.641 [S2E:ERRO] 101:101:101:101 diid=13086 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative

The retry happens 100ms later and failing for the same reason.

Unfortunately, I was not able to reproduce the problem on the hardware that is available to me. Given that in your case this issue occurs sporadically, it is probably related so a timing problem varying with a jitter. I propose you adjust the relevant TX timing parameters a bit and see if the problem occurs more or less often. Specifically, please try the following values in various combinations: TX_AIM_GAP --> 20ms, 30ms, 50ms TXCHECK_FUDGE --> 5ms, 10ms, 20ms These values can be set inside station.conf.

The following flow diagram may help to understand how these values work together:

Also, it would be good to print out the actual tx_status of the radio at the point where EMITTING is expected. This would require adding a corresponding printf statement either in the HAL: https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/src/loragw_hal.c#L1636 or in station: https://github.com/lorabasics/basicstation/blob/bd17e53ab1137de6abb5ae48d6f3d52f6c268299/src/ral_lgw.c#L236

cblack-f3 commented 3 years ago

Thanks for the very descriptive reply! I will start on this right away and ping MT about the rebuild with some extra debug statements.

Seeing that no frame is seen on the RF detector, would you expect "TXCHECK_FUDGE" to have an impact? I know I am getting beyond my understanding of the internals, but if this error is that we are checking at the wrong time, it would seem like I would still see the packet at the RF out.

pferland-multitech commented 3 years ago

Would this patch file work to output the status, or would you need it to be more specific?

`--- ral_lgw.c 2021-02-11 11:20:53.683043806 -0600 +++ ral_lgw_new.c 2021-02-11 11:29:49.287054602 -0600 @@ -240,6 +240,7 @@

else

 int err = lgw_status(TX_STATUS, &status);

endif

LOG(XDEBUG, "TX STATUS: %#010x", status); if (err != LGW_HAL_SUCCESS) { LOG(MOD_RAL|ERROR, "lgw_status failed"); return TXSTATUS_IDLE; `

beitler commented 3 years ago

@pferland-multitech Indeed, this patch would help to visualize the radio status when station expects the EMITTING state.

Also, what struck me in the log is that the retries are scheduled only 100ms apart while it should be 120ms (CLASS_C_BACKOFF_BY + TX_AIM_GAP). Which would mean that TX_AIM_GAP is not honoured. It turns out the s2e tx state machine has a bug which causes this behavior. The patch below fixes that:

diff --git a/src/s2e.c b/src/s2e.c
index f8fd7f5..8d74d18 100644
--- a/src/s2e.c
+++ b/src/s2e.c
@@ -637,6 +645,7 @@ ustime_t s2e_nextTxAction (s2ctx_t* s2ctx, u1_t txunit) {
                 // Something went wrong - should be emitting
                 LOG(MOD_S2E|ERROR, "%J - radio is not emitting frame - abandoning TX, trying alternative", curr);
                 ral_txabort(txunit);
+                curr->txflags &= ~TXFLAG_TXING;
                 goto check_alt;
             }
             // Looks like it's on air

This patch will not fix the issue you are observing. It will just fix the retries after the initial TX failure. While it is still puzzling why the TX is failing in the first place, this patch will at least prevent the retries from failing. Please apply this patch for your further tests. It will flow into the next release as well.

cblack-f3 commented 3 years ago

I thought I would post the first of the 6 combinations of GAP/FUDGE, since it has the patch and TXSTATUS debug print, currently working on the rest of the combinations

gap20_fudge5.txt

beitler commented 3 years ago

Great, thanks for posting the data, which is already quite revealing. Have a look at these lines:

2021-02-15 19:08:27.644 [S2E:VERB] 101:101:101:101 diid=31845 [ant#0] - starting TX in 19ms685us
2021-02-15 19:08:27.670 [any:XDEB] TX STATUS: 0x00000003
2021-02-15 19:08:27.670 [S2E:ERRO] 101:101:101:101 diid=31845 [ant#0] - radio is not emitting frame - abandoning TX, trying alternative
2021-02-15 19:08:27.777 [any:XDEB] TX STATUS: 0x00000002

At time 08:27.644 The downlink command is received from the LNS and scheduled for transmission ~20ms later. At time 08:27.665 (+20) The downlink is put into the TX buffer of the sx1301 (no log) At time 08:27.670 (+5) The radio is polled for its status. The gateway reports 0x00000003 (TX_SCHEDULED). The TX operation is aborted. The status reported later by the sx1301 is TX_FREE.

https://github.com/Lora-net/lora_gateway/blob/a955619271b5d0a46d32e08150acfbc1eed183b7/libloragw/inc/loragw_hal.h#L141-L142

This indicates that at the time when the status is checked, the frame did not start emitting yet. This points at severe drift between MCU clock and radio clock. The clock sync reports also show large worst case drifts:

2021-02-15 19:08:28.325 [SYN:INFO] MCU/SX130X drift stats: min: -3.3ppm  q50: -15.7ppm  q80: -63.8ppm  max: -102.1ppm - threshold q90: -100.4ppm
2021-02-15 19:08:28.325 [SYN:INFO] Mean MCU drift vs SX130X#0: -28.8ppm

but 100 ppm is not enough to drift away more than 5ms between sync events (occurring every TIMESYNC_RADIO_INTV - 2100ms by default).

Still, it would be interesting to see the behavior with larger values for TXCHECK_FUDGE.

If you could, please incorporate the patch from my previous comment into your tests going forward. Thanks!

cblack-f3 commented 3 years ago

OK, these tests took awhile today, but here is the 9 combinations of TXFUDGE & GAP, and their results. The test consists of a script that sends down 200 frames per test, spaced out by 5 seconds.

ERRORS is the number of occurrences of the frame not emitting error
TXDONE is the number of frames successfully sent from the gateway (The patch is working: retries are now succeeding)
RECEIVED is from the device logs, showing how many frames made it to the end device

CONFIG            ERRORS      TXDONE     RECEIVED
-------------------------------------------------
Gap20 Fudge 5         12         200          200
Gap20 Fudge 10        17         200          199
Gap20 Fudge 20         6         200          200

Gap30 Fudge 5          1         200          200
Gap30 Fudge 10         1         200          200
Gap30 Fudge 20         2         200          200

Gap50 Fudge 5          0         200          200
Gap50 Fudge 10         0         200          199
Gap50 Fudge 20         0         200          200

Logs for reference: gap20_fudge5.txt gap20_fudge10.txt gap20_fudge20.txt gap30_fudge5.txt gap30_fudge10.txt gap30_fudge20.txt gap50_fudge5.txt gap50_fudge10.txt gap50_fudge20.txt

beitler commented 3 years ago

Thank you for your effort. These are good results. Apparently the default timing parameters are too tight for your hardware platform. I would suggest you configure your installation with TX_AIM_GAP=50ms and TXCHECK_FUDGE=5ms.

@pferland-multitech Maybe multitech could evaluate the findings in this thread and check if this is an isolated issue or whether the default (compile time) timing parameters should be adjusted for all multitech installations. For example, if the SPI speed is clocked down relative to the default, or if the SPI driver is introducing delays, this will require adjustments on the timing parameters.

cblack-f3 commented 3 years ago

I'm going to close this issue as it seems the fix on the retry bug has been tested enough on our end.

Further timing research on the conduit can occur outside this issue

lorabasics / basicstation

Not emitting frame issues #87

Environment

The Test

else

endif