lorabasics / basicstation

LoRa Basics™ Station - The LoRaWAN Gateway Software
https://doc.sm.tc/station
Other
340 stars 178 forks source link

Issue with time drift on custom gateway with RAK2287 #133

Open upwis-gabriel opened 2 years ago

upwis-gabriel commented 2 years ago

We were having issues with the time sync on the regular LoRa Packet Forwarder and figured that we'd try this one instead. The setup is as follows: The gateway is custom and based on a TI SOC. It is running ubuntu over a kernel by TI. This board is an expansion for an existing base gateway, so the I/O was limited. Because of that an FTDI usb-serial adapter is used to run both the SPI via MPSSE and the GPS serial port. By modifying libFTDI and libMPSSE to allow for microframes the SPI latency is fairly low, with several complete exchanges occurring within a millisecond. libloragateway was modified so that all the SPI HAL functions instead call on the MPSSE functions. The SPI communication tests from libloragw seem to work just fine. libloragw is taken from sx1302_hal 1.0.5 and further patched with the corecell patch from lgw1302. I also increased DFLT_TX_AIM_GAP to 50 ms in order to reduce the risk of latency being an issue. Basic Station connects to a chirpstack server on another device using chirpstack-gateway-bridge with authorization disabled. The following is the station.conf:

{
    /* If slave-X.conf present this acts as default settings */
    "SX1302_conf": {                 /* Actual channel plan is controlled by server */
        "device": "/dev/spidev0.0",
        "lorawan_public": true,      /* is default */
        "clksrc": 0,                 /* radio_0 provides clock to concentrator */
        "full_duplex": false,
"pps": true,
        "radio_0": {
            /* freq/enable provided by LNS - only HW specific settings listed here */
            "type": "SX1250",
            "rssi_offset": -215.4,
            "rssi_tcomp": {"coeff_a": 0, "coeff_b": 0, "coeff_c": 20.41, "coeff_d": 2162.56, "coeff_e": 0},
            "tx_enable": true,
            "antenna_gain": 0,           /* antenna gain, in dBi */
            "tx_gain_lut":[
                {"rf_power": 12, "pa_gain": 0, "pwr_idx": 15},
                {"rf_power": 13, "pa_gain": 0, "pwr_idx": 16},
                {"rf_power": 14, "pa_gain": 0, "pwr_idx": 17},
                {"rf_power": 15, "pa_gain": 0, "pwr_idx": 19},
                {"rf_power": 16, "pa_gain": 0, "pwr_idx": 20},
                {"rf_power": 17, "pa_gain": 0, "pwr_idx": 22},
                {"rf_power": 18, "pa_gain": 1, "pwr_idx": 1},
                {"rf_power": 19, "pa_gain": 1, "pwr_idx": 2},
                {"rf_power": 20, "pa_gain": 1, "pwr_idx": 3},
                {"rf_power": 21, "pa_gain": 1, "pwr_idx": 4},
                {"rf_power": 22, "pa_gain": 1, "pwr_idx": 5},
                {"rf_power": 23, "pa_gain": 1, "pwr_idx": 6},
                {"rf_power": 24, "pa_gain": 1, "pwr_idx": 7},
                {"rf_power": 25, "pa_gain": 1, "pwr_idx": 9},
                {"rf_power": 26, "pa_gain": 1, "pwr_idx": 11},
                {"rf_power": 27, "pa_gain": 1, "pwr_idx": 14}
            ]
        },
        "radio_1": {
            "type": "SX1250",
            "rssi_offset": -215.4,
            "rssi_tcomp": {"coeff_a": 0, "coeff_b": 0, "coeff_c": 20.41, "coeff_d": 2162.56, "coeff_e": 0},
            "tx_enable": false
        }
        /* chan_multiSF_X, chan_Lora_std, chan_FSK provided by LNS */
    },
    "station_conf": {
        "radio_init": "rinit.sh",
        "RADIO_INIT_WAIT": "5s",
        "RX_POLL_INTV": "10ms",
        "TC_TIMEOUT": "360s",
        "log_file":  "stderr",   /* "station.log" */
        "log_level": "DEBUG",  /* XDEBUG,DEBUG,VERBOSE,INFO,NOTICE,WARNING,ERROR,CRITICAL */
        "log_size":  10000000,
        "log_rotate":  3,
        "gps":  "/dev/ttyUSB1",
        "pps":  "gps"
    }
}

It seems to work for a while, but constantly prints about time sync being rejected. Here is a small sample of what it prints:

2021-11-02 12:00:46.184 [SYN:INFO] MCU/SX130X drift stats: min: +4.3ppm  q50: +74.5ppm  q80: -99.8ppm  max: +222.7ppm - threshold q90: -107.7ppm
2021-11-02 12:00:46.185 [SYN:INFO] Mean MCU drift vs SX130X#0: 23.0ppm
2021-11-02 12:00:46.941 [S2E:VERB] RX 868.3MHz DR5 SF7/BW125 snr=14.0 rssi=-62 xtime=0x82000009C5E328 - updf mhdr=40 DevAddr=01AEB637 FCtrl=90 FCnt=5286 FOpts=[] 02AD16CC..E12E mic=952774489 (41 bytes)
2021-11-02 12:00:48.294 [SYN:VERB] Time sync rejected: quality=3187 threshold=2431
2021-11-02 12:00:51.257 [S2E:VERB] RX 867.9MHz DR5 SF7/BW125 snr=13.5 rssi=-79 xtime=0x8200000A07D603 - updf mhdr=40 DevAddr=00E191B1 FCtrl=90 FCnt=5256 FOpts=[] 02613009..F1B0 mic=-199049753 (31 bytes)
2021-11-02 12:00:52.854 [S2E:VERB] RX 868.1MHz DR5 SF7/BW125 snr=14.0 rssi=-78 xtime=0x8200000A205BFC - updf mhdr=40 DevAddr=01E916B1 FCtrl=90 FCnt=5267 FOpts=[] 021AD7F4..48AA mic=-657448789 (31 bytes)
2021-11-02 12:00:53.564 [SYN:INFO] Time sync qualities: min=665 q90=3045 max=3469 (previous q90=2431)
2021-11-02 12:00:53.585 [S2E:VERB] RX 867.9MHz DR5 SF7/BW125 snr=13.5 rssi=-79 xtime=0x8200000A2B6EDA - updf mhdr=40 DevAddr=00E191B1 FCtrl=90 FCnt=5256 FOpts=[] 02613009..F1B0 mic=-199049753 (31 bytes)
2021-11-02 12:00:55.914 [S2E:VERB] RX 867.5MHz DR5 SF7/BW125 snr=10.2 rssi=-78 xtime=0x8200000A4F07BC - updf mhdr=40 DevAddr=00E191B1 FCtrl=90 FCnt=5256 FOpts=[] 02613009..F1B0 mic=-199049753 (31 bytes)
2021-11-02 12:00:59.890 [SYN:VERB] Time sync rejected: quality=3306 threshold=3045
2021-11-02 12:01:04.104 [SYN:VERB] Time sync rejected: quality=3058 threshold=3045
2021-11-02 12:01:06.214 [SYN:VERB] Time sync rejected: quality=3495 threshold=3045
2021-11-02 12:01:15.483 [S2E:VERB] RX 867.7MHz DR0 SF12/BW125 snr=11.2 rssi=-66 xtime=0x8200000B797B2F - updf mhdr=80 DevAddr=008160AE FCtrl=80 FCnt=3466 FOpts=[] 0167C2BC mic=-568887544 (16 bytes)
2021-11-02 12:01:17.807 [SYN:VERB] Time sync rejected: quality=3350 threshold=3045
2021-11-02 12:01:21.283 [S2E:VERB] RX 868.5MHz DR0 SF12/BW125 snr=8.0 rssi=-69 xtime=0x8200000BD223FC - updf mhdr=80 DevAddr=008160AE FCtrl=80 FCnt=3466 FOpts=[] 0167C2BC mic=-568887544 (16 bytes)
2021-11-02 12:01:24.138 [SYN:VERB] Time sync rejected: quality=3774 threshold=3045
2021-11-02 12:01:35.403 [S2E:VERB] RX 868.5MHz DR5 SF7/BW125 snr=10.2 rssi=-46 xtime=0x8200000CA999DF - updf mhdr=80 DevAddr=001B82D8 FCtrl=80 FCnt=55963 FOpts=[] 02E9512E..69F1 mic=1208999552 (22 bytes)
2021-11-02 12:01:35.730 [SYN:VERB] Time sync rejected: quality=3695 threshold=3045
2021-11-02 12:01:38.894 [SYN:INFO] MCU/SX130X drift stats: min: -1.9ppm  q50: +34.2ppm  q80: +102.2ppm  max: +139.7ppm - threshold q90: +138.2ppm
2021-11-02 12:01:38.894 [SYN:INFO] Mean MCU drift vs SX130X#0: 34.5ppm

After about half an hour it seems to obtain the first PPS pulse (although I can see that the PPS pin on the RAK2287 is pulsing every second on a scope). Then it tries to time sync with the network server with what looks like completely incorrect times:

2021-11-02 12:33:55.742 [SYN:INFO] First PPS pulse acquired
2021-11-02 12:33:55.742 [SYN:INFO] Obtained initial PPS offset (384393) - starting timesync with LNS
2021-11-02 12:33:55.868 [SYN:INFO] Time sync: ustime=0xDC505712 utc=0x5CFCD81C37289 gpsOffset=0x0 ppsOffset=384393 syncQual=657
2021-11-02 12:33:55.868 [SYN:INFO] Time sync: MCU/SX130X#0 ustime=0xDC504B77 xtime=0x82000080519E6E pps_xtime=0x82000100000000
2021-11-02 12:33:55.868 [SYN:INFO] Time sync: Last PPS     ustime=0xDC504B77 xtime=0x82000080519E6E pps_ustime=0x15BFEAD09 pps_xtime=0x82000100000000
2021-11-02 12:33:55.868 [SYN:DEBU] Timesync #2 sent to server
2021-11-02 12:33:55.887 [SYN:VERB] Timesync LNS: tx/rx:0xDC4A7A7C..0xDC4AC499 (18ms973us) us/gps:0x0/0x4B06F392AD81B (pps offset=384393) - 0 solutions
2021-11-02 12:33:55.939 [SYN:DEBU] Timesync #3 sent to server
2021-11-02 12:33:55.959 [SYN:VERB] Timesync LNS: tx/rx:0xDC4B90A5..0xDC4BDBEC (19ms271us) us/gps:0x0/0x4B06F392BB229 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.011 [SYN:DEBU] Timesync #4 sent to server
2021-11-02 12:33:56.016 [SYN:VERB] Timesync LNS: tx/rx:0xDC4CA6D1..0xDC4CBDAD (5ms852us) us/gps:0x0/0x4B06F392CC898 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.082 [SYN:DEBU] Timesync #5 sent to server
2021-11-02 12:33:56.097 [SYN:VERB] Timesync LNS: tx/rx:0xDC4DBD71..0xDC4DF9FE (15ms501us) us/gps:0x0/0x4B06F392E0DD8 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.157 [SYN:DEBU] Timesync #6 sent to server
2021-11-02 12:33:56.166 [SYN:VERB] Timesync LNS: tx/rx:0xDC4EE21D..0xDC4F04DB (8ms894us) us/gps:0x0/0x4B06F392F0C84 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.229 [SYN:DEBU] Timesync #7 sent to server
2021-11-02 12:33:56.245 [SYN:VERB] Timesync LNS: tx/rx:0xDC4FFA43..0xDC50398B (16ms200us) us/gps:0x0/0x4B06F39303C81 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.300 [SYN:DEBU] Timesync #8 sent to server
2021-11-02 12:33:56.322 [SYN:VERB] Timesync LNS: tx/rx:0xDC511126..0xDC5168CE (22ms440us) us/gps:0x0/0x4B06F393134D0 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.371 [SYN:DEBU] Timesync #9 sent to server
2021-11-02 12:33:56.388 [SYN:VERB] Timesync LNS: tx/rx:0xDC52281B..0xDC526B1F (17ms156us) us/gps:0x0/0x4B06F39327F70 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.443 [SYN:DEBU] Timesync #10 sent to server
2021-11-02 12:33:56.473 [SYN:VERB] Timesync LNS: tx/rx:0xDC533E73..0xDC53B5EB (30ms584us) us/gps:0x0/0x4B06F3933BC55 (pps offset=384393) - 0 solutions
2021-11-02 12:33:56.514 [SYN:DEBU] Timesync #11 sent to server
2021-11-02 12:33:56.523 [SYN:VERB] Timesync LNS: tx/rx:0xDC545536..0xDC54786F (9ms17us) us/gps:0x0/0x4B06F3934793D (pps offset=384393) - 0 solutions
2021-11-02 12:33:58.665 [SYN:ERRO] No PPS pulse for ~-2138 secs
2021-11-02 12:34:00.412 [SYN:VERB] Time sync rejected: quality=2042 threshold=1624

Time since last PPS seems to be negative, so something is off. Now, I realize this is not a supported gateway and the interface we're using is not ideal. But if anyone has an idea about what the issue could be or where to start looking that would be greatly appreciated.

beitler commented 2 years ago

Hi @upwis-gabriel, you are right to assess that the latency on your SPI interface is quite high. The time sync statistics quantify this assessment further:

2021-11-02 12:00:53.564 [SYN:INFO] Time sync qualities: min=665 q90=3045 max=3469 (previous q90=2431)
2021-11-02 12:01:38.894 [SYN:INFO] MCU/SX130X drift stats: min: -1.9ppm  q50: +34.2ppm  q80: +102.2ppm  max: +139.7ppm - threshold q90: +138.2ppm

The time sync quality is basically the time it takes to read out a register from the sx130x (in milliseconds). In your case q90 of the time sync quality is between 2.5 and 3 seconds. This is beyond expected bounds. Also, the measured MCU drift is quite high with values around 140 ppm q90. This may also be an indication that the SPI transfers are also subject to significant jitter.

For your reference, here are the time sync quality and drift values rpi4-based corecell reference design:

[SYN:INFO] Time sync qualities: min=237 q90=242 max=260 (previous q90=259)
[SYN:INFO] MCU/SX130X drift stats: min: +0.0ppm  q50: +0.0ppm  q80: +0.5ppm  max: +0.5ppm - threshold q90: +0.5ppm

Concerning the PPS sync acquisition problem there seem to be a couple things wrong, but in summary it looks like the xtick register is definitely not latching on PPS. Judging by the xtime=0x82000080519E6E your gateway was running for 35 minutes, which matches your description. However, the observed latched xtick 0x82000100000000 is definitely not right as this value is in the future relative to your 'current' xtime - hence the negative value in this warning:

2021-11-02 12:33:58.665 [SYN:ERRO] No PPS pulse for ~-2138 secs

Have you checked that the xtick register is latched on every PPS (assuming you have the GPS setting enabled in the HAL).

upwis-gabriel commented 2 years ago

Hello @beitler Thanks for the response. Is the delay measured as time sync quality really time in milliseconds? Because that would also mean that the rpi4 takes 250 ms to do the same, which seems like a lot to me. Isn't the sync procedure just "get PPS tick, get current tick, get system time, calculate reference"? As said, the SPI signals look fine on a scope, so unless there's a whole lot of other SPI traffic occurring simultaneously with the time sync or the time sync takes literally thousands of SPI transfers I can't imagine that taking so long. I would not be surprised if there is some jitter, however. Maybe most of the communication goes through fine but it occasionally fails and gets stuck. We also have a version of the same gateway based on a raspberry pi CM4, which should be significantly faster if the application ends up being CPU-bound on this one. Right now we're back to running the normal semtech packet forwarder based on 1.0.5, and you are correct about the xtick issues. I noticed that the GPS sent data over the serial port without a PPS signal to the radio, causing the program to sync an old PPS tick, a new timestamp and a new timer tick. This ended up with a temporary drift which should correct itself at the next PPS, but compounded with the (I think known and fixed in newer versions) issue where the timer and PPS ticks for some reason had five extra bytes that were kept track of in the library and could get out of sync. After patching these issues (basically not syncing if the PPS tick hasn't updated) it seems to be working fairly well. That fact furthers my suspicion that there's an issue with how the time sync quality is determined - there's no way that it would work (especially using class b) if it actually took that long to communicate. This all seems like a bit of a mystery, and I'd like to thank you again for your time.

beitler commented 2 years ago

Oh, I'm very sorry. Indeed, you are absolutely right, the timesync quality is not in milliseconds but in microseconds (as described in the documentation). Definitely, SPI transfers in the order of seconds are way too extreme. Basics Station definitely keeps a closer eye on time synchronization than the packet forwarder. This is, because a significant number of subtle as well as severe performance issues arise from the lack of proper time synchronization. The approach taken in Basics Station is to make time synchronization issues very visible in order to detect performance problems early on.

[...] After patching these issues [...] (basically not syncing if the PPS tick hasn't updated) it seems to be working fairly well.

Can you give more detail on the nature of these patches? From the context it is not clear to me, which code base required these patches. Did you patch the HAL or the packet forwarder? Did you report your findings to the HAL project?

If I understand correctly, then your current setup is comprised of a patched sx1302 HAL + packet_forwarder and you have successfully emitted class B beacons with that setup? Is that correct? Have you tried your patched sx1302 HAL + basics station?

If you still feel a sense of mystery around the time synchronization aspects in Basics Station, please consult the documentation related to timing and feel free post here any questions that arise.

upwis-gabriel commented 2 years ago

Right, if it's microseconds it kind of makes sense. While the individual SPI transactions are quick, it's very possible that it could reach a couple of milliseconds with overhead and perhaps some collision. Theoretically the sync time shouldn't differ too much and always be around the lower end of that range but the hardware is quite old and slow. Sadly I'm cursed with it, so I'll try to make it work. Switching to an RT linux kernel should help with that. Basics station does seem like a more competent packet forwarder than the semtech one, so if I can make it work well enough I think it would be preferable. The patches are to timestamp_counter_update and lgw_gps_sync in the 1.0.5 sx1302 HAL library, along with a minor patch in the packet forwarder. Just like the 1.0.5 corecell patch in this project, my patch to timestamp_counter_update tries to fix the timer wrapping issue, but just in a quicker and dirtier way. It's a fixed issue in later releases of the HAL project, so I doubt my ugly fix helps anyone. The fix in lgw_gps_sync is just adding the following lines right after calculating the time differences:

    if (cnt_diff == 0){
    DEBUG_MSG("Counter did not increase!\n");
    return LGW_GPS_ERROR;
    } 

This keeps it from screwing up the reference timer when there's a serial port timestamp but no PPS signal. It may be useful for the HAL project, but if I recall correctly Basics Station doesn't use that function. I'll check if they have mitigated this issue in the HAL project and if it's still there I'll suggest that workaround. The patch in the packet forwarder is just stopping it from continually adding beacons further and further in the future, which are rejected for being too far into the future which in turn causes the packet forwarder to queue beacons further in the future. I don't think that's fixed in newer HAL releases, so I'll suggest that addition to the HAL project. With these patches class B works and the time drift is resolved. There's still the occasional missed message, but it's not too bad. I have only used the lgw_gps_sync patch with Basics Station, and I don't think it made a difference. The other two patches don't really apply.

smtc-bot commented 1 year ago

Thank you for your inquiry.

Customers are encouraged to submit technical questions via our dedicated support portal at https://semtech.force.com/ldp/ldp_support.

We invite all users to visit the LoRa Developer Portal Forum at https://forum.lora-developers.semtech.com and to join the thriving LoRa development community!