mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License
649 stars 212 forks source link

Arduino LMIC stops after some time (6-9 hours) and doesn’t sends data #968

Open mirhamza708 opened 1 month ago

mirhamza708 commented 1 month ago

Hello guys, I need help with this because I am trying for a month now to fix this. I have a node device and I have done all the configuration according to the github documentation of MCCI

I use RFM96 chip with stm32f103c8t6 and the pinmap is as follows

// Pin mapping
const lmic_pinmap lmic_pins = {
    .nss = PA4,
    .rxtx = LMIC_UNUSED_PIN,
    .rst = PC14,
    .dio = {PA1, PB13, PB11},
};

my platformio.ini file is as follows

; PlatformIO Project Configuration File
;
;   Build options: build flags, source filter
;   Upload options: custom upload port, speed and extra flags
;   Library options: dependencies, extra library storages
;   Advanced options: extra scripting
;
; Please visit documentation for the other options and examples
; https://docs.platformio.org/page/projectconf.html

[env:genericSTM32F103C8]
platform = ststm32
board = genericSTM32F103C8
framework = arduino
debug_tool = stlink
board_build.core = stm32
lib_deps = 
    adafruit/Adafruit ADS1X15@^2.5.0
    bblanchon/ArduinoJson@^7.2.0
    ; matthijskooijman/IBM LMIC framework@^1.5.1
    mcci-catena/MCCI LoRaWAN LMIC library@^4.1.1
build_flags =
    -D ARDUINO_LMIC_PROJECT_CONFIG_H_SUPPRESS
    -D CFG_eu868=1
    -D CFG_sx1276_radio=1
    -D LMIC_LORAWAN_SPEC_VERSION=LMIC_LORAWAN_SPEC_VERSION_1_0_2

The devices functions properly just the Lorawan side of things doesnt seem right, here is an image from chirpstack server.

a screenshot from the serial port the join at 9:07 pm is here:

-157610058: EV_TXSTART
-157288904: EV_JOINED
netid: 0
devaddr: -------- (just to hide it)
AppSKey: --------------------------------------------------- (just to hide it)
NwkSKey: --------------------------------------------------- (just to hide it)
-157288075: EV_TXSTART
-157141837: EV_TXCOMPLETE (includes waiting for RX windows)

I don’t understand why the time gets overflown? its a signed integer in the library, can someone guide why used a signed integer for time keeping? I saw a comment in the source that the negative means the time has already passed. Its confusing so please if someone can guide I will be really thankful.

here is an image from the chirpstack server 8 Hours reset chirpstack

I have placed an 8 hours reset of the lorawan stack so it does this every 8 hours just to avoid a complete stop:

void lmicStart (void) {
      // LMIC init
  os_init();
  // Reset the MAC state. Session and pending data transfers will be discarded.
  LMIC_reset();
  // Set up the channels used by the Things Network, which corresponds
  // to the defaults of most gateways. Without this, only three base
  // channels from the LoRaWAN specification are used, which certainly
  // works, so it is good for debugging, but can overload those
  // frequencies, so be sure to configure the full frequency range of
  // your network here (unless your network autoconfigures them).
  // Setting up channels should happen after LMIC_setSession, as that
  // configures the minimal channel set.
  // NA-US channels 0-71 are configured automatically
  LMIC_setupChannel(0, 868100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(1, 868300000, DR_RANGE_MAP(DR_SF12, DR_SF7B), BAND_CENTI);      // g-band
  LMIC_setupChannel(2, 868500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(3, 867100000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(4, 867300000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(5, 867500000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(6, 867700000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(7, 867900000, DR_RANGE_MAP(DR_SF12, DR_SF7),  BAND_CENTI);      // g-band
  LMIC_setupChannel(8, 868800000, DR_RANGE_MAP(DR_FSK,  DR_FSK),  BAND_MILLI);      // g2-band
  // TTN defines an additional channel at 869.525Mhz using SF9 for class B
  // devices' ping slots. LMIC does not have an easy way to define set this
  // frequency and support for class B is spotty and untested, so this
  // frequency is not configured here.

  // Disable link check validation
  LMIC_setLinkCheckMode(0);
  // // TTN uses SF9 for its RX2 window.
  // LMIC.dn2Dr = DR_SF9;
  // Set data rate and transmit power for uplink (note: txpow seems to be ignored by the library)
  LMIC_setDrTxpow(DR_SF10,14);
  LMIC_setClockError(MAX_CLOCK_ERROR * 10 /100);
  // Start job (sending automatically starts OTAA too)
  do_send(&mySendjob);
}

Without the 10% clock error (clock is setup at 24MHz using the 8MHz external oscillator and passed through the pll mul) set the join procedure takes 10-15 minutes. The hardware was designed by someone else so I can't really say much about it but the question is if it runs for 7-8 hours what's the reason that makes it stall.

Thank you guys, and if anyother information is required then please let me know.

robepapp commented 1 month ago

Hi,

Additional information as i think I have the same problem with an STM32L081CZ and an sx1272 radio. The code based on the ttn-otaa example, tailored to my hardware. After 6-8 hours of successful operation, node misses sending data (or sends occasionally once per day) In the log I can see similar minus values for the ostime like in the previous case. Debug examples:

[2024-10-20 00:26:40.749] Current interrupts:1000 [2024-10-20 00:26:40.768] 2129580220: engineUpdate, opmode=0x908 [2024-10-20 00:26:40.783] EV_TXSTART [2024-10-20 00:26:40.800] Leaving onEvent [2024-10-20 00:26:40.800] 2129583168: TXMODE, freq=867100000, len=16, SF=12, BW=125, CR=4/5, IH=0 [2024-10-20 00:26:43.105] 2129727365: setupRx1 txrxFlags 0x20 --> 01 [2024-10-20 00:26:43.122] start single rx: now-rxtime: 14 [2024-10-20 00:26:43.139] 2129728819: RXMODE_SINGLE, freq=867100000, SF=12, BW=125, CR=4/5, IH=0 [2024-10-20 00:26:43.344] rxtimeout: entry: 2129742353 rxtime: 2129727983 entry-rxtime: 14370 now-entry: 14 rxtime-txend: 62375 [2024-10-20 00:26:44.105] 2129789863: setupRx2 txrxFlags 0x1 --> 02 [2024-10-20 00:26:44.122] start single rx: now-rxtime: 13 [2024-10-20 00:26:44.141] 2129790644: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0 [2024-10-20 00:26:44.345] rxtimeout: entry: 2129804851 rxtime: 2129790483 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 124875 [2024-10-20 00:26:44.395] 2129806828: processRx2DnData txrxFlags 0x2 --> 00 [2024-10-20 00:26:44.413] 2129808839: processDnData_norx txrxFlags 00 --> 20 [2024-10-20 00:26:44.446] EV_TXCOMPLETE (includes waiting for RX windows) [2024-10-20 00:26:44.462] Leaving onEvent [2024-10-20 00:26:44.479] 2129813202: engineUpdate, opmode=0x900 [2024-10-20 00:31:44.444] Prepare data for transmisson. [2024-10-20 00:31:44.444] Current interrupts:1000 [2024-10-20 00:31:44.461] -2146406124: engineUpdate, opmode=0x908 [2024-10-20 00:31:44.494] -2146403958: engineUpdate, opmode=0x108 [2024-10-20 00:31:44.512] -2146403688: engineUpdate, opmode=0x108

Or the next with bit more details:

Current interrupts:1000 1332960928: engineUpdate, opmode=0x908 Enter onEvent os_getTime = 1332963640 osTicks, 21327 sec EV_TXSTART Leaving onEvent 1332966051: TXMODE, freq=867700000, len=16, SF=12, BW=125, CR=4/5, IH=0 do_send finished Tx result:0 1333110249: setupRx1 txrxFlags 0x20 --> 01 start single rx: now-rxtime: 13 1333111708: RXMODE_SINGLE, freq=867700000, SF=12, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1333125234 rxtime: 1333110866 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 62375 1333172749: setupRx2 txrxFlags 0x1 --> 02 start single rx: now-rxtime: 13 1333174166: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1333187734 rxtime: 1333173366 entry-rxtime: 14368 now-entry: 14 rxtime-txend: 124875 1333189482: processRx2DnData txrxFlags 0x2 --> 00 1333191613: processDnData_norx txrxFlags 00 --> 20 Enter onEvent os_getTime = 1333193470 osTicks, 21331 sec EV_TXCOMPLETE (includes waiting for RX windows) Leaving onEvent 1333197636: engineUpdate, opmode=0x900 =0x108 -1096335584: engineUpdate, opmode=0x108 -1096335366: engineUpdate, opmode=0x108 -1096334167: engineUpdate, opmode=0x108 -1096333227: engineUpdate, opmode=0x108 -1096331804: engineUpdate, opmode=0x108 -1096329617: engineUpdate, opmode=0x108

The ostime_t is a uint32_t, it should not be an overflow, neither became a negative value so I dont understand that behavior.

terrillmoore commented 1 month ago

ostime_t is not a uint32_t. It is an int32_t (see line 40 of oslmic_types.h). s4_t is int32_t and is signed.

Negative time values are expected and not a problem. The LMIC is coded taking that overflow into account. It's one of the two possible styles of handling time overflow. The classic one is using uint32_t, but the signed approach was taking in the original IBM Zurich code, and it was done correctly, as far as I can tell. I personally prefer the unsigned style, but in C with twos-complement and silent overflow, the two forms are 100% equivalent (though individual comparisons must be done differently).

robepapp commented 1 month ago

Thank you very much for your quick reply, I really appreciate it! But actually the behavior of my node to stop sending messages apparently coincides with the event when the time value goes negative. And I can't find any solution to resolve it currently.

terrillmoore commented 1 month ago

Sorry you're having problems. Because other users don't have this problem (specifically, because MCCI has run many devices for many months at a time), I have to guess that it's a system problem (the combination of LMIC and the rest of the firmware on the system) rather than a bug that is localized to the LMIC. To solve these, one needs to look at the entire code base. I suggest that you file a separate ticket, and attach a complete code base that demonstrates the problem -- of course, if you don't want to disclose your full code, you may need to develop a toy version that shows the bug. Sometimes preparing the toy version makes the bug go away; and that also gives a clue as to what to look for.

robepapp commented 1 month ago

Thank You, I will do so!

mirhamza708 commented 3 weeks ago

Thank You, I will do so!

Please remember to tag me there, I am also having the problem.

robepapp commented 3 weeks ago

In the last few days I have been testing my node with the basic ttn-otaa example program that I have attached too, but the problem still exists.

iM881aLoRaOtaa.txt

The behavior of the node is the same. The next is from the screenshot from the ChiprStack server DeviceData:

image

It can be seen that the scheduled data transmission is missed after a time (it is usually 7-10 hours) and the node start uploading data once or twice a day in apparently random cases. The original frequency of data transmission was 10 minutes, but the phenomenon is the same in case of 5 minutes. as well.

When restarting, everything starts from the beginning - that is the workaround applied by @mirhamza708

terrillmoore commented 2 weeks ago

Once again. I'm sure everyone thinks they've given me an exact repro, but nobody has. It does not happen for me, with thousands of devices deployeed and a varient of applciations.

In order to debug this, I need;

  1. a complete app that demonstrates the problem. Attach a zip file to the ticket, please. Don't assume that I have it, please.
  2. exact information on your configuration (region, etc.) I see that some of you are using EU868.
  3. exact information on the device and Arduino BSP version that you are using, along with any other libraries.
  4. exact instructions on how to duplicate with The Things Network. Chirpstack has too many configuration knobs, I cannot set up Chirpstack to duplicatie anyone else's setup. I rely on TTN because they make most of the decisions and there aren't any knobs. Once we know that it works (or doesn't work) with TTN, we can explore other LNSs.

I know this might be too much work. But I won't be able to respond unless someone undertakes this. This is free software, folks: if you have a problem, the onus is partially on you to help replicate in order to get support.

Best regards, --Terry

mirhamza708 commented 2 weeks ago

I will try to share the requested information. I have not used TTN but I will set this up. I will try to share a complete example with the issue being properly elaborated.

Thank you.

robepapp commented 1 week ago

Hi, I configured one of the Lora nodes under the TTN network with the recommended default settings, but the error phenomenon is still the same - after about 8 hours of continuous, regular data transmission, the data transmission changed to random, once or twice daily upload. If necessary, I will provide the contact information of the TTN Console. I have also attached the relevant project files in a zip file. STM32L081 Lora node PR.zip

Many thanks and Regards, Robert

terrillmoore commented 1 week ago

Thank you for posting the files. I have reviewed the code. I see you have serial debug output enabled. Can you please also post the debug output when things start to malfunction? Best regards, --Terry

robepapp commented 1 week ago

Hi, The following is a debug example from a log, when a transmit procedure started, but not finished succesfully:

Current interrupts:1000 1332960928: engineUpdate, opmode=0x908 Enter onEvent os_getTime = 1332963640 osTicks, 21327 sec EV_TXSTART Leaving onEvent 1332966051: TXMODE, freq=867700000, len=16, SF=12, BW=125, CR=4/5, IH=0 do_send finished Tx result:0 1333110249: setupRx1 txrxFlags 0x20 --> 01 start single rx: now-rxtime: 13 1333111708: RXMODE_SINGLE, freq=867700000, SF=12, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1333125234 rxtime: 1333110866 entry-rxtime: 14368 now-entry: 13 rxtime-txend: 62375 1333172749: setupRx2 txrxFlags 0x1 --> 02 start single rx: now-rxtime: 13 1333174166: RXMODE_SINGLE, freq=869525000, SF=12, BW=125, CR=4/5, IH=0 rxtimeout: entry: 1333187734 rxtime: 1333173366 entry-rxtime: 14368 now-entry: 14 rxtime-txend: 124875 1333189482: processRx2DnData txrxFlags 0x2 --> 00 1333191613: processDnData_norx txrxFlags 00 --> 20 Enter onEvent os_getTime = 1333193470 osTicks, 21331 sec EV_TXCOMPLETE (includes waiting for RX windows) Leaving onEvent 1333197636: engineUpdate, opmode=0x900 =0x108 -1096335584: engineUpdate, opmode=0x108 -1096335366: engineUpdate, opmode=0x108 -1096334167: engineUpdate, opmode=0x108 -1096333227: engineUpdate, opmode=0x108 -1096331804: engineUpdate, opmode=0x108 -1096329617: engineUpdate, opmode=0x108

Also, attached a whole debug log file containing when a node is starting until the malfunction happens. teraterm2.log

Thanks & Regards, Robert

terrillmoore commented 1 week ago

Thank you for posting the log file.

At line 1509, I see EV_TXCOMPLETE. But after that I don't see any evidence that the code is calling the transmit path.

I see that ALL transmits in the recent path past prior to the failure are on channel 867,700,000 MHz. This behavior started at around line 269; prior to that it was using the other channels.

I don't know why the LMIC started to stick on one channel.

The device is transmitting at SF12. Based on the times printed, it appears that your message takes about 2.3 seconds of airtime.

The fact that there are no more launches of the transmission strongly indicates that sendjob has been lost from the queue. This can happen (in the sketch) if timing gets messed up. However, I suspect that lines were lost between 1511 and 1512; the line "=0x108" is almost certainly a fragment of the engineUpdate, opmode=0x108 printout that follows. Those lines are about 3.4ms apart (according to time stamp) and there's a huge jump between line 1511 (which is positive) and 1513 (which is negative). That's a jump of about 8 hours (coincidentally).

It is possible that on your BSP, something bad is happening to the LMIC's idea of time -- however, that wouldn't account for the missing text between 1511 and 1512.

In addition, by the way, you have an error in your do_send(). There are three paths in the if clauses; only one calls LMIC_setTxData2(). That means that if either of the other two paths are taken, the data won't be sent -- and that means you won't ever send again. The easy fix is to set DataTxError = 1 on the other two paths; that will at least result in a printout if it fails rather than printing

do_send finished
Tx result:0

suggesting that the sent was scheduled, even though it was not.

The opmode=0x108 printout is consistent with the LMIC waiting for channel bandwidth to become available (according to the duty cycle policy). If the clock is screwed up, it could wait approximately forever.

Try fixing the DataTxError problem, and try running again to see if you can get a full log with out missing lines around the time of the error. If necessary, stop (call abort, or just loop) if you see do_send returning non-zero.

Gooseman42 commented 1 week ago

Sadly I have the same problem here: getting stuck in a loop after 6-9h. I added some code inside the engineUpdate function to help me tag the problem (basically printing NOIDLE) but it did not help.. the code is basically still the same as the original ttn-otaa example. ttn_otaa_stuckloop.zip Serial log and code/libraries attached. If required I can provide more data. Thanks for any pointers! Achim

robepapp commented 6 days ago

Hi,

I modified the do_send() function of my otaa sketch to the same like in the ttn-otaa.ino example:

void do_send(osjob_t* j)
{
    // Check if there is not a current TX/RX job running
    if (LMIC.opmode & OP_TXRXPEND) {
        #ifdef SerialDbg 
        Serial1.println(F("OP_TXRXPEND, not sending"));
        #endif      
  }    
     else {
        // Prepare upstream data transmission at the next possible time.        
         #ifdef SerialDbg 
        Serial1.println(F("Prepare data for transmisson.")); 
        Serial1.print(F("Current interrupts:"));
        Serial1.println(inputCounter);
        #endif              
          mydata[0] = (int)((inputCounter & 0XFF));
          mydata[1] = (int)((inputCounter >> 8) & 0XFF);
          mydata[2] = (int)((inputCounter >> 16) & 0xFF);
          mydata[3] = (int)((inputCounter >> 24) & 0xFF);           
          LMIC_setTxData2(1, mydata, 3,0);
          #ifdef SerialDbg 
           Serial1.println(F("Pocket queued"));  
           #endif        
           } 
     #ifdef SerialDbg 
     Serial1.println(F("do_send finished"));  
     #endif  
    }

The test with the sketch ends with the same result new log file is uploaded.

However I have an important feedback regarding the time, when the problem occurs. Please check the following log part with system time included:

[2024-11-14 20:55:49.721] Leaving onEvent [2024-11-14 20:55:49.721] 2132085603: engineUpdate, opmode=0x900 [2024-11-14 21:05:49.705] Prepare data for transmisson. [2024-11-14 21:05:49.705] Current interrupts:1000 [2024-11-14 21:05:49.721] -2125381671: engineUpdate, opmode=0x908 [2024-11-14 21:05:49.754] Pocket queued [2024-11-14 21:05:49.754] do_send finished [2024-11-14 21:05:49.771] -2125378661: engineUpdate, opmode=0x108 [2024-11-14 21:05:49.789] -2125376359: engineUpdate, opmode=0x108 [2024-11-14 21:05:49.805] -2125376140: engineUpdate, opmode=0x108 [2024-11-14 21:05:49.822] -2125375424: engineUpdate, opmode=0x108 [2024-11-14 21:05:49.855] -2125374001: engineUpdate, opmode=0x108

There are no lost lines, logging is continuous when the system time changes its sign. The log from @Gooseman42 in this thread has that similar pattern, when regular, timed upload is interrupted and, in my case changes to occasional one-two logs per day.

otaa.log

Regards, Robert

terrillmoore commented 6 days ago

@robepapp Thanks for the log. @Gooseman42 sorry, wanted to get v5 out, and so that took all my time this weekend.

@robepapp Your code needs to capture (and print) the result of calling LMIC_setTxData2() -- it returns non-zero for failure. If it fails, you may not get a callback.

However, in this case, it probably didn't fail.

The opmode shows that the LMIC has chosen a new channel (900 -> 908 -> 108) and is waiting to send data (opmode & 0x0008).

I think the thing to do is confirm the path the code is taking. Please change line 2726 of lmic.c from:

    LMIC_X_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP);

to

    LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t"\n", now, txbeg-TX_RAMPUP);

I'm pretty certain that this path is being taken. If you are feeling brave, please add a couple of extra items to that print:

    LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t
        ". globalDutyRate=%" LMIC_PRId_ostime_t 
        " globalDutyAvail=%" LMIC_PRId_ostime_t 
        " txend=%" LMIC_PRId_ostime_t"
        " txcChnl=%d "
        "band=%d "
        "band.avail=%" LMIC_PRId_ostime_t
        "\n",
        now, txbeg-TX_RAMPUP, LMIC.globalDutyRate, LMIC.globalDutyAvail, LMIC.txend,
        LMIC.txChnl, LMIC.channelFreq[LMIC.txChnl] & 0x3,
        LMIC.bands[LMIC.channelFreq[LMIC.txChnl] & 0x3].avail);

This will show exactly what is going into the incorrect decision and will probably lead to a patch.

Gooseman42 commented 5 days ago

no worries about prioritizing v5, any help is very much appreciated. I just added the lines to my code and have it running, results tonight.. Btw, the second code block resulted in some compiler errors, I had to reformat it a bit (but now it seems to work):

    LMIC_DEBUG_PRINTF("%"LMIC_PRId_ostime_t": next engine update in %"LMIC_PRId_ostime_t". globalDutyRate=%"LMIC_PRId_ostime_t" globalDutyAvail=%"LMIC_PRId_ostime_t" txend=%"LMIC_PRId_ostime_t" txcChnl=%d band=%d band.avail=%"LMIC_PRId_ostime_t"\n",
        now, txbeg-TX_RAMPUP, LMIC.globalDutyRate, LMIC.globalDutyAvail, LMIC.txend,
        LMIC.txChnl, LMIC.channelFreq[LMIC.txChnl] & 0x3,
        LMIC.bands[LMIC.channelFreq[LMIC.txChnl] & 0x3].avail); 
Gooseman42 commented 4 days ago

there we are.. I think I see the problem but I am too far away from the code to propose a solution..

2116093015: engineUpdate, opmode=0x908
2116093085: EV_TXSTART
2116093214: TXMODE, freq=868100000, len=26, SF=7, BW=125, CR=4/5, IH=0
Packet queued
2116408828: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 8
2116409474: RXMODE_SINGLE, freq=868100000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2116409991 rxtime: 2116409453 entry-rxtime: 538 now-entry: 8 rxtime-txend: 312375
2116471328: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 7
2116471973: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2116473770 rxtime: 2116471953 entry-rxtime: 1817 now-entry: 8 rxtime-txend: 374875
2116473829: processRx2DnData txrxFlags 0x2 --> 00
2116473934: processDnData_norx txrxFlags 00 --> 20
2116474216: EV_TXCOMPLETE (includes waiting for RX windows)
2116474225: engineUpdate, opmode=0x900
2127724225: engineUpdate, opmode=0x908
2127724296: EV_TXSTART
2127724424: TXMODE, freq=868100000, len=26, SF=7, BW=125, CR=4/5, IH=0
Packet queued
2128040038: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 8
2128040683: RXMODE_SINGLE, freq=868100000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2128041201 rxtime: 2128040663 entry-rxtime: 538 now-entry: 8 rxtime-txend: 312375
2128102538: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 7
2128103183: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2128104981 rxtime: 2128103163 entry-rxtime: 1818 now-entry: 8 rxtime-txend: 374875
2128105040: processRx2DnData txrxFlags 0x2 --> 00
2128105143: processDnData_norx txrxFlags 00 --> 20
2128105425: EV_TXCOMPLETE (includes waiting for RX windows)
2128105434: engineUpdate, opmode=0x900
2139355435: engineUpdate, opmode=0x908
2139355505: EV_TXSTART
2139355634: TXMODE, freq=868100000, len=26, SF=7, BW=125, CR=4/5, IH=0
Packet queued
2139671248: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 8
2139671892: RXMODE_SINGLE, freq=868100000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2139672410 rxtime: 2139671872 entry-rxtime: 538 now-entry: 8 rxtime-txend: 312375
2139733747: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 7
2139734392: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2139736190 rxtime: 2139734372 entry-rxtime: 1818 now-entry: 8 rxtime-txend: 374875
2139736248: processRx2DnData txrxFlags 0x2 --> 00
2139736363: processDnData_norx txrxFlags 00 --> 20
2139736635: EV_TXCOMPLETE (includes waiting for RX windows)
2139736646: engineUpdate, opmode=0x900
-2143980652: engineUpdate, opmode=0x908
-2143980634: next engine update in 2139767259
-2143980634: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
Packet queued
-2143968002: engineUpdate, opmode=0x108
-2143967783: next engine update in 2139767259
-2143967783: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-2143954882: engineUpdate, opmode=0x108
-2143954658: next engine update in 2139767259
-2143954658: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-2143941757: engineUpdate, opmode=0x108
-2143941533: next engine update in 2139767259
-2143941533: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-2143928632: engineUpdate, opmode=0x108
-2143928408: next engine update in 2139767259
-2143928408: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-2143915507: engineUpdate, opmode=0x108
-2143915283: next engine update in 2139767259
Gooseman42 commented 4 days ago

I actually left it running and the log showed a brief comeback but got stuck in what seems to be the same loop right away:

-7765254: next engine update in 2139767259
-7765254: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-7752400: engineUpdate, opmode=0x108
-7752191: next engine update in 2139767259
-7752191: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-7739338: engineUpdate, opmode=0x108
-7739129: next engine update in 2139767259
-7739129: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
-7726275: engineUpdate, opmode=0x108
-7726066: next engine update in 2139767259
-7726066: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051
2139767259: engineUpdate, opmode=0x108
2139767327: EV_TXSTART
2139767455: TXMODE, freq=868100000, len=26, SF=7, BW=125, CR=4/5, IH=0
2140083069: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 7
2140083713: RXMODE_SINGLE, freq=868100000, SF=7, BW=125, CR=4/5, IH=0
2140086532: decodeFrame txrxFlags 0x1 --> 21
2140086553: Received downlink, window=RX1, port=-1, ack=0, txrxFlags=0x21
2140086581: EV_TXCOMPLETE (includes waiting for RX windows)
2140086591: engineUpdate, opmode=0x910
2140086726: next engine update in 2140117205
2140086726: next engine update in 2140117205. globalDutyRate=0 globalDutyAvail=2140117830 txend=-354880570 txcChnl=0 band=1 band.avail=2140152875
Packet queued
-2143630684: next engine update in 2140117205
-2143630684: next engine update in 2140117205. globalDutyRate=0 globalDutyAvail=2140117830 txend=-354880570 txcChnl=0 band=1 band.avail=2140152875
-2143618053: engineUpdate, opmode=0x110
-2143617829: next engine update in 2140117205
-2143617829: next engine update in 2140117205. globalDutyRate=0 globalDutyAvail=2140117830 txend=-354880570 txcChnl=0 band=1 band.avail=2140152875
-2143604928: engineUpdate, opmode=0x110
-2143604704: next engine update in 2140117205
terrillmoore commented 3 days ago

The key printout:

-2143980634: next engine update in 2139767259. globalDutyRate=0 globalDutyAvail=2139767884 txend=-343980634 txcChnl=0 band=1 band.avail=2139741051

I cannot understand why txend is not "-2143980634" rather than "-343980634" -- the leading "214" is missing. It must be an artifact of debug prints. There's no easy way to change the hex value 0x803573A6 into 0xEB7F45A6. So I'm going to assume for the moment that it's not the problem.

"Next engine update" time matches globalDutyAvail - TX_RAMPUP. So we went through line 2587, because a randomized transmit time was chosen. It's in the past (in 32-bit arithmetic,

I bet the problem is at line 2601. In signed 32-bit arithmetic, txbeg - (now + TX_RAMPUP) is a negative number, -11,220,028. But it's possible that the compiler is getting confused by TX_RAMPUP (which is a complicated const expresssion) and somehow messing up. It doesn't always mess up -- only when txbeg is positive but now is negative. It also doesn't screw up on the compilers that MCCI is using -- this code is used everywhere on all platforms.

What version of GCC is being used on your port? I have seen exactly this kind of weird "optimization" problem -- arguably incorrect, but you can't win this kind of argument -- in newer compilers, in the name of either "efficiency" or "correctness". MCCI's BSPs use a relatively older version of the GCC (2017 or so).

My first thing to try would be to change line 2601 as follows:

if( (ostime_t) (txbeg - (now + TX_RAMPUP)) < 0 ) {

(In other words, explicitly cast the computation before comparing to zero. If by some chance it's been widened to int64_t, that will force it to be re-narrowed.)

If that doesn't work, I'd hit the compiler with a bigger hammer:

volatile ostime_t nonce = txbeg - (now + TX_RAMPUP);
if (nonce < 0) {

The volatile in this case (1) forces the compiler to store the expression in nonce, and (2) thereby forces the result to be treated as signed 32-bit. Then if (nonce < 0 { must be treated as signed 32-bit, because any peculiarity from the computation of nonce is, as it were, hidden by that explicit store.

If you are impatient, you can try the second fix first. But it would be good to try to narrow down exactly what is happening.

Gooseman42 commented 3 days ago

SoC is ESP32-S3, using Arduino 2.3.3. Per board manager and output window compile messages, the IDE is using the following tool (it also says version 2302 in the manager and compile window):

"xtensa-esp32s3-elf-g++.exe" --version
xtensa-esp32s3-elf-g++.exe (crosstool-NG esp-12.2.0_20230208) 12.2.0

Implementing fix (1) now..

terrillmoore commented 3 days ago

SoC is ESP32-S3, using Arduino 2.3.3. ... [g++ version] 12.2.0

I suppose the others (who are using STM32 but not MCCI's BSP) are also using newer compilers. It will be interesting to see what happens.

Gooseman42 commented 3 days ago

looks great, made it through the sign change:

2147268365: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 7
2147269010: RXMODE_SINGLE, freq=867500000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2147269528 rxtime: 2147268990 entry-rxtime: 538 now-entry: 8 rxtime-txend: 312375
2147330865: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 7
2147331510: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: 2147333308 rxtime: 2147331490 entry-rxtime: 1818 now-entry: 7 rxtime-txend: 374875
2147333366: processRx2DnData txrxFlags 0x2 --> 00
2147333470: engineUpdate, opmode=0x908
2147333684: next engine update in 2147364095
2147333684: next engine update in 2147364095. globalDutyRate=0 globalDutyAvail=2147364720 txend=-347633612 txcChnl=5 band=1 band.avail=2147338168
2147364095: engineUpdate, opmode=0x108
2147364164: EV_TXSTART
2147364293: TXMODE, freq=867500000, len=26, SF=7, BW=125, CR=4/5, IH=0
-2147287389: setupRx1 txrxFlags 00 --> 01
start single rx: now-rxtime: 8
-2147286744: RXMODE_SINGLE, freq=867500000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2147286226 rxtime: -2147286764 entry-rxtime: 538 now-entry: 8 rxtime-txend: 312375
-2147224889: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 8
-2147224244: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2147222446 rxtime: -2147224264 entry-rxtime: 1818 now-entry: 8 rxtime-txend: 374875
-2147222387: processRx2DnData txrxFlags 0x2 --> 00
-2147222267: processDnData_norx txrxFlags 00 --> 20
-2147221979: EV_TXCOMPLETE (includes waiting for RX windows)
-2147221968: engineUpdate, opmode=0x900
-2135971967: engineUpdate, opmode=0x908
-2135971886: EV_TXSTART
-2135971755: TXMODE, freq=867700000, len=26, SF=7, BW=125, CR=4/5, IH=0
Packet queued
-2135656140: setupRx1 txrxFlags 0x20 --> 01
start single rx: now-rxtime: 8
-2135655494: RXMODE_SINGLE, freq=867700000, SF=7, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2135654976 rxtime: -2135655515 entry-rxtime: 539 now-entry: 8 rxtime-txend: 312375
-2135593639: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 8
-2135592994: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2135591196 rxtime: -2135593015 entry-rxtime: 1819 now-entry: 7 rxtime-txend: 374875
-2135591138: processRx2DnData txrxFlags 0x2 --> 00
-2135591017: engineUpdate, opmode=0x908
-2135590797: next engine update in -2135560392
-2135590797: next engine update in -2135560392. globalDutyRate=0 globalDutyAvail=-2135559767 txend=-2135586351 txcChnl=0 band=1 band.avail=-2135586351
-2135560391: engineUpdate, opmode=0x108
-2135560323: EV_TXSTART
-2135560192: TXMODE, freq=868100000, len=26, SF=7, BW=125, CR=4/5, IH=0
-2135244578: setupRx1 txrxFlags 00 --> 01
start single rx: now-rxtime: 8
-2135243933: RXMODE_SINGLE, freq=868100000, SF=7, BW=125, CR=4/5, IH=0
-2135240789: decodeFrame txrxFlags 0x1 --> 21
-2135240769: Received downlink, window=RX1, port=-1, ack=0, txrxFlags=0x21
-2135240739: EV_TXCOMPLETE (includes waiting for RX windows)
-2135240724: engineUpdate, opmode=0x910
-2135240579: next engine update in -2135175400
-2135240579: next engine update in -2135175400. globalDutyRate=0 globalDutyAvail=-2135209490 txend=-2135174775 txcChnl=2 band=1 band.avail=-2135174775
-2135175400: engineUpdate, opmode=0x110
-2135175350: EV_TXSTART
-2135175227: TXMODE, freq=868500000, len=16, SF=7, BW=125, CR=4/5, IH=0
-2134860253: setupRx1 txrxFlags 0x21 --> 01
start single rx: now-rxtime: 7
-2134859608: RXMODE_SINGLE, freq=868500000, SF=7, BW=125, CR=4/5, IH=0
-2134856464: decodeFrame txrxFlags 0x1 --> 21
-2134856442: Received downlink, window=RX1, port=-1, ack=0, txrxFlags=0x21
-2134856414: EV_TXCOMPLETE (includes waiting for RX windows)
-2134856400: engineUpdate, opmode=0x910
-2134856252: next engine update in -2134825789
-2134856252: next engine update in -2134825789. globalDutyRate=0 globalDutyAvail=-2134825164 txend=-2134853783 txcChnl=6 band=1 band.avail=-2134853783
-2134825789: engineUpdate, opmode=0x110
-2134825739: EV_TXSTART
-2134825616: TXMODE, freq=867700000, len=14, SF=8, BW=125, CR=4/5, IH=0
-2134508706: setupRx1 txrxFlags 0x21 --> 01
start single rx: now-rxtime: 8
-2134508061: RXMODE_SINGLE, freq=867700000, SF=8, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2134507159 rxtime: -2134508081 entry-rxtime: 922 now-entry: 8 rxtime-txend: 312375
-2134446206: setupRx2 txrxFlags 0x1 --> 02
start single rx: now-rxtime: 8
-2134445561: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
rxtimeout: entry: -2134443763 rxtime: -2134445581 entry-rxtime: 1818 now-entry: 8 rxtime-txend: 374875
-2134443704: processRx2DnData txrxFlags 0x2 --> 00
-2134443583: processDnData_norx txrxFlags 00 --> 20
-2134443296: EV_TXCOMPLETE (includes waiting for RX windows)
-2134443282: engineUpdate, opmode=0x900
terrillmoore commented 3 days ago

Lovely, so it's a compiler "feature".

It suggests that I need to review all the code that uses ustoosticks(), because it's all potentially broken.

@Gooseman42 what board are you using for reference? Is it commercially available so I can add it to our menagerie and do additional testing?

Others (@robepapp, @mirhamza708): I will push a patch on a branch with this change:

if( (ostime_t) (txbeg - (now + TX_RAMPUP)) < 0 ) {

at line 2601 of lmic.c. If you can run tests and confirm that the problem is corrected, I'll declare victory and merge to HEAD and do a 5.0.1 release. I may need to do a 5.0.2 because I think I need to find all the ustoticks() and mstoosticks() and ensure that we've got a cast like the one at 2601 if needed.

mirhamza708 commented 2 days ago

Great that you found the issue. let me put a device for test today and will get back to you tomorrow with updates.

terrillmoore commented 2 days ago

@mirhamza708 wrote:

Great that you found the issue. let me put a device for test today and will get back to you tomorrow with updates.

Thanks. All, I have pushed a branch https://github.com/mcci-catena/arduino-lmic/tree/issue968, which contains the release candidate fix. Please test if you have a chance.

Gooseman42 commented 2 days ago

returned to the lab this morning to see the code run through without problems, congrats so far. I have seen however a funny thing with the debug messages (only showing those now):

329812532: next engine update in 330134792. globalDutyRate=0 globalDutyAvail=329843573 txend=330135417 txcChnl=7 band=1 band.avail=330135417
329812532: next engine update in 330134792. globalDutyRate=0 globalDutyAvail=329843573 txend=330135417 txcChnl=7 band=1 band.avail=330135417
342153522: next engine update in 342475785. globalDutyRate=0 globalDutyAvail=342184563 txend=342476410 txcChnl=1 band=1 band.avail=342476410
342153522: next engine update in 342475785. globalDutyRate=0 globalDutyAvail=342184563 txend=342476410 txcChnl=1 band=1 band.avail=342476410
354494515: next engine update in 354524932. globalDutyRate=0 globalDutyAvail=354525557 txend=-2140472781 txcChnl=1 band=1 band.avail=354817404
354494515: next engine update in 354524932. globalDutyRate=0 globalDutyAvail=354525557 txend=-2140472781 txcChnl=1 band=1 band.avail=354817404
366543660: next engine update in 366574086. globalDutyRate=0 globalDutyAvail=366574711 txend=-2128423636 txcChnl=1 band=1 band.avail=366866549
366543660: next engine update in 366574086. globalDutyRate=0 globalDutyAvail=366574711 txend=-2128423636 txcChnl=1 band=1 band.avail=366866549
378534416: next engine update in 378564912. globalDutyRate=0 globalDutyAvail=378565537 txend=-2116432880 txcChnl=1 band=1 band.avail=378915704

likely the same problem with txend but shows that other instances in the code might need the cast.

@Gooseman42 what board are you using for reference? Is it commercially available so I can add it to our menagerie and do additional testing?

@terrillmoore custom-designed, but basically a plain ESP32-S3 connected to an RFM95W without anything else, optimized for low deep-sleep. I could send you one if you really are interested :)

Gooseman42 commented 2 days ago

we are not completely out of the woods yet. I have not touched the system and yet we have a similar situation during the day. First element is a locally generated timestamp from the capture software, this is not coming from the code:

"22.11.2024 12:15:23",2144764449: engineUpdate, opmode=0x908
"22.11.2024 12:15:23",2144764519: EV_TXSTART
"22.11.2024 12:15:23",2144764647: TXMODE, freq=868300000, len=26, SF=8, BW=125, CR=4/5, IH=0
"22.11.2024 12:15:23",Packet queued
"22.11.2024 12:15:29",2145083478: setupRx1 txrxFlags 0x20 --> 01
"22.11.2024 12:15:29",start single rx: now-rxtime: 8
"22.11.2024 12:15:29",2145084123: RXMODE_SINGLE, freq=868300000, SF=8, BW=125, CR=4/5, IH=0
"22.11.2024 12:15:29",rxtimeout: entry: 2145085024 rxtime: 2145084102 entry-rxtime: 922 now-entry: 8 rxtime-txend: 312375
"22.11.2024 12:15:30",2145145978: setupRx2 txrxFlags 0x1 --> 02
"22.11.2024 12:15:30",start single rx: now-rxtime: 8
"22.11.2024 12:15:30",2145146623: RXMODE_SINGLE, freq=869525000, SF=9, BW=125, CR=4/5, IH=0
"22.11.2024 12:15:30",rxtimeout: entry: 2145148420 rxtime: 2145146602 entry-rxtime: 1818 now-entry: 8 rxtime-txend: 374875
"22.11.2024 12:15:30",2145148478: processRx2DnData txrxFlags 0x2 --> 00
"22.11.2024 12:15:30",2145148583: processDnData_norx txrxFlags 00 --> 20
"22.11.2024 12:15:30",2145148866: EV_TXCOMPLETE (includes waiting for RX windows)
"22.11.2024 12:15:30",2145148875: engineUpdate, opmode=0x900
"22.11.2024 12:18:30",-2138568421: engineUpdate, opmode=0x908
"22.11.2024 12:18:30",-2138568405: next engine update in 2145179490
"22.11.2024 12:18:30",-2138568405: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
"22.11.2024 12:18:30",Packet queued
"22.11.2024 12:18:30",-2138555736: engineUpdate, opmode=0x108
"22.11.2024 12:18:30",-2138555517: next engine update in 2145179490
"22.11.2024 12:18:30",-2138555517: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
"22.11.2024 12:18:30",-2138542616: engineUpdate, opmode=0x108
"22.11.2024 12:18:30",-2138542392: next engine update in 2145179490
"22.11.2024 12:18:30",-2138542392: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
"22.11.2024 12:18:30",-2138529491: engineUpdate, opmode=0x108
"22.11.2024 12:18:30",-2138529267: next engine update in 2145179490
"22.11.2024 12:18:30",-2138529267: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
"22.11.2024 12:18:30",-2138516366: engineUpdate, opmode=0x108
"22.11.2024 12:18:30",-2138516142: next engine update in 2145179490
"22.11.2024 12:18:31",-2138516142: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
"22.11.2024 12:18:31",-2138503241: engineUpdate, opmode=0x108
"22.11.2024 12:18:31",-2138503017: next engine update in 2145179490
"22.11.2024 12:18:31",-2138503017: next engine update in 2145179490. globalDutyRate=0 globalDutyAvail=2145180115 txend=-338568405 txcChnl=1 band=1 band.avail=2145471665
terrillmoore commented 2 days ago

@Gooseman42 I suspect we got lucky the first time. I will update my patch to include the volatile trick and will advise when it passes CI tests.

terrillmoore commented 2 days ago

@Gooseman42 change pushed, but need to pass the CI tests before you try it.

terrillmoore commented 2 days ago

@Gooseman42 -- It passed CI, so it's ok to try it.

Interestingly enough, looking at the logs, the other part of the LMIC knows that txbeg is in the past -- because it's immediately triggering the callback which calls engineUpdate() which ends up deciding that txbeg is in the future and then re-queues the event. I hope that introducing an aux variable will help. If not, I may need you to run objdump (an ELF analysis tool) to disassemble the function and look at what code is being generated. (This can also be done easily with GDB if you have that working -- just load up the image and disassemble the function, Because of optimization, it's safer to disassemble the whole containing function than to just look at the source line in question.)