mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License
638 stars 208 forks source link

Link dead, not transmitting #475

Open jpmeijers opened 4 years ago

jpmeijers commented 4 years ago

Lmic prints out the following message when I want to transmit data:

[09:12:39:200] TX user data (buffered in pendTxData)
[09:12:39:360] TX/RX transaction pending
[09:12:39:472] find a new channel
[09:12:39:567] link was reported as dead

This happens after the board has been running for a couple of days. Debugging it is therefore tricky as it is not easily reproducible. Any pointers where I should look would be appreciated.

jpmeijers commented 4 years ago

My transmit function is below. From what I see printed on my Oled (LoRa TXing) I can see that the code gets stuck in the while loop: while( (LMIC.opmode & OP_TXDATA) && (millis() - start < TX_TIMEOUT) )

Inside this loop print_lmic_opmode(); is called once per second. This call prints out the link dead message. After 60 seconds this times out. A while later another message is queued for TX, resulting in the same.

void tx_blocking(uint8_t* radioPacket, uint8_t packetLength)
{

  u8x8.setCursor(0, 2);
  u8x8.clearLine(2);
  u8x8.print("LoRa TX start");

  digitalWrite(ANTENNA_SWITCH, HIGH);

  uint16_t loopCount=0;
  do_send(&sendjob, radioPacket, packetLength);
  print_lmic_opmode();
  uint32_t start = millis();
  uint32_t prevPrint = 0;
  while( LMIC.opmode & OP_JOINING ) {
    loopCount++;
    os_runloop_once();
    if(millis() - prevPrint > 10000) {
      prevPrint = millis();
      print_lmic_opmode();
    }

    if(millis()>1000 && millis()-1000 > oledUptimeUpdated) {
      oledUptimeUpdated = millis();

      u8x8.setCursor(0, 2);
      u8x8.clearLine(2);
      u8x8.print("LoRa Joining");
      u8x8.setCursor(8, 1);
      u8x8.print(millis()/1000);
    }
  }
  while( (LMIC.opmode & OP_TXDATA) && (millis() - start < TX_TIMEOUT) )
  {
    loopCount++;
    os_runloop_once();
    if(millis() - prevPrint > 1000) {
      prevPrint = millis();
      print_lmic_opmode();
    }

    if(millis()>1000 && millis()-1000 > oledUptimeUpdated) {
      oledUptimeUpdated = millis();

      u8x8.setCursor(0, 2);
      u8x8.clearLine(2);
      u8x8.print("LoRa TXing");
      u8x8.setCursor(8, 1);
      u8x8.print(millis()/1000);
    }
  }
  os_runloop_once();
  print_lmic_opmode();
  os_runloop_once();
  print_lmic_opmode();

  digitalWrite(ANTENNA_SWITCH, LOW);

  u8x8.setCursor(0, 2);
  u8x8.clearLine(2);
  u8x8.print("TXdone: ");
  u8x8.print(millis()/1000);
}

Around the time when the timeout happens the UART output looks like this:

[09:24:16:005] TX/RX transaction pending
[09:24:16:117] find a new channel
[09:24:16:213] link was reported as dead
[09:24:16:325] TX user data (buffered in pendTxData)
[09:24:16:485] TX/RX transaction pending
[09:24:16:597] find a new channel
[09:24:16:677] link was reported as dead
[09:24:16:789] OP_TXRXPEND, not sending
[09:24:16:900] TX user data (buffered in pendTxData)
[09:24:17:060] TX/RX transaction pending
[09:24:17:172] find a new channel
[09:24:17:252] link was reported as dead
[09:24:17:364] TX user data (buffered in pendTxData)
[09:24:17:540] TX/RX transaction pending
[09:24:17:652] find a new channel
[09:24:17:732] link was reported as dead
terrillmoore commented 4 years ago

I'm tied up today with other things. But I would like to see the value (in hex) of LMIC.opmode during this transition. I'm take it you're running EU868. Are you sure you're not running into duty-cycle limitations? The new LMIC does a much better job of honoring, There's a dedicated job (LMIC.osjob) -- please print out the value of the LMIC.osjob.func as a pointer, then you can match that with the link map -- if necessary comment out the static on the job functions, there are only a handful. This is why I want to change the LMIC to use a true FSM (like what's in the compliance sketch) -- smaller, and easier to see what's going on.

terrillmoore commented 4 years ago

Reread and corrrected prev comment.

Several things are happening.

  1. for clarity, these messages are from your app, not from the LMIC.
  2. the LMIC has ADR enabled, and it's received no downlinks, so it's posted "link dead". This doesn't affect uplink at all. The LMIC continues honoring your requests. "link dead" will cause the LMIC to eventually try to rejoin.
  3. there is something slowing down uplinks. I suspect duty cycle; once you go to SF12, you run out of bandwidth very quickly and things get slow.
jpmeijers commented 4 years ago

Yes, using EU868.

For the time being I disabled Link checks (LMIC_setLinkCheckMode(0);). It will have to run for a week until I'm back from The Things Conf India, after which I can add more debug prints.

The duty cycle can have an effect, although I don't believe I'm transmitting that much. I however do not call os_runloop_once(); when I am not transmitting, but I believe the duty cycle is calculated using the system millis() and not only when os_runloop_once(); is called.

During the coarse of the past week I did disable my gateway's downlinks to test another issue with TTN. Could it be possible that LMIC's link check broke because of this?

For clarity sake I'm attaching my full radio management functions file. Perhaps I am doing something wrong that breaks LMIC assumptions.

radio.cpp.txt

terrillmoore commented 4 years ago

This is probably not your problem. However, I am not sure that you can get away with calling os_runloop_once() once a second, without first checking that there are no critical events scheduled during the next second. I would recommend calling os_queryTimeCriticalJobs(ms2osticks(1000)); while it's false, you should be calling the run loop much more often, at least every 10 ms. This is the design of the LMIC; there are hard real-time events, and there's no way to tell if one is pending. It would be better if it worked differently, and that's on the list; but for now, I'd not trust the LMIC to do the right thing unless you're sure it's not got something it needs to do.

Be aware that with the new MAC and downlink processing code, uplinks can get scheduled based on downlinks. This can go on for an unbounded period of time (for example, if the network sends a series of port-0 commands).

terrillmoore commented 4 years ago

until I'm back from The Things Conf India

Enjoy! Say hello to MCCI's @svelmurugan92 and @dhineshkumarmcci who are attending and staffing a pod at the exhibition.

jpmeijers commented 4 years ago

Adding os_runloop_once() to my main loop on Arduino and not going to sleep does seem to solve the problem. But because I want to know I can go into deep sleep after my tx_blocking() function returns, I added your suggestion to the end of that function. Now in theory when the tx_blocking() function returns I should be sure that it is safe to go into deep sleep mode for 8 seconds.

  // If we have any critical jobs to complete in the next 8 seconds, do not sleep but process them
  while(os_queryTimeCriticalJobs(ms2osticks(8000)) != 0)
  {
    loopCount++;
    os_runloop_once();
    if(millis() - prevPrint > 1000) {
      prevPrint = millis();
      print_lmic_opmode();
    }
  }
jpmeijers commented 4 years ago

Sorry this is going to be a little vague, but I wanted to post an update. I have a board based on an atmega processor. It transmits via LoRa every 10 minutes. In between it sleeps for 8 seconds using the watchdog timer. The above issue occurred in the past, but then I added the code to check for any critical jobs in the coming 8 seconds before going to sleep. Now what is happening is that after 11h10m it stops transmitting because LMIC thinks the state is still in OP_TXRXPEND, even though it should have been done a long time ago.

[10:47:12:167] Should be every 10 minutes
[10:47:12:241] OP_TXRXPEND, not sending
[10:47:12:241] TX/RX transaction pending
[10:47:12:241] find a new channel
terrillmoore commented 4 years ago

Probably the thing to do is print out the hex value of LMIC.opmode, along with the ostime_t time value of the pending job and the current ostime_t value. It seems possible that something is triggering a re-join due to no downlink traffic. That would legitimately leave the LMIC busy. That's also based on message count, and it looks as if you've transmitted 660 messages. I don't remember the count that triggers things, but it's close. You could find it easily with a search, I think. If not, I'll dig into it but it will have to be this weekend.