mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License

v4.1.1 causes device not to wake - device lockup during sleep #851

Closed: pomplesiegel closed this issue 2 years ago

pomplesiegel commented 2 years ago

Summary

With LMIC v4.1.1 our Feather M0 stops waking from sleep after ~2000 transmits/sleeps. The issue does not occur with v4.1.0, with all other variables kept the same.

We use the RTCZero library to sleep between transmits. We have stayed on RTCZero v1.5.3 since July 2020, because RTCZero v1.6 caused a sleep issue of its own (additional info on that here).

Everything functions perfectly with LMIC v4.1.0, but with v4.1.1 the device stops waking from sleep after about 1000-2000 transmit/sleep cycles. The device locks up during sleep and must be physically reset.

Has something been changed which would affect interrupts/sleep behavior on an M0 ATSAMD21G18?

Thank you for all your hard work!

Environment

LMIC = v4.1.1
Arduino IDE 1.8.19
Arduino SAMD BSP 1.8.12
Adafruit SAMD BSP 1.7.9
RTCZero 1.5.3 - https://github.com/arduino-libraries/RTCZero/ - de4016b (1.6.0 still has sleeping issue)
TTN - US915
Adafruit Feather M0 LORA (includes HopeRF RFM95CW radio)

terrillmoore commented 2 years ago

The only thing I can suggest is to use git bisect to try to find the offending change. I know of nothing that would affect this. (The GitHub version comparison shows that the only code changes are in the processing of channel masks on downlinks, and these only affect messages sent by ChirpStack, not TTN.)

Can you attach an SWD debugger to the Feather and break in with GDB at the point of the hang? I think the notes from https://github.com/mcci-catena/Catena-Sketches/blob/master/extra/HOWTO-DEBUG-WITH-GDB.md would also apply to using a J-Link (allowing for differences in setting up the port).

BTW: How often are you sending uplinks?

pomplesiegel commented 2 years ago

Thank you! I'll check this out.

We are sending uplinks quite frequently, as this is a "stress test" in an isolated RF environment, as a regression test to watch for stability issues.

Device's behavioral loop: Measure every 5s, sleeping in-between measurements and transmitting after every 4th measurement (every 20 seconds).

terrillmoore commented 2 years ago

Is the isolated RF environment also using TTN as the network server?

pomplesiegel commented 2 years ago

Yes, but TTI technically (the free 30-device tier).

terrillmoore commented 2 years ago

I recommend you call os_queryTimeCriticalJobs() [and avoid using the LMIC delay queue for anything other than LMIC internal purposes -- this is despite the fact that we use them in the example sketches]. Don't sleep until os_queryTimeCriticalJobs(sleepTime) returns 0 for the selected sleep interval.
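A minimal sketch of that gating logic, written against injected callbacks so it can be exercised off-target. `SleepHooks` and `serviceThenMaybeSleep` are hypothetical names, not LMIC APIs. On a device, the callbacks would presumably wrap os_runloop_once(), os_queryTimeCriticalJobs(), and whatever RTC-based sleep routine the application already has.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical glue, not part of LMIC. Each callback stands in for a real call:
//   runLoopOnce        -> os_runloop_once()
//   criticalJobsWithin -> os_queryTimeCriticalJobs(ticks) != 0
//   sleepFor           -> the application's own sleep routine (assumed)
struct SleepHooks {
    std::function<void()> runLoopOnce;
    std::function<bool(int32_t)> criticalJobsWithin;
    std::function<void(int32_t)> sleepFor;
};

// Service LMIC first, then sleep only if nothing time-critical is due
// within sleepTicks. Returns true if the device actually slept.
bool serviceThenMaybeSleep(const SleepHooks &hooks, int32_t sleepTicks) {
    hooks.runLoopOnce();
    if (hooks.criticalJobsWithin(sleepTicks)) {
        return false;  // stay awake; the caller should try again shortly
    }
    hooks.sleepFor(sleepTicks);
    return true;
}
```

When the function returns false, the main loop would simply keep calling it until the pending LMIC work has drained.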

By the way: remember that (as documented in the 4.1.1 README, but as has been true for a long time), interrupts from the radio do not cause the LMIC to actually do work; they just set flags and record time stamps so that the next time you call os_runloop_once(), the LMIC will process the interrupts.
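To illustrate the general pattern being described (and only the pattern: this is not LMIC's actual implementation, and every name in it is hypothetical), an interrupt handler can record a flag and a timestamp and leave the real work to a later poll from the main loop:

```cpp
#include <cstdint>

// Illustration of the deferred-interrupt pattern only; all names are
// hypothetical and this is not how LMIC itself is written.
struct DeferredRadioEvent {
    volatile bool pending = false;
    volatile uint32_t timestamp = 0;

    // Called from the interrupt handler: record when it happened, do no work.
    void onInterrupt(uint32_t now) {
        timestamp = now;
        pending = true;
    }

    // Called from the main loop (the role os_runloop_once() plays for LMIC).
    // Returns true and reports the recorded time if an event was waiting.
    // A real implementation would mask interrupts around this read-and-clear.
    bool poll(uint32_t &when) {
        if (!pending) {
            return false;
        }
        when = timestamp;
        pending = false;
        return true;
    }
};
```

The practical consequence for a sleeping device is that a radio interrupt waking the MCU is not enough on its own; os_runloop_once() has to run afterwards before the event is acted on.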

Also, bear in mind that with the 5-second delay in TTN v3 for RX1 (and 6 seconds for RX2), you're not going to be able to sleep as nicely with v3. You'll need to sleep at most 4 seconds after the uplink transmit completes, wake up for RX1, then (since typically there's no downlink) sleep another 0.8 seconds, wake up for RX2, and then sleep for the next little bit. It's hard to predict.
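The arithmetic behind those figures can be sketched as follows. The comment does not say how the 4 s and 0.8 s values were derived, so the 1-second wake margin and 200 ms RX1 listening time used in the usage note are assumptions chosen to reproduce them. `SleepPlan` and `planSleeps` are hypothetical names.

```cpp
#include <cstdint>

// Hypothetical helper; all times are in milliseconds.
struct SleepPlan {
    int32_t beforeRx1Ms;  // sleep after TX complete, before waking for RX1
    int32_t betweenRxMs;  // sleep after an empty RX1, before waking for RX2
};

// rx1DelayMs:   delay from end of uplink to RX1 (5000 for TTN v3, per the text)
// rx2GapMs:     gap between the RX1 and RX2 openings (1000, i.e. 6 s minus 5 s)
// wakeMarginMs: how early to wake before RX1 (assumed, picked by the caller)
// rx1ListenMs:  time spent awake for RX1 (assumed, picked by the caller)
SleepPlan planSleeps(int32_t rx1DelayMs, int32_t rx2GapMs,
                     int32_t wakeMarginMs, int32_t rx1ListenMs) {
    SleepPlan plan;
    plan.beforeRx1Ms = rx1DelayMs - wakeMarginMs;
    plan.betweenRxMs = rx2GapMs - rx1ListenMs;
    return plan;
}
```

With `planSleeps(5000, 1000, 1000, 200)` this yields a 4000 ms sleep before RX1 and an 800 ms sleep before RX2, matching the numbers above.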

There is no way with the stock LMIC to find out how long you can sleep other than to poll os_queryTimeCriticalJobs(). It's on the list to make this easier, but there are some subtleties and I wanted to defer dealing with those. (You can, of course, modify the HAL to add an API to query the queue and find the delay.)
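One way to do that polling, as a rough sketch: start from the longest sleep you would like and halve it until the query reports nothing due. `longestSafeSleep` is a hypothetical helper, and the callback stands in for something like `os_queryTimeCriticalJobs(ticks) != 0`. The halving assumes that if a job is due within some interval, it is also due within any longer one.

```cpp
#include <cstdint>
#include <functional>

// Hypothetical helper, not part of LMIC. criticalJobsWithin(t) should return
// true when a time-critical job is due within the next t ticks.
// Returns the first candidate interval (maxTicks, maxTicks/2, ...) with no
// critical job due, or 0 if every candidate down to minTicks is blocked.
int32_t longestSafeSleep(const std::function<bool(int32_t)> &criticalJobsWithin,
                         int32_t maxTicks, int32_t minTicks) {
    for (int32_t t = maxTicks; t >= minTicks && t > 0; t /= 2) {
        if (!criticalJobsWithin(t)) {
            return t;
        }
    }
    return 0;  // not worth sleeping; keep running the LMIC loop instead
}
```

Halving trades accuracy for few queries: the result can be up to a factor of two shorter than the true safe interval, which only costs an extra wake-up.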

pomplesiegel commented 2 years ago

After investigation I believe this is actually totally unrelated to v4.1.1 and just a very annoying coincidence. Sorry for the confusion and wasted time!

terrillmoore commented 2 years ago

@pomplesiegel no worries... "it's wireless..."