matthijskooijman / arduino-lmic

:warning: This library is deprecated, see the README for alternatives.
707 stars 649 forks source link

"Space Time Continuum" Problems With hal_ticks()? #15

Closed ReliableNetResults closed 8 years ago

ReliableNetResults commented 8 years ago

Hi,

Please forgive the facetious issue subject title - hopefully the reason for its naming will become apparent... :)

Long-story-short: hal_ticks() (in hal.cpp) can return a time in the past which (I currently believe) is causing my nodes to stall when left for long enough.

Observations: Calls to os_getTime() go negative when they get big enough and then at some point (I think!) are causing a stall in the HAL after "Packet queued" is printed on the Serial port. For example, I see the following in my Serial Monitor:

2145760172: EV_TXCOMPLETE (includes waiting for RX windows) Packet queued [Nxt Tick] -2147000220: EV_TXCOMPLETE (includes waiting for RX windows) Packet queued [More Ticks...] -992562099: EV_TXCOMPLETE (includes waiting for RX windows) Packet queued [Infinite stall after "Packet queued" is printed. Only a Microcontroller reset will resolve.]

Digging into the code...

It looks like the problem could be when subsequent calls to hal_ticks() straddle a "time travel" because the delta_time() function (in hal.cpp, line 161) checks for a relative time in the past (i.e. job deadline - current hal_ticks() time) and although hal_ticks() returns a u4_t, it is often set to an s4_t in various other placed in the code - I haven't checked for consistency everywhere.

I appreciate that the timing overflow issue has to be dealt with, since we can't just keep counting higher forever. But I don't understand the code well enough to be sure that when the overflow happens, current jobs and pending jobs all behave appropriately given the delicate spacing between transmissions.

My current attempt at a fix (read: dodgy hack!) is to check the proposed deadline time before setting it through the os_setTimedCallback whilst preserving the TX_INTERVAL, etc. and make sure that no "negative" times are set in the job queue.

What do you think?

Any help or advice would be much appreciated.

Thanks,

Andy

matthijskooijman commented 8 years ago

Hm, interesting. There have been some issues with overflow, which I fixed in https://github.com/matthijskooijman/arduino-lmic/commit/f19d2cb759bf4d3ba23239358c1752ff88bad6a4. I also remember testing this works correctly through a micros() overflow (by running for 2^32 us == 1.2 hours) and a hal_ticks() overflow (2^32 * 16us == 19 hours). Comparison of timestamps should work correctly, even in the face of overflow, because delta_time() casts its result to a signed integer, meaning a negative delta means "in the past", while a positive time is "in the future" (which works up to 2**31 ticks in either direction).

It is true that times are often cast to signed, which I'm not entirely fond of, but AFAICS it should not cause any problems.

Are you using the latest git version? Is the stall reproducible? At what time does it occur? What hardware are you using?

matthijskooijman commented 8 years ago

@ReliableNetResults, any chance you can answer my above questions?

matthijskooijman commented 8 years ago

I'm closing this for lack of response. I did a more extensive test and it seems that micros() overflow is correctly handled right now as well. If you have more info, feel free to reopen.