mcci-catena / arduino-lmic

LoraWAN-MAC-in-C library, adapted to run under the Arduino environment
https://forum.mcci.io/c/device-software/arduino-lmic/
MIT License
Library continually hangs on startrx() in radio.c #354

Closed Mark-Wills closed 5 years ago

Mark-Wills commented 5 years ago

Using the library with an ATMEGA328P, I keep observing that the library hangs on the following line:

ASSERT( (readReg(RegOpMode) & OPMODE_MASK) == OPMODE_SLEEP );

For context, here's the containing function:

    static void startrx (u1_t rxmode) {
        //ASSERT( (readReg(RegOpMode) & OPMODE_MASK) == OPMODE_SLEEP );
        if(getSf(LMIC.rps) == FSK) { // FSK modem
            rxfsk(rxmode);
        } else { // LoRa modem
            rxlora(rxmode);
        }
        // the radio will go back to STANDBY mode as soon as the RX is finished
        // or timed out, and the corresponding IRQ will inform us about completion.
    }

As you can see, I have commented the line out, just to see what happens. It actually works much better.

I also get UNKNOWN EVENT: 20 very often when attempting to join the network. When this happens, it remains in a join-attempt/unknown event loop until the CPU is reset.

cstratton commented 5 years ago

Indeed, asserting there does not seem to be a good strategy at all. Regardless of whether it is an SPI communication issue or a program state issue, it really should be "recoverable" by trying again without losing overall LoRa state (registration, fcount, etc.). Losing that state is the cost of asserting and then (hopefully) letting something like a watchdog reset occur.

As for what is actually causing it, do you know if SPI communication with the radio is actually working? I'd probably replace the assert with debug output to print what the register value actually is. And try to see if you can verify by something else that SPI communication with the radio is working (for example, uplink transmissions being received by a gateway).
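A rough sketch of what "replace the assert with debug output" could look like. The helper name `radio_is_asleep` is hypothetical and not part of arduino-lmic. The mask and sleep values are assumptions based on the SX1276 RegOpMode layout (mode field in bits 2..0, 000 = SLEEP) and should be checked against the definitions in radio.c. The helper only formats a message; how that message reaches a serial port is left to the caller.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed values: SX1276 RegOpMode keeps the mode in bits 2..0, 000 = SLEEP.
// Verify against the OPMODE_* definitions in radio.c before relying on them.
constexpr uint8_t OPMODE_MASK_ASSUMED  = 0x07;
constexpr uint8_t OPMODE_SLEEP_ASSUMED = 0x00;

// Hypothetical helper (not a library function). Returns true when the
// register value shows SLEEP mode; otherwise writes a diagnostic into msg
// and returns false so the caller can decide how to recover.
bool radio_is_asleep(uint8_t regOpMode, char *msg, std::size_t msgLen) {
    const unsigned mode = regOpMode & OPMODE_MASK_ASSUMED;
    if (mode == OPMODE_SLEEP_ASSUMED) {
        if (msgLen > 0) {
            msg[0] = '\0';  // nothing to report
        }
        return true;
    }
    std::snprintf(msg, msgLen,
                  "startrx: RegOpMode=0x%02X (mode bits=%u), expected SLEEP",
                  static_cast<unsigned>(regOpMode), mode);
    return false;
}
```

In startrx() the value passed in would come from readReg(RegOpMode); printing the message instead of asserting shows which mode the radio was actually in when the check failed.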

Mark-Wills commented 5 years ago

Chris,

Thanks for the reply. I can confirm that the unit does work because I am able to successfully join the LoRa network and send data (when the assert line mentioned above is commented out). It is prone to random lock-ups though. Have you had any reports of random lock-ups? It could be because the circuit is on a breadboard. I have just designed some carrier boards for the SX1276 with 2.54 mm pitch header pins to make things much easier and hopefully more reliable. They also include an SMA connector. If you would like a couple to help you with the library development, send me an email (mark wills 1970 at gmail dot com) and I'll pop a couple in the post to you (with an 868 MHz SX1276 unit pre-soldered).

I wonder if you could give me some pointers for enabling debug output in the library? I see from the docs that printf is required, but my Google-fu is obviously not up to scratch; I've not been able to uncover anything definitive for using printf in the Arduino (ATMEGA) environment. If you have any links or pointers I'd be most grateful. I tried just a regular Serial.println but the library does not compile (it doesn't recognise the Serial object). I probably need to #include something in the library for that to work, but I'm not sure what (one of the disadvantages of the Arduino pre-compiler hand-holding, I guess).

I'll report back in a week or so; I have family commitments this weekend which will prevent me from looking at the issue further until next week.

Kind regards

Mark

terrillmoore commented 5 years ago

Hi @Mark-Wills,

There are two distinct issues here.

The unknown event is because of the new events that are printed by the HEAD LMIC. That is EV_JOIN_TXCOMPLETE, which says that a join transmit/listen cycle is complete. It should be ignored. (If you're getting it a lot, it means you have timing accuracy problems, or receive downlink problems.)
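As an illustration only, a sketch's event handler could recognise the new event instead of falling through to the "unknown event" branch. The code below is a desktop-compilable stand-in: the constant name and `describe_event` are hypothetical, and the value 20 is taken from the "UNKNOWN EVENT: 20" output reported above. In a real sketch the `case` would go into the existing onEvent() switch and use the library's own EV_JOIN_TXCOMPLETE from the LMIC headers.

```cpp
// Stand-in for the library's event code. The value 20 is inferred from the
// "UNKNOWN EVENT: 20" output in this thread; a real sketch should use the
// EV_JOIN_TXCOMPLETE constant from the LMIC headers instead.
enum { EV_JOIN_TXCOMPLETE_ASSUMED = 20 };

// Hypothetical helper: maps an event code to the text a sketch might print.
const char *describe_event(unsigned ev) {
    switch (ev) {
    case EV_JOIN_TXCOMPLETE_ASSUMED:
        // Informational only: one join transmit/listen cycle has finished.
        return "EV_JOIN_TXCOMPLETE: join transmit/listen cycle complete";
    default:
        return "Unknown event";
    }
}
```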

The other issue you're seeing is almost certainly a hardware problem. It most likely means that the SX1276 is getting reset. Commenting out the assert allows things to seem to continue, but the code doesn't re-establish invariants, re-run the cal, etc., so if this happens, you're basically not likely to have things work until you stop and restart the LMIC altogether. There is a hook for catching ASSERTs now, but... you should find out what is causing that. I have never seen that on any of our boards.

Debug outputs on atmega... Did you try defining LMIC_DEBUG_LEVEL and possibly LMIC_PRINTF_TO as specified in the README.md at https://github.com/mcci-catena/arduino-lmic#changing-debug-output and following?
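For reference, a minimal configuration fragment along those lines. Where the defines go is an assumption: the README describes a project configuration header (lmic_project_config.h), but compiler flags may suit some build setups better. Serial is assumed to be the port the sketch already initialises.

```cpp
// Assumed location: the library's project configuration header
// (lmic_project_config.h), or equivalent -D compiler flags.
#define LMIC_DEBUG_LEVEL 1      // 0 = no output, 1 = limited, 2 = more extensive
#define LMIC_PRINTF_TO   Serial // serial object that should receive printf output
```

Keep in mind that debug output changes timing, so it is worth removing again once the problem is understood.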

Mark-Wills commented 5 years ago

Terry,

Many thanks for your prompt reply. I'm inclined to agree that the locking issue I am observing is probably hardware related. I'll endeavour to build something more robust and give it another try. It might be sensible to close this issue until I've had a chance to experiment further, especially if I'm the only person raising this issue.

Regarding the asserts... are they wise? There doesn't seem to be a way to recover from a failure. You mentioned a hook. Is this documented? (I'm very new to the library and not an expert in C, so my apologies if this is newbie stuff).

I did not change the LMIC_DEBUG_LEVEL and associated #defines because of the following paragraph in the documentation:

"If LMIC_DEBUG_LEVEL is zero, no output is produced. If 1, limited output is produced. If 2, more extensive output is produced. If non-zero, printf() is used, and the Arduino environment must be configured to support it, otherwise, the sketch will crash at runtime."

I don't know how to get printf working in an Arduino environment. Is it just a matter of downloading one of the myriad printf libraries that appear to be available and #including it in the LMIC library?

Regards

Mark

terrillmoore commented 5 years ago

My understanding is that the AVR libraries already include a suitable printf. However, ARM versions often do not. I avoid using printfs for debugging because they affect timing dramatically, and timing is everything with this library, and I don't use AVRs at all; so I have not really experimented with this. I have had it working on other platforms, and I know others have had success. Since you're using AVR, you should try setting LMIC_DEBUG_LEVEL to 1 and see what happens. If it doesn't work, you'll know right away.

cstratton commented 5 years ago

While LMiC does have timing-critical portions, that's mostly the task dispatch to receive pathway. I've had fairly good luck in the past with quite chatty debug prints outside of that, even including in the immediate aftermath of the TX path (it just needs to be ready to receive within a second). And it's fairly safe to print after the receive attempt has succeeded or failed.

From the standpoint of debugging the present issue, even debug prints in startrx() that lead to missing receive windows could temporarily be worthwhile to figure out what is going on with the radio chip state, if they are removed once that is solved. Given that the Serial class can't readily be used from C code one idea could be to whip up a couple of extern "C" functions that pass through to the null-terminated-string and hex byte variants.
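A minimal sketch of that pass-through idea, written so it also compiles on a desktop compiler. The function names `dbg_print_str` and `dbg_print_hex` are hypothetical and not part of arduino-lmic. The sink here appends to a string purely as a stand-in; on an Arduino build it would presumably forward to Serial.print() from a .cpp file in the sketch.

```cpp
#include <cstdint>
#include <string>

// Stand-in sink so the sketch runs off-target. On an Arduino build the body
// would forward to Serial.print(s) instead of appending to a string.
static std::string g_debug_log;
static void debug_sink(const char *s) { g_debug_log += s; }

extern "C" {

// Hypothetical helper: print a null-terminated string. C linkage makes it
// callable from C files such as radio.c.
void dbg_print_str(const char *s) {
    debug_sink(s);
}

// Hypothetical helper: print one byte as two upper-case hex digits.
void dbg_print_hex(uint8_t value) {
    static const char digits[] = "0123456789ABCDEF";
    const char buf[3] = { digits[value >> 4], digits[value & 0x0F], '\0' };
    debug_sink(buf);
}

}  // extern "C"
```

The C side would only need the two prototypes, for example `void dbg_print_str(const char *s);` and `void dbg_print_hex(uint8_t value);`, and could then print the raw RegOpMode value at the top of startrx().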

The radio being reset does seem like a good suspect; it could be a wiring issue, a pin mode issue, or even a power brownout that corrupts the radio but not the ATmega.

terrillmoore commented 5 years ago

I'm happy for others to use debug prints with this code. It does work fine as long as you're only doing uplink. Downlinks (joins in particular) are likely to stop working. The code is convoluted and there are paths that will nail you. I found this particularly when debugging the compliance sketch. For what I'm doing, the hardware is not an issue; I find that either recording the sequence of events (see the compliance sketch) or simply printing the state after a macro operation completes is much more informative. But I'm using production hardware, so I just don't worry about debugging low-level issues; they basically are not the problems I'm chasing.

terrillmoore commented 5 years ago

Hi @Mark-Wills -- is this resolved?

Mark-Wills commented 5 years ago

Hi @terrillmoore, I believe this issue is now fixed. I have moved from veroboard to a custom-designed PCB and it seems to be working fine, though I question the value of using ASSERT.

terrillmoore commented 5 years ago

Thanks for closing the request. I agree that ASSERT is not really helpful in practice. This is something inherited from the original IBM code base, and they did it to help with development. It's probably time to think about replacing them all, probably with the logging interface from the compliance efforts.