mcci-catena / arduino-lorawan

User-friendly library for using arduino-lmic with The Things Network and other LoRaWAN® networks
MIT License
274 stars 54 forks source link

Kernel panic on Arduino v2.0 #181

Open olicooper opened 2 years ago

olicooper commented 2 years ago

The library worked fine on Arduino 1.0.6, but after upgrading to v2.0.0 the device reboots on the next loop cycle after calling Arduino_LoRaWAN::SendBuffer(). I have a Lora wrapper class which simply calls Arduino_LoRaWAN::loop() - it is failing at this point.

environment:

code

void Lora::NetBeginRegionInit() {
    this->Super::NetBeginRegionInit();
}

void Lora::NetJoin() {
    this->Super::NetJoin();
    SetLinkCheckMode(true);
    LMIC_setPingable(4);
}

void Lora::NetTxComplete() {
    this->Super::NetTxComplete();
}

bool Lora::GetOtaaProvisioningInfo(OtaaProvisioningInfo *pInfo) {
    if (pInfo != nullptr) {
        memcpy_P(pInfo->AppKey, APPKEY, sizeof(pInfo->AppKey));
        memcpy_P(pInfo->DevEUI, DEVEUI, sizeof(pInfo->DevEUI));
        memcpy_P(pInfo->AppEUI, APPEUI, sizeof(pInfo->AppEUI));
    }
    return true;
}

void Lora::loop() {
    if (!enabled) return;
    Super::loop(); // <-- Failing here
}

// ... then 'basicStatePublishJob' calls SendBuffer SendBuffer(dataBuffer, dataBuffer_s, pDoneFn, (void )this, /confirmed/ 0, /port*/ 1);


### core dump
```log
Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400f4ed7  PS      : 0x00060030  A0      : 0x800f4f55  A1      : 0x3ffce000
A2      : 0x4002f56e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x0013a16a  
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc7c18
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7  
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x4002f56e  LBEG    : 0x400f4e88  LEND    : 0x400f4ee6  LCOUNT  : 0x00000004  

0x400f4ed4: Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369
 (inlined by) ?? at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:359
 (inlined by) Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:114
0x400f4f52: Arduino_LoRaWAN::SaveSessionState() at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:190
0x400f4b0b: Arduino_LoRaWAN::StandardEventProcessor(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:925
 (inlined by) Arduino_LoRaWAN::StandardEventProcessor(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:214
0x400f4be2: Arduino_LoRaWAN::DispatchEvent(unsigned int) at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:123
0x400f4c24: onEvent at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_begin.cpp:109
0x400f07fd: reportEventNoUpdate at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:481
0x400f1d3d: reportEventAndUpdate at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:466
0x400f203a: startJoining at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/lmic.c:2142
0x400f308c: os_runloop_once at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI LoRaWAN LMIC library/src/lmic/oslmic.c:157
0x400f4d6b: Arduino_LoRaWAN::loop() at F:\myproject/.pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_loop.cpp:42
0x400d5c03: Lora::loop() at F:\myproject/src/lora.cpp:156
0x400d680e: loop() at F:\myproject/src/main.cpp:196
0x400fb8c4: loopTask(void*) at C:/Users/Oliver/.platformio/packages/framework-arduinoespressif32/cores/esp32/main.cpp:46

Related source code

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L190

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L114

https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/Arduino_LoRaWAN.h#L359-L371

terrillmoore commented 2 years ago

@olicooper sorry for the trouble. Thanks for the detailed report. Must be an portability problem revealed by the new compiler. The code looks correct, so you might have found a compiler bug. If you can get the assembly-language instructions preceding the fault, it will help with diagnosis. I don't have a test platform, unfortunately.

olicooper commented 2 years ago

Thanks for the quick response. I will have to look in to how I can the the ASL for you because I don't have a debugging facility for the device and I haven't done it before. I created a simple platformio project earlier today which shows how I got Arduino v2.0.0 working with PlatformIO which might help you. The readme has detail on the compiler versions etc.

terrillmoore commented 2 years ago

We haven't tested on Arduino v2.0.0, and I unfortunately don't even use PlatformIO, though of course we support it -- it doesn't have support for our BSPs. And.. this is the first I've heard of Arduino 2.0. So I'm really out in the weeds and not able to invest much time short term.

You can possibly use objdump on the file arduino_lorawan_sessionstate.o to get the disassembly. But it depends on the tool chain. You'll have to research the command line switches on your version. objdump will be with gcc. There is an option to disassmble, and to include line numbers, and possibly even to include source code. The .o file will be somewhere in your build directory.

terrillmoore commented 2 years ago

I see this is Arduino v2 -- which is still beta. That's good, because if you can identify a compiler issue, they can back up to a working release or try to get a fix. Since freq is a uint8_t &, the compiler should probably not be making assumptions about alignment; but from the fault is clearly is. (It's doing a store-halfword to an odd addresss, almost certainly.)

This happens from time to time; the compiler writers find a new trick that ought to work, is arguably portable; and then we find out where that breaks existing code. There is in fact a union involved in this part of the code, and it's possible that this is confusing the compiler or me into thinking that an operation is safe. No pointers involved so I think the compiler has dropped a stitch, but it's possible I've once again been too clever by half.

Were there no templates involved, I would say that you should try splitting out the inline (since you're on an ESP32, there's no real need). But... there's a template... no wonder the compiler is doing the wrong thing.

But the template and channel numbers were only to do type checking, so we can refactor.

In the header file:

// outside the template, e.g. at line 275:
static bool setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32t_t frequency);

Then change setFrequency:

bool setFrequency(uint8_t (&freq)[nCh * 3], unsigned iCh, uint32_t frequency)
    {
    // check the template parameter while it's in scope.
    if (iCh > nCh)
           return false;

    // call an external function to make it harder for optimizer to do the wrong thing (for test)
    return Arduino_LoRaWAN::setFrequencyHelper(freq, ich, frequency);
    }

And then in a separate C++ file:

// this is the former body of setFrequency(), with &freq changed to pFreq
/* static */
bool Arduino_LoRaWAN::setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32_t frequency)
                        {
                        const uint32_t reducedFreq = frequency / 100;
                        if (reducedFreq > 0xFFFFFFu)
                                return false;

                        auto const chPtr = pFreq + iCh * 3;
                        chPtr[0] = uint8_t(reducedFreq >> 16);
                        chPtr[1] = uint8_t(reducedFreq >> 8);
                        chPtr[2] = uint8_t(reducedFreq);
                        }

If this fixes the problem, it may make it easier for the compiler people to either tell me what I've gotten wrong, or for them to figure out what has gone wrong.

terrillmoore commented 2 years ago

Sorry found an error in review; moved an if from the external function to the inline.

olicooper commented 2 years ago

I've managed to generate some assembly code for you but there are too many lines to strip sensitive info so I have emailed them to you instead. I hope that is okay?

How to get assembly code

For community reference, I was able to use Espressif's SDK to get the assembly code. It looks like it outputs the source code too. There are multiple ways of doing it but I used the following methods.

PlatformIO (the one I used)

  1. Add build flags to your [env:<env_name>] in platformio.ini
    build_flags = 
           -save-temps
           -fverbose-asm
  2. Clean and build as usual! You'll see a tonne of .s/.i/.ii files in the project root directory.. I moved them with a PS command like `mv .i,.ii,.s .\disassembly`

Command line

Either run the command on the firmware.elf or on an individual object file (but this won't include all code after linking?)..

  1. <espressif_sdk_dir>\toolchain-xtensa-esp32\bin\xtensa-esp32-elf-objdump.exe -S -d "<project_build_dir>\MCCI Arduino LoRaWAN Library\lib\arduino_lorawan_sessionstate.cpp.o" > disassembly.txt
terrillmoore commented 2 years ago

@olicooper email is fine, but it hasn't come yet. Checked spam filters also, nothing there.

olicooper commented 2 years ago

Sorry about that, I think gmail is causing the issue. I've resent it.

terrillmoore commented 2 years ago

got it.

terrillmoore commented 2 years ago

can you send the corresponding link map?

terrillmoore commented 2 years ago

The problem is not what I thought. The code for doing the store is correct. Removing the debug .loc stuff:

# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:368:                         chPtr[0] = uint8_t(reducedFreq >> 16);
    s8i a4, a2, 0   # *chPtr_113, tmp200
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369:                         chPtr[1] = uint8_t(reducedFreq >> 8);
    s8i a12, a2, 1  # MEM[(unsigned char *)chPtr_113 + 1B], tmp201
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:370:                         chPtr[2] = uint8_t(reducedFreq);
    s8i a3, a2, 2   #, reducedFreq

Those are three byte writes to the pointer. So I need the link map to find exactly where things are failing.

But... could this be related to initialization and switching back and forth (described in other report)? The LMIC does not do much checking for invalid state, and since the compiled code is correct, my guess is that some preconditions we are depending on are not true.

olicooper commented 2 years ago

In my code I read from NVS storage during boot to decide if I should start the lora 'module' or not. If I have enabled the module it will start normally by calling the required Arduino_LoRaWAN::begin(&pinMap); function, then calling loop() as normal. The only other thing I do is test if there is internet connection and then start the lora module if there isn't - the idea being that it can act as a backup communication method. However, there is no runtime switching happening - I have to reboot the device to switch it on/off. I should note that the code works fine with Arduino 1.0.6. I have sent you an updated email with all output. Let me know if there is anything else you need me to do. Thank you!

olicooper commented 2 years ago

To simplify things I have created a simple reproducible example for you which can be found here on the lorawan-test branch: https://github.com/olicooper/arduino-v2-esp32-platformio/tree/lorawan-test I will also email you the elf and assembly code.

========= BOOT INFO =========

MAC: FF:FF:FF:FF:FF:FF
BootPart: offset=0x10000 size=0x140000
IDF version: v4.4-dev-2313-gc69f0ec32
Free memory: 241704 bytes

=============================

NetBeginRegionInit
Publishing state
Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400d5e97  PS      : 0x00060330  A0      : 0x800d5f17  A1      : 0x3ffb25b0
A2      : 0x40013b1e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x000028d8
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc0f94
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e  LBEG    : 0x400d5e48  LEND    : 0x400d5ea6  LCOUNT  : 0x00000004

Backtrace:0x400d5e94:0x3ffb25b00x400d5f14:0x3ffb25e0 0x400d5aeb:0x3ffb26e0 0x400d5bbe:0x3ffb2700 0x400d5c00:0x3ffb2720 0x400d1855:0x3ffb2740 0x400d2d89:0x3ffb2760 0x400d3086:0x3ffb2780 0x400d40bc:0x3ffb27a0 0x400d5d3b:0x3ffb27c0 0x400d14a9:0x3ffb27e0 0x400d15fe:0x3ffb2800 0x400d7900:0x3ffb2820
  #0  0x400d5e94:0x3ffb25b00 in Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369
      (inlined by) Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:359
      (inlined by) Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:114
olicooper commented 2 years ago

I know you said that the problem wasn't what you originally thought, but I tried the modifications you suggested above https://github.com/mcci-catena/arduino-lorawan/issues/181#issuecomment-944688542 and it fixed the issue!.. so I have updated the lmic code here https://github.com/olicooper/arduino-lorawan/tree/arduino-v2-fix and I will try the changes in my main project as I haven't fully tested it yet. Great work! What's the next step? Is it possible for a patch be written to your library to help in the interim if this is the culprit?

terrillmoore commented 2 years ago

Well, there's no real problem with a patch, but... it looks like a compiler issue to me, and should be brought to their attention, as it seems unlikely that this would only affect my code (out of the whole world). First, however, I can believe, especially as this is a beta project. Since the projects are all open source, it should not be necessary to do the usual heroic efforts needed to demo a compiler issue to the maintainers. (With global optimization, anyway, tomorrow they could apply the same change globally and crash the library again.)

igrr commented 2 years ago

Hi @olicooper and @terrillmoore, I'd like to point out one thing from the original crash report:

Guru Meditation Error: Core  1 panic'ed (LoadStoreError). Exception was unhandled.

Core  1 register dump:
PC      : 0x400d5e97  PS      : 0x00060330  A0      : 0x800d5f17  A1      : 0x3ffb25b0
A2      : 0x40013b1e  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x000028d8
A6      : 0xfffffffc  A7      : 0x00000003  A8      : 0x00000000  A9      : 0x3ffc0f94
A10     : 0x00000010  A11     : 0x00000016  A12     : 0x00000000  A13     : 0x00040df7
A14     : 0x63ffffff  A15     : 0x00000003  SAR     : 0x0000001b  EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e  LBEG    : 0x400d5e48  LEND    : 0x400d5ea6  LCOUNT  : 0x00000004

Here EXCVADDR points to the memory address which the load or store instruction has tried to access. On the ESP32, 0x40013b1e is an address inside ROM, in the middle of .text section. It looks very odd that the code of this library would be performing a load or especially store from this address. It doesn't look like an alignment issue, because even if the alignment was correct, it wouldn't make sense to access (read or write) this ROM location. To me it looks like memory corruption, where some pointer gets overwritten by a garbage value, and then the value gets dereferenced.

@olicooper since you have a reproducer for this issue, could you please upload the ELF file without the workaround and the matching crash dump? I am sorry that we aren't able to reproduce based on your example, because it requires PlatformIO and seemingly a LoRa module...

olicooper commented 2 years ago

@igrr I mentioned on the other ticket that this should be the correct ELF, but I will recompile the elf again against the Arduino main branch to be sure. You are correct that you'd need a LoRa module to execute this code which does make it more challenging. The odd thing about it is that I can compile the exact same code on Arduino v1.0.6 and it works perfectly. Switching to v2.0.0 (and subsequently IDF 4.3) cases the crash.

igrr commented 2 years ago

Thanks, I have checked the ELF you posted, and like @terrillmoore has observed here I don't see anything wrong with code generation, at least as far as alignment is concerned.

The PC at the point of the exception points to:

   0x400d5e97 <+319>:   s8i a4, a2, 0

which is a store instruction, which should most definitely not be trying to write to an area in ROM. I have tried tracing where the corrupted value could come from by looking at the assembly code, but didn't get too far unfortunately.

A compiler bug is a rare occasion but if there is a strong hint that one is present we will definitely investigate this. However, at the moment it looks more like heap corruption. Upgrading from one Arduino-esp32 and IDF release to another can change many things, including the layout of objects in heap. This might cause, for example, an out-of-bounds write to show up in such a way, while previously it could be corrupting some padding space or a mostly-unused variable.

igrr commented 2 years ago

Thanks for posting the two ELF files, I will definitely look more into the differences between them.

olicooper commented 2 years ago

@terrillmoore Quick update, I have tested against an the v4.0.2 release of PlatformIO today. It still seems to be broken, but I get an assert issue instead of a kernel panic now.. https://github.com/espressif/arduino-esp32/issues/5783#issuecomment-1116121168

The assert that is failing is this: https://github.com/mcci-catena/arduino-lmic/blob/8d378ea410887d3fb08ea2c9acce95cc4c047788/src/lmic/radio.c#L1164

terrillmoore commented 2 years ago

The assert issue is due to calling radio_rand1() before the buffer is initialized. Not sure why that might be happening.