Open olicooper opened 2 years ago
@olicooper sorry for the trouble. Thanks for the detailed report. Must be an portability problem revealed by the new compiler. The code looks correct, so you might have found a compiler bug. If you can get the assembly-language instructions preceding the fault, it will help with diagnosis. I don't have a test platform, unfortunately.
Thanks for the quick response. I will have to look in to how I can the the ASL for you because I don't have a debugging facility for the device and I haven't done it before. I created a simple platformio project earlier today which shows how I got Arduino v2.0.0 working with PlatformIO which might help you. The readme has detail on the compiler versions etc.
We haven't tested on Arduino v2.0.0, and I unfortunately don't even use PlatformIO, though of course we support it -- it doesn't have support for our BSPs. And.. this is the first I've heard of Arduino 2.0. So I'm really out in the weeds and not able to invest much time short term.
You can possibly use objdump
on the file arduino_lorawan_sessionstate.
o to get the disassembly. But it depends on the tool chain. You'll have to research the command line switches on your version. objdump
will be with gcc
. There is an option to disassmble, and to include line numbers, and possibly even to include source code. The .o
file will be somewhere in your build directory.
I see this is Arduino v2 -- which is still beta. That's good, because if you can identify a compiler issue, they can back up to a working release or try to get a fix. Since freq is a uint8_t &, the compiler should probably not be making assumptions about alignment; but from the fault is clearly is. (It's doing a store-halfword to an odd addresss, almost certainly.)
This happens from time to time; the compiler writers find a new trick that ought to work, is arguably portable; and then we find out where that breaks existing code. There is in fact a union involved in this part of the code, and it's possible that this is confusing the compiler or me into thinking that an operation is safe. No pointers involved so I think the compiler has dropped a stitch, but it's possible I've once again been too clever by half.
Were there no templates involved, I would say that you should try splitting out the inline (since you're on an ESP32, there's no real need). But... there's a template... no wonder the compiler is doing the wrong thing.
But the template and channel numbers were only to do type checking, so we can refactor.
In the header file:
// outside the template, e.g. at line 275:
static bool setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32t_t frequency);
Then change setFrequency:
bool setFrequency(uint8_t (&freq)[nCh * 3], unsigned iCh, uint32_t frequency)
{
// check the template parameter while it's in scope.
if (iCh > nCh)
return false;
// call an external function to make it harder for optimizer to do the wrong thing (for test)
return Arduino_LoRaWAN::setFrequencyHelper(freq, ich, frequency);
}
And then in a separate C++ file:
// this is the former body of setFrequency(), with &freq changed to pFreq
/* static */
bool Arduino_LoRaWAN::setFrequencyRaw(uint8_t *pFreq, unsigned iCh, uint32_t frequency)
{
const uint32_t reducedFreq = frequency / 100;
if (reducedFreq > 0xFFFFFFu)
return false;
auto const chPtr = pFreq + iCh * 3;
chPtr[0] = uint8_t(reducedFreq >> 16);
chPtr[1] = uint8_t(reducedFreq >> 8);
chPtr[2] = uint8_t(reducedFreq);
}
If this fixes the problem, it may make it easier for the compiler people to either tell me what I've gotten wrong, or for them to figure out what has gone wrong.
Sorry found an error in review; moved an if from the external function to the inline.
I've managed to generate some assembly code for you but there are too many lines to strip sensitive info so I have emailed them to you instead. I hope that is okay?
For community reference, I was able to use Espressif's SDK to get the assembly code. It looks like it outputs the source code too. There are multiple ways of doing it but I used the following methods.
[env:<env_name>]
in platformio.ini
build_flags =
-save-temps
-fverbose-asm
Either run the command on the firmware.elf
or on an individual object file (but this won't include all code after linking?)..
<espressif_sdk_dir>\toolchain-xtensa-esp32\bin\xtensa-esp32-elf-objdump.exe -S -d "<project_build_dir>\MCCI Arduino LoRaWAN Library\lib\arduino_lorawan_sessionstate.cpp.o" > disassembly.txt
@olicooper email is fine, but it hasn't come yet. Checked spam filters also, nothing there.
Sorry about that, I think gmail is causing the issue. I've resent it.
got it.
can you send the corresponding link map?
The problem is not what I thought. The code for doing the store is correct. Removing the debug .loc
stuff:
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:368: chPtr[0] = uint8_t(reducedFreq >> 16);
s8i a4, a2, 0 # *chPtr_113, tmp200
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369: chPtr[1] = uint8_t(reducedFreq >> 8);
s8i a12, a2, 1 # MEM[(unsigned char *)chPtr_113 + 1B], tmp201
# .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:370: chPtr[2] = uint8_t(reducedFreq);
s8i a3, a2, 2 #, reducedFreq
Those are three byte writes to the pointer. So I need the link map to find exactly where things are failing.
But... could this be related to initialization and switching back and forth (described in other report)? The LMIC does not do much checking for invalid state, and since the compiled code is correct, my guess is that some preconditions we are depending on are not true.
In my code I read from NVS storage during boot to decide if I should start the lora 'module' or not. If I have enabled the module it will start normally by calling the required Arduino_LoRaWAN::begin(&pinMap);
function, then calling loop()
as normal. The only other thing I do is test if there is internet connection and then start the lora module if there isn't - the idea being that it can act as a backup communication method. However, there is no runtime switching happening - I have to reboot the device to switch it on/off. I should note that the code works fine with Arduino 1.0.6.
I have sent you an updated email with all output. Let me know if there is anything else you need me to do. Thank you!
To simplify things I have created a simple reproducible example for you which can be found here on the lorawan-test
branch: https://github.com/olicooper/arduino-v2-esp32-platformio/tree/lorawan-test
I will also email you the elf and assembly code.
========= BOOT INFO =========
MAC: FF:FF:FF:FF:FF:FF
BootPart: offset=0x10000 size=0x140000
IDF version: v4.4-dev-2313-gc69f0ec32
Free memory: 241704 bytes
=============================
NetBeginRegionInit
Publishing state
Guru Meditation Error: Core 1 panic'ed (LoadStoreError). Exception was unhandled.
Core 1 register dump:
PC : 0x400d5e97 PS : 0x00060330 A0 : 0x800d5f17 A1 : 0x3ffb25b0
A2 : 0x40013b1e A3 : 0x00000000 A4 : 0x00000000 A5 : 0x000028d8
A6 : 0xfffffffc A7 : 0x00000003 A8 : 0x00000000 A9 : 0x3ffc0f94
A10 : 0x00000010 A11 : 0x00000016 A12 : 0x00000000 A13 : 0x00040df7
A14 : 0x63ffffff A15 : 0x00000003 SAR : 0x0000001b EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e LBEG : 0x400d5e48 LEND : 0x400d5ea6 LCOUNT : 0x00000004
Backtrace:0x400d5e94:0x3ffb25b00x400d5f14:0x3ffb25e0 0x400d5aeb:0x3ffb26e0 0x400d5bbe:0x3ffb2700 0x400d5c00:0x3ffb2720 0x400d1855:0x3ffb2740 0x400d2d89:0x3ffb2760 0x400d3086:0x3ffb2780 0x400d40bc:0x3ffb27a0 0x400d5d3b:0x3ffb27c0 0x400d14a9:0x3ffb27e0 0x400d15fe:0x3ffb2800 0x400d7900:0x3ffb2820
#0 0x400d5e94:0x3ffb25b00 in Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:369
(inlined by) Arduino_LoRaWAN::SessionChannelMask_EU_like<16u, 4u>::setFrequency(unsigned char (&) [48], unsigned int, unsigned int) at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/Arduino_LoRaWAN.h:359
(inlined by) Arduino_LoRaWAN::BuildSessionState(Arduino_LoRaWAN::SessionState_u&) const at .pio/libdeps/heltec_wireless_stick/MCCI Arduino LoRaWAN Library/src/lib/arduino_lorawan_sessionstate.cpp:114
I know you said that the problem wasn't what you originally thought, but I tried the modifications you suggested above https://github.com/mcci-catena/arduino-lorawan/issues/181#issuecomment-944688542 and it fixed the issue!.. so I have updated the lmic code here https://github.com/olicooper/arduino-lorawan/tree/arduino-v2-fix and I will try the changes in my main project as I haven't fully tested it yet. Great work! What's the next step? Is it possible for a patch be written to your library to help in the interim if this is the culprit?
Well, there's no real problem with a patch, but... it looks like a compiler issue to me, and should be brought to their attention, as it seems unlikely that this would only affect my code (out of the whole world). First, however, I can believe, especially as this is a beta project. Since the projects are all open source, it should not be necessary to do the usual heroic efforts needed to demo a compiler issue to the maintainers. (With global optimization, anyway, tomorrow they could apply the same change globally and crash the library again.)
Hi @olicooper and @terrillmoore, I'd like to point out one thing from the original crash report:
Guru Meditation Error: Core 1 panic'ed (LoadStoreError). Exception was unhandled.
Core 1 register dump:
PC : 0x400d5e97 PS : 0x00060330 A0 : 0x800d5f17 A1 : 0x3ffb25b0
A2 : 0x40013b1e A3 : 0x00000000 A4 : 0x00000000 A5 : 0x000028d8
A6 : 0xfffffffc A7 : 0x00000003 A8 : 0x00000000 A9 : 0x3ffc0f94
A10 : 0x00000010 A11 : 0x00000016 A12 : 0x00000000 A13 : 0x00040df7
A14 : 0x63ffffff A15 : 0x00000003 SAR : 0x0000001b EXCCAUSE: 0x00000003
EXCVADDR: 0x40013b1e LBEG : 0x400d5e48 LEND : 0x400d5ea6 LCOUNT : 0x00000004
Here EXCVADDR points to the memory address which the load or store instruction has tried to access. On the ESP32, 0x40013b1e is an address inside ROM, in the middle of .text section. It looks very odd that the code of this library would be performing a load or especially store from this address. It doesn't look like an alignment issue, because even if the alignment was correct, it wouldn't make sense to access (read or write) this ROM location. To me it looks like memory corruption, where some pointer gets overwritten by a garbage value, and then the value gets dereferenced.
@olicooper since you have a reproducer for this issue, could you please upload the ELF file without the workaround and the matching crash dump? I am sorry that we aren't able to reproduce based on your example, because it requires PlatformIO and seemingly a LoRa module...
@igrr I mentioned on the other ticket that this should be the correct ELF, but I will recompile the elf again against the Arduino main
branch to be sure.
You are correct that you'd need a LoRa module to execute this code which does make it more challenging.
The odd thing about it is that I can compile the exact same code on Arduino v1.0.6 and it works perfectly. Switching to v2.0.0 (and subsequently IDF 4.3) cases the crash.
Thanks, I have checked the ELF you posted, and like @terrillmoore has observed here I don't see anything wrong with code generation, at least as far as alignment is concerned.
The PC at the point of the exception points to:
0x400d5e97 <+319>: s8i a4, a2, 0
which is a store instruction, which should most definitely not be trying to write to an area in ROM. I have tried tracing where the corrupted value could come from by looking at the assembly code, but didn't get too far unfortunately.
A compiler bug is a rare occasion but if there is a strong hint that one is present we will definitely investigate this. However, at the moment it looks more like heap corruption. Upgrading from one Arduino-esp32 and IDF release to another can change many things, including the layout of objects in heap. This might cause, for example, an out-of-bounds write to show up in such a way, while previously it could be corrupting some padding space or a mostly-unused variable.
Thanks for posting the two ELF files, I will definitely look more into the differences between them.
@terrillmoore Quick update, I have tested against an the v4.0.2 release of PlatformIO today. It still seems to be broken, but I get an assert issue instead of a kernel panic now.. https://github.com/espressif/arduino-esp32/issues/5783#issuecomment-1116121168
The assert that is failing is this: https://github.com/mcci-catena/arduino-lmic/blob/8d378ea410887d3fb08ea2c9acce95cc4c047788/src/lmic/radio.c#L1164
The assert issue is due to calling radio_rand1()
before the buffer is initialized. Not sure why that might be happening.
The library worked fine on Arduino 1.0.6, but after upgrading to v2.0.0 the device reboots on the next loop cycle after calling
Arduino_LoRaWAN::SendBuffer()
. I have aLora
wrapper class which simply callsArduino_LoRaWAN::loop()
- it is failing at this point.environment:
mem-optimized
branch)code
// ... then 'basicStatePublishJob' calls SendBuffer SendBuffer(dataBuffer, dataBuffer_s, pDoneFn, (void )this, /confirmed/ 0, /port*/ 1);
Related source code
https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L190
https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/lib/arduino_lorawan_sessionstate.cpp#L114
https://github.com/mcci-catena/arduino-lorawan/blob/caceeb49b1f5520d11668aeb7c587ef1a96ca9ad/src/Arduino_LoRaWAN.h#L359-L371