Help needed: table_zero crash during start up

mikewolfram commented 4 years ago

Hi,

I need some help figuring out what's going wrong on my L431. I have a class with some larger data members. During startup there is a crash and the PC goes to 0xfffffff9. What could be the reason, and how can I troubleshoot it? I'm compiling a debug build with -O0; code size is 84kB, data size 12kB.

Actually it is a protocol handler for SECS messages, here are the failing data members:

struct Message {
    enum class FormatCode : uint8_t {
        List = 0,
        Binary = 0x20,
        Boolean = 0x24,
        Ascii = 0x40,
        Jis8 = 0x44,
        Unicode16 = 0x48,
        Signed8Byte = 0x60,
        Signed1Byte = 0x64,
        Signed2Byte = 0x68,
        Signed4Byte = 0x70,
        Float8Byte = 0x80,
        Float4Byte = 0x90,
        Unsigned8Byte = 0xa0,
        Unsigned1Byte = 0xa4,
        Unsigned2Byte = 0xa8,
        Unsigned4Byte = 0xb0,
    };
    static constexpr uint8_t i(FormatCode format) { return uint8_t(format); }

    struct Header {
        uint16_t deviceId;
        uint8_t streamId;
        uint8_t functionId;
        uint16_t blockNummber;
        uint32_t systemBytes;
    } __attribute__((packed));

    // some methods...
    uint8_t length;
    using Data = std::array<uint8_t, 256>;
    Data data;
    uint16_t crc;
};

class ProtocolHandler : public modm::pt::Protothread
{
public:
    // ...
private:
    // ...
    modm::atomic::Queue<Message, 10> rxQueue;
    modm::atomic::Queue<Message, 10> txQueue;
    // ...
};

When I reduce the size of the queue to 1 it runs without issues.

salkinium commented 4 years ago

Perhaps you're accessing memory that's not enabled yet? I remember the L4 having some pretty advanced power features, where you could disable power to parts of the SRAM.

The .table_zero section is assembled by the linkerscript, perhaps you can move it (=the bss section, not the table section) around and see what else fails?

salkinium commented 4 years ago

Note that you can use the :platform:fault module for post-mortem debugging, or set a breakpoint on HardFault_Handler. Unfortunately I didn't yet get to write a GDB plugin that interprets the hardfault registers to give you a fault description.

salkinium commented 4 years ago

Also double check the linkerscript memory layout: does the SRAM start at the right address, is it the right size, are the SRAM123 blocks one continuous block?

chris-durand commented 4 years ago

This might sound stupid, but did you accidentally put the data on the stack, e.g. by declaring a ProtocolHandler variable inside a function? The 5k data would exceed the default 3096 byte stack size and crash the program.
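The "5k" estimate can be checked with a bit of sizeof arithmetic. This is only a minimal sketch: `FixedQueue` and `Handler` are hypothetical stand-ins that store their elements inline, not the real modm::atomic::Queue or ProtocolHandler (whose layout may add indices or a spare slot), and the numbers assume a typical ABI where uint16_t is 2-byte aligned.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Only the data members of Message from the first post; methods omitted.
struct Message {
    uint8_t length;
    std::array<uint8_t, 256> data;
    uint16_t crc;
};

// Hypothetical stand-in for a fixed-capacity queue with inline storage.
// The real modm::atomic::Queue may be slightly larger than this.
template <typename T, std::size_t N>
struct FixedQueue {
    T buffer[N];
};

// Hypothetical stand-in for the two queue members of ProtocolHandler.
struct Handler {
    FixedQueue<Message, 10> rxQueue;
    FixedQueue<Message, 10> txQueue;
};

// 1 + 256 bytes, one padding byte so crc is 2-byte aligned, then 2 bytes.
static_assert(sizeof(Message) == 260, "assumes 2-byte alignment of uint16_t");

// Bytes needed by both queues together.
constexpr std::size_t kQueueBytes = sizeof(Handler);
static_assert(kQueueBytes == 5200, "2 queues * 10 messages * 260 bytes");
```

Under these assumptions the two queues alone need about 5.2 kB, which is why a function-local ProtocolHandler cannot fit on a main stack of roughly 3 kB.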

mikewolfram commented 4 years ago

> This might sound stupid, but did you accidentally put the data on the stack, e.g. by declaring a ProtocolHandler variable inside a function? The 5k data would exceed the default 3096 byte stack size and crash the program.

Indeed, I did this earlier this morning. But that was rather easy to figure out. I have now moved the lines to just before the main() function to make them global.

I modified the table_zero() function and have a conditional breakpoint. Once dest is 0x20002bf8 it suddenly becomes 0. Looking further...

mikewolfram commented 4 years ago

Strange, the variable dest of the table_zero() function is at address 0x20002bf8. So this explains why it suddenly becomes 0. I was expecting it to be on the stack, which is 0x20000000 to 0x20000bec.

// Zeros the section defined by a table of {start, end}
static inline void
table_zero(const uint32_t *const start, const uint32_t *const end)
{
    uint32_t **table = (uint32_t **)start;
    while (table < (uint32_t **)end)
    {
        uint32_t *dest = table[0];  // destination start
        while (dest < table[1])     // destination end
            *(dest++) = 0;
        table += 2;
    }
}
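To make the table format concrete, here is a host-side sketch of the same loop. It is not the modm startup code: `table_zero_demo`, `section_a`, `section_b` and `zero_demo_sections` are hypothetical names, and the table is built by hand instead of by the linkerscript.

```cpp
#include <cassert>
#include <cstdint>

// Host-side stand-in for the startup routine: the table is a flat list of
// {start, end} pointer pairs, and every word in [start, end) is cleared.
static void table_zero_demo(uint32_t* const* table, uint32_t* const* table_end)
{
    assert((table_end - table) % 2 == 0);  // entries always come in pairs
    while (table < table_end)
    {
        uint32_t* dest = table[0];          // first word of the section
        uint32_t* const end = table[1];     // one past the last word
        while (dest < end)
            *dest++ = 0;
        table += 2;
    }
}

// Hypothetical "sections"; on the target these ranges come from the linkerscript.
static uint32_t section_a[4] = {1, 2, 3, 4};
static uint32_t section_b[2] = {5, 6};

// Zeroes both sections and returns how many words are still non-zero.
int zero_demo_sections()
{
    uint32_t* const table[] = {
        section_a, section_a + 4,
        section_b, section_b + 2,
    };
    table_zero_demo(table, table + 4);

    int nonzero = 0;
    for (uint32_t word : section_a) nonzero += (word != 0);
    for (uint32_t word : section_b) nonzero += (word != 0);
    return nonzero;
}
```

Note that `dest` is an ordinary local variable. If the stack it lives on happens to sit inside one of the ranges being cleared, the loop overwrites its own pointer, which matches the "suddenly becomes 0" observation.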

mikewolfram commented 4 years ago

The stack pointer is initialized to 0x20002c20 when I set a breakpoint in Reset_Handler. But __main_stack_top is at 0x20000be0. Any ideas?
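The addresses reported so far can be lined up with simple arithmetic. This is a sketch using only values quoted in this thread; the constant names are made up for illustration.

```cpp
#include <cstdint>

constexpr uint32_t kMainStackTop = 0x20000be0;  // __main_stack_top from the linkerscript
constexpr uint32_t kObservedSp   = 0x20002c20;  // sp seen at the Reset_Handler breakpoint
constexpr uint32_t kDestAddress  = 0x20002bf8;  // address of `dest` in table_zero()

// `dest` is not inside the intended stack, which ends at __main_stack_top.
static_assert(kDestAddress > kMainStackTop, "dest lies above the intended stack");

// It sits just below the sp value that was actually in use.
constexpr uint32_t kBytesBelowSp = kObservedSp - kDestAddress;
static_assert(kBytesBelowSp == 0x28, "dest is 40 bytes below the observed sp");
```

So `dest` is where a stack growing down from 0x20002c20 would place it, and that address lies outside the region the linkerscript reserved for the stack.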

mikewolfram commented 4 years ago

Ok, added this as the first instruction in the Reset_Handler and everything works fine:

ldr sp, =__main_stack_top

salkinium commented 4 years ago

WAT?!?

The Cortex-M loads the stack pointer in hardware from the first entry in the vector table. This should not be necessary and why would this become a problem by just adding a bunch of static data?

Lemme check if the sp is actually __main_stack_top on the hardware I have lying around here.

mikewolfram commented 4 years ago

I had a look into STM32 examples and the first thing they do is to load sp with the stack pointer. Until now I had the same idea, that the hardware would initialize sp itself.

Also, now that I set sp, my application runs just fine. Before, I was facing another issue where I could receive and answer only a single message; the second message caused the system to lock up.

salkinium commented 4 years ago

Could this be an issue with -O0 vs -Os? I remember the compiler using the stack a lot more in -O0.

If it fixes the issue, then let's put the instruction there, it can't hurt #FamousLastWords.

salkinium commented 4 years ago

This is very weird, can you check that the first 4 bytes of your binary (scons bin) are __main_stack_top?

mikewolfram commented 4 years ago

Yes, it is.

─> hexdump -C -n8 rfid_controller.bin 
00000000  e0 0b 00 20 c1 dc 00 08                           |... ....|
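For readers decoding the dump by hand: the vector table is stored little-endian, so the eight bytes form two 32-bit words. A small sketch follows; `read_le32` and the constant names are illustrative helpers, not part of modm.

```cpp
#include <cstdint>

// Assemble a little-endian 32-bit word from four consecutive bytes.
constexpr uint32_t read_le32(const uint8_t* p)
{
    return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
           (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// First eight bytes of rfid_controller.bin, copied from the hexdump above.
constexpr uint8_t kImage[8] = {0xe0, 0x0b, 0x00, 0x20, 0xc1, 0xdc, 0x00, 0x08};

constexpr uint32_t kInitialSp   = read_le32(kImage);      // vector table entry 0
constexpr uint32_t kResetVector = read_le32(kImage + 4);  // vector table entry 1

static_assert(kInitialSp == 0x20000be0, "equals __main_stack_top");
static_assert(kResetVector == 0x0800dcc1, "reset handler address in flash");
static_assert((kResetVector & 1u) == 1u, "bit 0 is set, marking Thumb code");
```

The first word is 0x20000be0, so the binary itself carries the expected initial stack pointer.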

A Google search suggests that setting SP is a requirement, e.g.: http://www.keil.com/support/man/docs/armclang_intro/armclang_intro_tnq1505904718328.htm

salkinium commented 4 years ago

> This example does not apply to Arm®v6‑M and Armv7‑M profiles

Yes, but that was the old ARM7, ARM9 crap, the new stuff explicitly sets the stack pointer, that was a huge selling point.

> I had a look into STM32 examples and the first thing they do is to load sp with the stack pointer.

Because 1) they are idiots #ButNotMe #ImBRILLIANT, 2) they copied their startup code and also their linkerscript code (it literally has old .glue sections in it for mode changes, a non-existent feature on Cortex-M, where you only ever run in Thumb mode) from decades ago without thinking, and finally 3) they are idiots.

Which is why this personally irks me, because they cannot be right, WE CANNOT LET THEM WIN!1!!

mikewolfram commented 4 years ago

Reading the L431 reference manual it is said:

> ..., the CPU fetches the top-of-stack value from address 0x0000 0000, then starts code execution from the boot memory at 0x0000 0004.

In my case I have the 0x2000 2c20 at 0x0000 0000, that would explain the wrong stack pointer. Application in flash starts from 0x0800 0000, where the correct stack pointer is found.

Is it because I'm running from debugger? Or do I need to set some option bit?

salkinium commented 4 years ago

Yeah, but section 2.6 Boot configuration also says just below:

> Boot from main Flash memory: the main Flash memory is aliased in the boot memory space (0x0000 0000), but still accessible from its original memory space (0x0800 0000). In other words, the Flash memory contents can be accessed starting from address 0x0000 0000 or 0x0800 0000.

The question is why would this differ? Is 0x0000'0004 the same value as 0x0800'0004? Cos that would be the Reset Handler.

Is the stack pointer perhaps mistakenly taken from system memory? (0x1FFF 0000) Where does the 0x20002c20 come from?!?

mikewolfram commented 4 years ago

Well, seems to be related to the STLINK. It seems to set the system to boot from system memory. When I compare 0x0000 0000 with 0x1fff 0000 they are the same. The stack pointer is initialized from there.

It looks to me like starting without the debugger maps flash to 0x0000 0000, while running from the debugger maps 0x1fff 0000 to 0x0000 0000. The question now is whether there's a difference between STLINK and OpenOCD. My setup might look a bit strange to you: Eclipse with the STM32Cube plugin on macOS. :-)

mikewolfram commented 4 years ago

Tried with OpenOCD, also mapped 0x1fff 0000 to 0x0000 0000.

salkinium commented 4 years ago

I just tested this on the STM32L476 disco board, and for p/x *0x1fff0000 I get 0x200030d0, which I assume is bigger, cos the bootloader can do more for the larger device.

> Tried with OpenOCD, also mapped 0x1fff 0000 to 0x0000 0000

Hm, interesting. I cannot reproduce your issue with my OpenOCD (~two week old HEAD version) at least not on the L476, I've tried every compile flag too.

salkinium commented 4 years ago

The thing is, it's probably not even simple for the debugger to change the system boot behavior. The RM only talks about flash option bytes, perhaps they have been set to boot from system memory? And then the bootloader times out and just calls 0x0800 0004 without setting the stack pointer?

In that case we still need to add the instruction, because even though the bootloader sucks, we can and should fix it.

mikewolfram commented 4 years ago

I could imagine that they change the option byte to use the built-in programmer to download the application and then jump to 0x0800 0004. Not sure.

I'm working on a custom board. PH3/BOOT0 is pulled low like on the Nucleo boards.

So let's open a PR to set SP? I wonder why nobody else came across this issue.

salkinium commented 4 years ago

> So let's open a PR to set SP? I wonder why nobody else came across this issue.

Yep, let's do it.

salkinium commented 4 years ago

Can you tell me your option bytes value? p/x *0x1fff7800, should default to 0xffeff8aa (RM Section 3.4.1 Option bytes description). I want to try to reproduce this issue with the system memory on my end, I'm… a little… curious.

mikewolfram commented 4 years ago

Mine is 0xffff f8aa.
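The two option byte values differ in exactly one bit, which a quick XOR shows. This sketch only locates the bit; it deliberately does not name the corresponding field, since that should be looked up in the reference manual for the exact device. The helper and constant names are hypothetical.

```cpp
#include <cstdint>

constexpr uint32_t kDefaultOptionWord  = 0xffeff8aa;  // default quoted from the RM above
constexpr uint32_t kReportedOptionWord = 0xfffff8aa;  // value read on the custom board

// Bits that differ between the two values.
constexpr uint32_t kDifference = kDefaultOptionWord ^ kReportedOptionWord;

// Index of the lowest set bit, or -1 if no bit is set.
constexpr int lowest_set_bit(uint32_t value, int index = 0)
{
    return value == 0 ? -1
         : (value & 1u) ? index
         : lowest_set_bit(value >> 1, index + 1);
}

static_assert(kDifference == 0x00100000, "exactly one bit differs");
static_assert(lowest_set_bit(kDifference) == 20, "the differing bit is bit 20");
```

Bit 20 is cleared in the documented default and set on the board in question.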

salkinium commented 4 years ago

Thanks, this was a wild ride!

mikewolfram commented 4 years ago

> Thanks, this was a wild ride!

You are welcome. Sometimes you just can't think stupidly enough to anticipate what actually happens... :-D

mikewolfram commented 4 years ago

As a side note: This morning I added a few lines to my app to print the stack pointer at 0x0000 0000, 0x0800 0000 and 0x1fff 0000 (still on an L431). Running from debugger:

Stack pointer:
0x0000 0000: 20002C20
0x0800 0000: 20000BE0
0x1fff 0000: 20002C20

Without debugger it is like expected:

Stack pointer:
0x0000 0000: 20000BE0
0x0800 0000: 20000BE0
0x1fff 0000: 20002C20
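The comparison behind these two printouts can be written down as a small check. This is a sketch, not the code that produced the output above: `read_word`, `BootAlias` and `classify_boot_alias` are hypothetical names, and `read_word` is only meaningful on the target, where these addresses are mapped.

```cpp
#include <cstdint>

// Target-only helper: reads a 32-bit word from an absolute address.
// Do not call this on a host machine; the addresses are not mapped there.
inline uint32_t read_word(uintptr_t address)
{
    return *reinterpret_cast<const volatile uint32_t*>(address);
}

enum class BootAlias { MainFlash, SystemMemory, Unknown };

// Guesses which memory is aliased at 0x0000 0000 by comparing the first
// vector table word seen there with the first word of flash (0x0800 0000)
// and of system memory (0x1fff 0000). If both candidates hold the same
// word, the answer is ambiguous and MainFlash is reported.
constexpr BootAlias classify_boot_alias(uint32_t wordAtZero,
                                        uint32_t wordInFlash,
                                        uint32_t wordInSystemMemory)
{
    return wordAtZero == wordInFlash          ? BootAlias::MainFlash
         : wordAtZero == wordInSystemMemory   ? BootAlias::SystemMemory
                                              : BootAlias::Unknown;
}

// Values from the two printouts above.
static_assert(classify_boot_alias(0x20002C20, 0x20000BE0, 0x20002C20)
                  == BootAlias::SystemMemory, "running from the debugger");
static_assert(classify_boot_alias(0x20000BE0, 0x20000BE0, 0x20002C20)
                  == BootAlias::MainFlash, "running without the debugger");
```

On the target, the three arguments would be `read_word(0x00000000)`, `read_word(0x08000000)` and `read_word(0x1fff0000)`.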

salkinium commented 4 years ago

It would make sense for the debug adapter to use the bootloader API, so that it doesn't need a software update to program new devices, assuming the API doesn't change too much. It's a good idea, but I'm quite sure that because all the ST examples explicitly set the SP during startup, this issue never showed up in their testing of either the bootloader or the debug adapter.

Bad Software™ is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window, or when you turn on your television. You can feel it when you go to work, when you go to church, when you pay your taxes.

modm-io / modm

Help needed: table_zero crash during start up #393