sipeed / M1s_BL808_example

Un-handled Exception on CPU 2 when running blai_mnist_demo #5

Open mysticalzero opened 1 year ago

mysticalzero commented 1 year ago

To reproduce the issue on the m1sdock board (with camera module and lcd):

  1. Run ./build.sh blai_mnist_demo
  2. Flash the generated bin file onto the m1sdock and copy the mnist.blai file to models/ on the flash (accessible when connected to the OTG USB port)
  3. Press reset and watch the serial output from the c906. You should see the following:
    
    Starting bl808 now....
    Heap Info: 63338 KB @ [0x0x0000000050225504 ~ 0x0x0000000054000000]
    [OS] Starting aos_loop_proc task...
    [OS] Start c906 xram handle...
    [OS] Starting OS Scheduler...
    init ring:0,tx:0x0000000022020140,rx:0x0000000000000000
    init ring:2,tx:0x0000000022021340,rx:0x0000000022020340
    init ring:3,tx:0x0000000022022540,rx:0x0000000022022340
    init ring:4,tx:0x0000000022022840,rx:0x0000000022022740
    init ring:5,tx:0x0000000000000000,rx:0x0000000000000000
    Init CLI with event Driven

    Un-handled Exception on CPU 2: cause: 7, tval = 22022548, epc = 501027ae

    x01 = 50128d04 x02 = 50232e30 x03 = a5a5a5a5a5a5a5a5 x04 = 404040404040404
    x05 = 4 x06 = f x07 = 707070707070707 x08 = 4
    x09 = 4 x10 = 22022548 x11 = 50232ebc x12 = 0
    x13 = 22022548 x14 = 1 x15 = 0 x16 = 50141c90
    x17 = 50141c96 x18 = 22022540 x19 = 50232eb8 x20 = 22020000
    x21 = 22022548 x22 = 4 x23 = 2323232323232323 x24 = 2424242424242424
    x25 = 2525252525252525 x26 = 2626262626262626 x27 = 2727272727272727 x28 = 19
    x29 = 50141e5e x30 = 50141ef0 x31 = 4
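
For anyone reading the exception line above, here is a minimal Python sketch that decodes it on the host. The cause table is the list of synchronous exception codes from the RISC-V privileged specification. The function name `decode_exception` is hypothetical, and the sketch assumes the handler prints `cause` in decimal and `tval`/`epc` in hex without a prefix, as the log suggests.

```python
import re

# Synchronous exception codes (mcause with the interrupt bit clear),
# as listed in the RISC-V privileged specification.
RISCV_EXCEPTION_CAUSES = {
    0: "Instruction address misaligned",
    1: "Instruction access fault",
    2: "Illegal instruction",
    3: "Breakpoint",
    4: "Load address misaligned",
    5: "Load access fault",
    6: "Store/AMO address misaligned",
    7: "Store/AMO access fault",
    8: "Environment call from U-mode",
    9: "Environment call from S-mode",
    11: "Environment call from M-mode",
    12: "Instruction page fault",
    13: "Load page fault",
    15: "Store/AMO page fault",
}

# Assumed line format, copied from the serial output in this issue.
EXCEPTION_LINE = re.compile(
    r"cause:\s*(\d+),\s*tval\s*=\s*([0-9a-fA-F]+),\s*epc\s*=\s*([0-9a-fA-F]+)"
)


def decode_exception(line):
    """Parse one 'Un-handled Exception' line into cause, tval and epc."""
    match = EXCEPTION_LINE.search(line)
    if match is None:
        raise ValueError("not an exception line: %r" % line)
    cause = int(match.group(1))  # assumption: printed in decimal
    return {
        "cause": cause,
        "description": RISCV_EXCEPTION_CAUSES.get(cause, "reserved/unknown"),
        "tval": int(match.group(2), 16),  # faulting address
        "epc": int(match.group(3), 16),   # address of the faulting instruction
    }


line = "Un-handled Exception on CPU 2: cause: 7, tval = 22022548, epc = 501027ae"
info = decode_exception(line)
print(info["description"], hex(info["tval"]), hex(info["epc"]))
# prints: Store/AMO access fault 0x22022548 0x501027ae
```

So the core faulted on a store to 0x22022548 while executing the instruction at 0x501027ae.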



When I use the `blai_mnist_demo.bin` file from here (https://dl.sipeed.com/shareURL/MAIX/M1s/M1s_Dock/7_Firmware/demo_bin/blai_mnist_demo), it works fine. The issue apparently only occurs with the binary built from this repository.

mysticalzero commented 1 year ago

I'm investigating the issue over JTAG with the DebugServerConsole program. I got gdb-multiarch running and was able to connect to the debug server. The crash happens during start-up, so by the time I connect to the board with gdb, it is already in a crashed state:

    (gdb) bt
    #0  exception_handler_default (cause=<optimized out>, val=<optimized out>, regs=0x50232c20) at /home/ubuntu/bl808/M1s_BL808_SDK/components/platform/soc/bl808/bl808/evb/src/interrupt.c:65
    #1  0x00000000501037ee in trap_c (cause=7, regs=0x50232c20) at /home/ubuntu/bl808/M1s_BL808_SDK/components/platform/soc/bl808/bl808/evb/src/interrupt.c:120
    #2  0x0000000050100620 in exception_common () at /home/ubuntu/bl808/M1s_BL808_SDK/components/platform/soc/bl808/bl808/evb/src/boot/gcc/vectors.S:640
    (gdb) info thread
      Id   Target Id         Frame
    * 1    Thread 1 (CPU#0)  exception_handler_default (cause=<optimized out>, val=<optimized out>, regs=0x50232c20) at /home/ubuntu/bl808/M1s_BL808_SDK/components/platform/soc/bl808/bl808/evb/src/interrupt.c:65

So I set a breakpoint in gdb on bfl_main() [from M1s_BL808_SDK/components/sipeed/c906/m1s_start/src/start_main.c], then switched to the console for the e907 and ran halt_cpu0 followed by release_cpu0. On the c906 console I can see that the c906 restarted and crashed as before, but none of the breakpoints I set were triggered. Is that because halt_cpu0 and release_cpu0 reset the hardware breakpoints? I also tried numerous other breakpoints that should certainly be hit based on my understanding of the SDK code, but to no avail.

What is the best way to debug this? I looked for relevant documentation but couldn't find any. Any suggestions on how best to proceed from here would be greatly appreciated.

ZoneMR commented 1 year ago

I'm seeing this exact issue too.

@taorye This seems to happen in xram_ring_write: the memcpy call causes the exception when writing to 0x22022548.

The precompiled .bin seems to write to the same location without issue, so what could be different? Why would a write to the ring buffer cause a Store/AMO access fault only when the firmware is built from source?
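
To cross-check that the faulting address really belongs to an XRAM ring, here is a small Python sketch that parses the `init ring:` lines from the boot log in the first comment and reports the nearest ring base at or below the address. The helper names (`ring_bases`, `nearest_base`) are hypothetical. The log does not print buffer sizes, so this only identifies the closest base and the offset from it, not whether the address lies inside the buffer.

```python
import re

# "init ring" lines copied from the c906 serial output in this issue.
BOOT_LOG = """\
init ring:0,tx:0x0000000022020140,rx:0x0000000000000000
init ring:2,tx:0x0000000022021340,rx:0x0000000022020340
init ring:3,tx:0x0000000022022540,rx:0x0000000022022340
init ring:4,tx:0x0000000022022840,rx:0x0000000022022740
init ring:5,tx:0x0000000000000000,rx:0x0000000000000000
"""

RING_LINE = re.compile(
    r"init ring:(\d+),tx:0x([0-9a-fA-F]+),rx:0x([0-9a-fA-F]+)"
)


def ring_bases(log):
    """Return sorted (address, ring, direction) tuples, skipping null bases."""
    bases = []
    for ring, tx, rx in RING_LINE.findall(log):
        for direction, addr in (("tx", int(tx, 16)), ("rx", int(rx, 16))):
            if addr != 0:
                bases.append((addr, int(ring), direction))
    return sorted(bases)


def nearest_base(bases, addr):
    """Find the highest ring base that is <= addr and the offset from it."""
    below = [entry for entry in bases if entry[0] <= addr]
    if not below:
        return None
    base, ring, direction = below[-1]
    return ring, direction, addr - base


result = nearest_base(ring_bases(BOOT_LOG), 0x22022548)
print(result)  # prints: (3, 'tx', 8)
```

The address is 8 bytes past the tx base of ring 3 (0x22022540), which is also the value in x18 in the register dump, consistent with the fault occurring during a ring write.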

ZoneMR commented 1 year ago

It seems this issue is related to a change in newlib from May 2022, which is included in recent versions of xuantie-gnu-toolchain.

I was building the most recent toolchain on macOS, and had to revert this commit to avoid the memcpy access fault writing to the XRAM ring buffer:

https://github.com/T-head-Semi/newlib/commit/ec0c0afa59993b3727958964b33753f62c410d39