nusrobomaster / PX4-Autopilot

PX4 Autopilot Software
https://px4.io
BSD 3-Clause "New" or "Revised" License

Hard fault when running commands through usbnsh #2

Closed chengguizi closed 3 years ago

chengguizi commented 3 years ago

This happens on both the Dev A and Dev C boards.

Tested on commit https://github.com/nusrobomaster/PX4-Autopilot/commit/16c13a87701af4dc60e1e0041414d5545a2cec4f

The program runs into the Hard Fault IRQ and reboots shortly after. It can always be reproduced by running the following in nsh:

wqueue_test start
work_queue status
work_queue status

Basically, printing the status twice always triggers a hard fault.

Besides the hard fault, there are also semaphore-related exceptions in which a debug assertion fails. The problem seems to be that some logic is running in IRQ context when it should not be. The exact cause is hard to track down.

chengguizi commented 3 years ago

Tips for debugging:

  1. Use qconfig instead of menuconfig, as it supports searching config names globally.
  2. The hard fault message is produced as a raw dump, so it needs interpretation. See here.

Basically, we can run objdump -S on the compiled .elf file to get the disassembly, then look at the IRQ and user stack dumps to work out the chain of function calls at the moment the hard fault happened. This has been very helpful.
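Part of the manual work of reading the stack dump can be sketched in code. The snippet below is only an illustration, assuming the STM32 internal flash is mapped at 0x08000000 and using a placeholder flash size of 1 MiB; the function name find_code_addresses and both macros are hypothetical and do not come from PX4 or NuttX. It picks out the stack words that point into flash, since those are the candidates worth looking up in the objdump -S listing.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: internal flash starts at 0x08000000. The 1 MiB size is a
 * placeholder and should be adjusted to the actual part.
 */

#define DUMP_FLASH_BASE 0x08000000u
#define DUMP_FLASH_SIZE 0x00100000u

/* Copy every stack word that points into flash to out[], with bit 0 (the
 * Thumb bit) cleared so the value can be matched against the addresses
 * printed by objdump. Returns the number of candidates written, which is
 * never more than out_len.
 */

size_t find_code_addresses(const uint32_t *stack, size_t n,
                           uint32_t *out, size_t out_len)
{
  size_t found = 0;

  for (size_t i = 0; i < n && found < out_len; i++)
    {
      uint32_t word = stack[i];

      if (word >= DUMP_FLASH_BASE && word - DUMP_FLASH_BASE < DUMP_FLASH_SIZE)
        {
          out[found++] = word & ~1u;
        }
    }

  return found;
}
```

Not every hit is a real return address, because stale values can remain on the stack, so each candidate still has to be checked against the disassembly.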

chengguizi commented 3 years ago

I have realised that it has a lot to do with CONFIG_NSH_ARCHINIT=y.

Previously, this config was not enabled, so board_app_initialize() was not run during nsh startup.

The commit can be found at https://github.com/nusrobomaster/PX4-Autopilot/commit/bf6938272227e438e35c596c6c03bbb03dbdde80 .
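The effect of the option can be modelled in a few lines. This is a simplified sketch and not the actual NSH code: model_nsh_startup and the call counter are hypothetical, the #define stands in for the value that would normally come from the NuttX configuration, and board_app_initialize here is a stub that only shares its name and signature with the real board hook.

```c
#include <stdint.h>

/* Stand-in for the Kconfig symbol. Remove this line to model the earlier
 * configuration in which the option was not enabled.
 */

#define CONFIG_NSH_ARCHINIT 1

static int g_board_init_calls = 0;

/* Stub with the same name and signature as the NuttX board hook. It only
 * counts how often it is called.
 */

int board_app_initialize(uintptr_t arg)
{
  (void)arg;
  g_board_init_calls++;
  return 0;
}

/* Hypothetical model of the NSH start-up step: the board hook is only
 * reached when CONFIG_NSH_ARCHINIT is defined. Returns the number of
 * times the hook has run so far.
 */

int model_nsh_startup(void)
{
#ifdef CONFIG_NSH_ARCHINIT
  (void)board_app_initialize(0);
#endif

  return g_board_init_calls;
}
```

With the define removed, the model returns 0, which matches the earlier behaviour where the board initialisation was skipped entirely.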

Brief booting sequence:

  1. stm32_start.c 's __start():
    • configure clock, initial UART setup, GPIO remap (progress 'A')
    • move initialised data from FLASH to SRAM (progress 'B' and 'C')
    • stm32_boardinitialize() (progress 'F')
    • nx_start() (this is persistent and never returns)

stm32_boardinitialize() contains the logic that resets the pins to their initial values and configures the LEDs.

/************************************************************************************
 * Name: stm32_boardinitialize
 *
 * Description:
 *   All STM32 architectures must provide the following entry point.  This entry point
 *   is called early in the initialization -- after all memory has been configured
 *   and mapped but before any devices have been initialized.
 *
 ************************************************************************************/

nx_start() boot sequence

It performs initialisation of semaphores, the POSIX clock, timers, signals, message queues, pthreads, the C library, etc. During this, up_initialize() and board_early_initialize() (if CONFIG_BOARD_EARLY_INITIALIZE is set) are also called.

Then it runs nx_bringup(), which starts the program selected by CONFIG_USER_ENTRYPOINT.

nx_create_initthread() creates the init thread, separate from the idle thread (CONFIG_INIT_ENTRYPOINT).

It runs board_late_initialize() if CONFIG_BOARD_LATE_INITIALIZE is set; otherwise the entry point starts right after the sinfo("Starting init thread\n"); printout.

Interestingly, if CONFIG_BOARD_LATE_INITIALIZE is set, board_late_initialize() runs on a separate thread (a more robust full thread, rather than the idle thread).

The rest of the story lies in nsh_main(), which runs in a separate thread:

  pid = nxtask_create("init", CONFIG_USERMAIN_PRIORITY,
                      CONFIG_USERMAIN_STACKSIZE,
                      (main_t)CONFIG_USER_ENTRYPOINT,
                      (FAR char * const *)NULL);
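
The boot order described above can be summarised as a small model. This is an illustration only: model_boot and trace_step are hypothetical helpers, none of the steps call real NuttX code, and the order simply follows the description in this comment. The two flags stand in for CONFIG_BOARD_EARLY_INITIALIZE and CONFIG_BOARD_LATE_INITIALIZE.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Append one step name to the trace, separated by " > ". The remaining
 * space is recomputed for each call so the buffer is never overrun.
 */

static void trace_step(char *trace, size_t len, const char *name)
{
  if (trace[0] != '\0')
    {
      strncat(trace, " > ", len - strlen(trace) - 1);
    }

  strncat(trace, name, len - strlen(trace) - 1);
}

/* Write the modelled boot order into trace. No real initialisation is
 * performed; only the names of the steps are recorded.
 */

void model_boot(bool early_init, bool late_init, char *trace, size_t len)
{
  if (len == 0)
    {
      return;
    }

  trace[0] = '\0';
  trace_step(trace, len, "__start");
  trace_step(trace, len, "stm32_boardinitialize");
  trace_step(trace, len, "nx_start");
  trace_step(trace, len, "up_initialize");

  if (early_init)
    {
      trace_step(trace, len, "board_early_initialize");
    }

  trace_step(trace, len, "nx_bringup");

  if (late_init)
    {
      trace_step(trace, len, "board_late_initialize");
    }

  trace_step(trace, len, "CONFIG_USER_ENTRYPOINT");
}
```

Calling model_boot(false, true, buf, sizeof(buf)) with a 256-byte buffer yields the chain from __start to the user entry point with board_late_initialize inserted just before it.
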
chengguizi commented 3 years ago

nsh_main()

It resides in platforms/nuttx/NuttX/apps/system/nsh/nsh_main.c

The boot sequence:

nsh_initialize()

boot sequence:

chengguizi commented 3 years ago

Update on the issue:

There seem to be two separate sources of errors:

  1. The assertion error occurs when CONFIG_DEBUG_ASSERTIONS is enabled, CONFIG_PRIORITY_INHERITANCE is turned on and CONFIG_SEM_PREALLOCHOLDERS is 0.

The fix is simple: turn off CONFIG_DEBUG_ASSERTIONS and leave the other two settings as they are. This assertion seems unnecessary; see the details in issue https://github.com/nusrobomaster/PX4-Autopilot/issues/6

  2. The configuration actually responsible for the hard fault seems to be CONFIG_ARCH_FPU, and perhaps the missing CONFIG_PRIORITY_INHERITANCE config.

I have not tried to reproduce the hard fault. However, with priority inheritance properly turned on and debug assertions turned off, everything seems to work as expected.

This can be verified with commit https://github.com/nusrobomaster/PX4-Autopilot/commit/2a5a6339d15e633b93a4563daf5a814dd622af66
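
The combination behind the assertion error (item 1) could be caught at build time with a guard like the one below. This is a sketch under assumptions: the three #define lines stand in for values that would normally come from the generated NuttX configuration and reproduce the failing setup, while SEM_ASSERT_COMBO and sem_assert_combo_enabled are hypothetical names that do not exist in PX4 or NuttX.

```c
/* Stand-ins for the generated configuration, set to the combination that
 * triggered the assertion. In a real build these come from the board
 * configuration and must not be defined by hand.
 */

#define CONFIG_DEBUG_ASSERTIONS     1
#define CONFIG_PRIORITY_INHERITANCE 1
#define CONFIG_SEM_PREALLOCHOLDERS  0

/* Hypothetical guard: evaluates to 1 only when all three conditions from
 * item 1 hold at the same time.
 */

#if defined(CONFIG_DEBUG_ASSERTIONS) && \
    defined(CONFIG_PRIORITY_INHERITANCE) && \
    CONFIG_SEM_PREALLOCHOLDERS == 0
#  define SEM_ASSERT_COMBO 1
#else
#  define SEM_ASSERT_COMBO 0
#endif

/* Small accessor so the result can also be checked at run time. */

int sem_assert_combo_enabled(void)
{
  return SEM_ASSERT_COMBO;
}
```

Removing the CONFIG_DEBUG_ASSERTIONS stand-in, which mirrors the fix above, makes the guard evaluate to 0.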

Closing this issue for now.