Abort on restart after maximum resets (Linux w/ mismatched restart request/restart type)

LornDMiller commented 2 years ago

Describe the bug After exceeding the maximum number of unplanned resets allowed per CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS, the system attempts to perform a POR instead of a PROCESSOR reset. Unfortunately this orderly reset fails due to an apparent deadlock and the system eventually times out and calls Abort.

Note that this does not occur when using CFE_ES_ResetCFE, only with CFE_PSP_Restart(CFE_PSP_RST_TYPE_PROCESSOR).

To Reproduce Steps to reproduce the behavior: Modify any app to call CFE_PSP_Restart(CFE_PSP_RST_TYPE_PROCESSOR) on command

I used sample app's noop command Issue the command to trigger the restart and then re-spawn the executable Repeat the restart until the system falls back to a POR reset

Expected behavior Expect a clean POR restart without the 10 second timeout and abort

Code snips

System observed on:

Hardware Intel i7-10870H 64 GB RAM
OS: [e.g. Linux 4.4] Linux -------- 5.18.10-76051810-generic #202207071639~1657252310~21.10~7d5e891 SMP PREEMPT_DYNAMIC Fri J x86_64 x86_64 x86_64 GNU/Linux
Versions [e.g. cFE 6.6, OSAL 4.2, PSP 1.3 for mcp750, any related apps] Latest cFS distribution as of July 28, 2022, modification to sample_app to call CFE_PSP_Restart.

Additional context Stack Trace from running threads at the time of the abort

(gdb) r
Starting program: /home/ldmiller/cFS/build/exe/cpu1/core-cpu1 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
OS_BSP_Initialize():Maximum user msg queue depth = 100
CFE_PSP: Default Reset SubType = 1
CFE_PSP: Default CPU ID = 1
CFE_PSP: Default Spacecraft ID = 66
CFE_PSP: Default CPU Name: cpu1
OS_Posix_GetSchedulerParams():187:Policy 1: available, min-max: 1-99
OS_Posix_GetSchedulerParams():187:Policy 2: available, min-max: 1-99
OS_Posix_TaskAPI_Impl_Init():373:Selected policy 2 for RT tasks, root task = 99
OS_Posix_TaskAPI_Impl_Init():390:Could not setschedparam in main thread: Operation not permitted (1)
[New Thread 0x7ffff7fc1640 (LWP 451164)]
[New Thread 0x7ffff7fbc640 (LWP 451165)]
CFE_PSP: Instantiated software timebase 'cFS-Master' running at 10000 usec
CFE_PSP: Using POSIX monotonic clock as CFE timebase
CFE_PSP: Using MMAP simulated EEPROM implementation
CFE_PSP: Physical RAM access not implemented
CFE_PSP: I/O Port access not implemented
CFE_PSP: EEPROM Range (2) created: Start Address = 7FFFF7B80000, Size = 00080000 Status = 0
CFE_PSP: Normal exit from previous cFE instance
CFE_PSP: Starting the cFE with a PROCESSOR reset.
1980-003-11:57:59.61581 CFE_ES_SetupResetVariables: POWER ON RESET due to max proc resets (Watchdog).
CFE_PSP: Exiting cFE with POWERON Reset status.
CFE_PSP: Critical Data Store Shared memory segment removed
Reset Area Shared memory segment removed
User Reserved Area Shared memory segment removed

Thread 1 "core-cpu1" received signal SIGABRT, Aborted.
__pthread_kill_implementation (no_tid=0, signo=6, threadid=140737353758528) at pthread_kill.c:44
44      pthread_kill.c: No such file or directory.
(gdb) info threads
  Id   Target Id                                      Frame 
* 1    Thread 0x7ffff7fa3740 (LWP 451163) "core-cpu1" __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737353758528) at pthread_kill.c:44
  2    Thread 0x7ffff7fc1640 (LWP 451164) "core-cpu1" __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5555556ec640 <OS_impl_console_table>) at futex-internal.c:57
  3    Thread 0x7ffff7fbc640 (LWP 451165) "core-cpu1" 0x00007ffff7c4323a in __GI___sigtimedwait (set=set@entry=0x5555556ed658 <OS_impl_timebase_table+280>, info=info@entry=0x7ffff7fbbc80, timeout=timeout@entry=0x0)
    at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=140737353758528) at pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=140737353758528) at pthread_kill.c:80
#2  __GI___pthread_kill (threadid=140737353758528, signo=signo@entry=6) at pthread_kill.c:91
#3  0x00007ffff7c42476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007ffff7c287b7 in __GI_abort () at abort.c:79
#5  0x00005555555a2f93 in CFE_PSP_Restart (reset_type=2) at /home/ldmiller/cFS/psp/fsw/pc-linux/src/cfe_psp_support.c:114
#6  0x00005555555730d9 in CFE_ES_SetupResetVariables (StartType=1, StartSubtype=1, BootSource=1) at /home/ldmiller/cFS/cfe/modules/es/fsw/src/cfe_es_start.c:407
#7  0x0000555555572bdc in CFE_ES_Main (StartType=1, StartSubtype=1, ModeId=1, StartFilePath=0x5555555a6271 "/cf/cfe_es_startup.scr") at /home/ldmiller/cFS/cfe/modules/es/fsw/src/cfe_es_start.c:121
#8  0x00005555555a2a84 in OS_Application_Startup () at /home/ldmiller/cFS/psp/fsw/pc-linux/src/cfe_psp_start.c:432
#9  0x00005555555a517e in main (argc=1, argv=0x7fffffffdeb8) at /home/ldmiller/cFS/osal/src/bsp/generic-linux/src/bsp_start.c:238
(gdb) thread 2
[Switching to thread 2 (Thread 0x7ffff7fc1640 (LWP 451164))]
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5555556ec640 <OS_impl_console_table>) at futex-internal.c:57
57      futex-internal.c: No such file or directory.
(gdb) bt
#0  __futex_abstimed_wait_common64 (private=<optimized out>, cancel=true, abstime=0x0, op=393, expected=0, futex_word=0x5555556ec640 <OS_impl_console_table>) at futex-internal.c:57
#1  __futex_abstimed_wait_common (cancel=true, private=<optimized out>, abstime=0x0, clockid=0, expected=0, futex_word=0x5555556ec640 <OS_impl_console_table>) at futex-internal.c:87
#2  __GI___futex_abstimed_wait_cancelable64 (futex_word=futex_word@entry=0x5555556ec640 <OS_impl_console_table>, expected=expected@entry=0, clockid=clockid@entry=0, abstime=abstime@entry=0x0, private=<optimized out>)
    at futex-internal.c:139
#3  0x00007ffff7c9ca3f in do_futex_wait (sem=sem@entry=0x5555556ec640 <OS_impl_console_table>, abstime=0x0, clockid=0) at sem_waitcommon.c:112
#4  0x00007ffff7c9cad8 in __new_sem_wait_slow64 (sem=0x5555556ec640 <OS_impl_console_table>, abstime=0x0, clockid=0) at sem_waitcommon.c:184
#5  0x000055555559abbc in OS_ConsoleTask_Entry (arg=0xc0001) at /home/ldmiller/cFS/osal/src/os/posix/src/os-impl-console.c:97
#6  0x00007ffff7c94947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#7  0x00007ffff7d24a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(gdb) thread 3
[Switching to thread 3 (Thread 0x7ffff7fbc640 (LWP 451165))]
#0  0x00007ffff7c4323a in __GI___sigtimedwait (set=set@entry=0x5555556ed658 <OS_impl_timebase_table+280>, info=info@entry=0x7ffff7fbbc80, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61
61      ../sysdeps/unix/sysv/linux/sigtimedwait.c: No such file or directory.
(gdb) bt
#0  0x00007ffff7c4323a in __GI___sigtimedwait (set=set@entry=0x5555556ed658 <OS_impl_timebase_table+280>, info=info@entry=0x7ffff7fbbc80, timeout=timeout@entry=0x0) at ../sysdeps/unix/sysv/linux/sigtimedwait.c:61
#1  0x00007ffff7c428ac in __GI___sigwait (set=0x5555556ed658 <OS_impl_timebase_table+280>, sig=0x7ffff7fbbd44) at ../sysdeps/unix/sysv/linux/sigwait.c:28
#2  0x000055555559d7f5 in OS_TimeBase_SigWaitImpl (obj_id=524289) at /home/ldmiller/cFS/osal/src/os/posix/src/os-impl-timebase.c:159
#3  0x0000555555598e8f in OS_TimeBase_CallbackThread (timebase_id=524289) at /home/ldmiller/cFS/osal/src/os/shared/src/osapi-timebase.c:422
#4  0x000055555559db03 in OS_TimeBasePthreadEntry (arg=0x80001) at /home/ldmiller/cFS/osal/src/os/posix/src/os-impl-timebase.c:319
#5  0x00007ffff7c94947 in start_thread (arg=<optimized out>) at pthread_create.c:435
#6  0x00007ffff7d24a44 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100
(gdb)

Reporter Info Lorn Miller Red Canyon Engineering & Software

LornDMiller commented 2 years ago

Response from cfs-community mailing list:

This looks like a bug to me - I'd recommend documenting on github as a CFE issue.

CFE_PSP_Restart isn't supposed to be called from the context of the init thread (i.e. that which is running CFE_ES_Main) as it relies on that thread having reached the idle loop. It is only supposed to be called from the spawned tasks.

Joe Hickey

jphickey commented 2 years ago

I was able to confirm this one fairly simply (no code change required!):

start CFE normally (PO mode, default)
CTRL+C (note message about reset count 1/2, shutting down for PROCESSOR restart)
start CFE with --reset PR option (this agrees with last shutdown)
CTRL+C again (note message about reset count 2/2, shutting down for POWERON restart)
start CFE again with --reset PR option (this does NOT agree with last shutdown)
Observe startup failure with abort described here

The first order problem here is that CFE was started in processor reset mode, which did NOT concur with the boot record which requested a poweron reset. This is simply because it was launched with --reset PR and the PSP actually obeyed that request even though the reset area (aka boot record) said it was due for a poweron reset. As a result, the CFE tried to restart itself again via CFE_PSP_Restart().

This seems like a questionable design decision -- somewhat like the age-old idiom of trying the exact same thing over again but expecting a different result. The PSP is did not adhere to the restart type, why would a different result be anticipated by calling CFE_PSP_Restart() again? Seems like a recipe for a boot loop and a non-recoverable system.

My recommendation is that this attempt to restart again should be removed. The fact is that CFE is running, just with the wrong reset type. That is arguably better (and more recoverable by an operator) than a system that has gotten into a boot loop and fails to start at all. Instead, it should just be noted through event reporting that the system started with the wrong reset type.

LornDMiller commented 2 years ago

If we end up in a processor reset loop, perhaps due to some third party "fault protection" type application, I would like to see something make that POR attempt. Perhaps this decision should be deferred to that same third party application, but it was my understanding that CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS was intended to manage this particular functionality.

jphickey commented 2 years ago

The PR/PO reset logic is really outside the scope of CFE, handled by whatever scripts/tools provide the system integration. In the case of "pc-linux", if the CFS is started at boot, this would rely on the wrapper/init script (e.g. systemd unit) doing the right thing - that is, if it is systemd-based then do a full system restart for "poweron" or just restart the CFE service for "processor" reset. This should then pass the right option to the --reset flag when it comes back up.

However when just running on a desktop/command line, none of that actually happens. The PSP will just do the steps according to the passed in --reset option (which is important for testing) in that for PR mode it will not reinitialize the shared mem segments.

To summarize though, YES there should be something in a wrapper/startup script that ensures CFE gets started with --reset PO if there are problems.

But if that isn't there or isn't working, having the possibility of a reboot loop doesn't seem wise.

LornDMiller commented 2 years ago

Works for me.

nasa / cFE

Abort on restart after maximum resets (Linux w/ mismatched restart request/restart type) #2127