Open LornDMiller opened 2 years ago
Response from cfs-community mailing list:
This looks like a bug to me - I'd recommend documenting on github as a CFE issue.
CFE_PSP_Restart isn't supposed to be called from the context of the init thread (i.e. that which is running CFE_ES_Main) as it relies on that thread having reached the idle loop. It is only supposed to be called from the spawned tasks.
Joe Hickey
I was able to confirm this one fairly simply (no code change required!):
--reset PR option (this agrees with last shutdown)
--reset PR option (this does NOT agree with last shutdown)
The first order problem here is that CFE was started in processor reset mode, which did NOT concur with the boot record, which requested a poweron reset. This is simply because it was launched with --reset PR and the PSP actually obeyed that request even though the reset area (aka boot record) said it was due for a poweron reset. As a result, the CFE tried to restart itself again via CFE_PSP_Restart().
This seems like a questionable design decision -- somewhat like the age-old idiom of trying the exact same thing over again but expecting a different result. The PSP did not adhere to the restart type, so why would a different result be anticipated by calling CFE_PSP_Restart() again? Seems like a recipe for a boot loop and a non-recoverable system.
My recommendation is that this attempt to restart again should be removed. The fact is that CFE is running, just with the wrong reset type. That is arguably better (and more recoverable by an operator) than a system that has gotten into a boot loop and fails to start at all. Instead, it should just be noted through event reporting that the system started with the wrong reset type.
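A rough sketch of that recommendation, as a standalone model rather than the actual cFE code: the ResetType enum and the ResetTypeName and ReportStartupMismatch functions are hypothetical names made up for illustration, and printf stands in for whatever event or syslog reporting the real system would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical reset types for this sketch; not the real PSP constants. */
typedef enum
{
    RESET_PROCESSOR,
    RESET_POWERON
} ResetType;

/* Human-readable name for a reset type. */
const char *ResetTypeName(ResetType t)
{
    assert(t == RESET_PROCESSOR || t == RESET_POWERON);
    return (t == RESET_POWERON) ? "POWERON" : "PROCESSOR";
}

/*
 * Compare the reset type the system actually started with against the
 * type recorded in the boot record. On a mismatch, report it and keep
 * running instead of requesting yet another restart.
 * Returns true when a mismatch was reported.
 */
bool ReportStartupMismatch(ResetType started, ResetType recorded)
{
    if (started == recorded)
    {
        return false;
    }

    /* Stand-in for event reporting; the system stays up either way. */
    printf("WARNING: started with %s reset, boot record requested %s\n",
           ResetTypeName(started), ResetTypeName(recorded));
    return true;
}
```

The point of the design is that the mismatch path has no restart call at all, so a PSP that ignores the requested reset type cannot cause a boot loop.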
If we end up in a processor reset loop, perhaps due to some third party "fault protection" type application, I would like to see something make that POR attempt. Perhaps this decision should be deferred to that same third party application, but it was my understanding that CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS was intended to manage this particular functionality.
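For reference, the escalation being described could be modeled along these lines. This is only a sketch under assumptions: NextResetType and its parameters are hypothetical, the limit is passed in as a stand-in for CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS, and the strict "greater than" comparison is assumed from the issue's wording ("after exceeding the maximum"); the real cFE check may differ.

```c
#include <stdint.h>

/* Hypothetical reset types for this sketch; not the real PSP constants. */
typedef enum
{
    RESET_PROCESSOR,
    RESET_POWERON
} ResetType;

/*
 * Decide which reset type to perform next.
 * processor_reset_count: processor resets seen since the last poweron reset.
 * max_processor_resets:  stand-in for CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS.
 * Assumption: escalation happens only once the count exceeds the limit.
 */
ResetType NextResetType(uint32_t processor_reset_count, uint32_t max_processor_resets)
{
    if (processor_reset_count > max_processor_resets)
    {
        return RESET_POWERON;
    }
    return RESET_PROCESSOR;
}
```

With a limit of 2, counts 0 through 2 yield a processor reset and a count of 3 escalates to poweron.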
The PR/PO reset logic is really outside the scope of CFE, handled by whatever scripts/tools provide the system integration. In the case of "pc-linux", if the CFS is started at boot, this would rely on the wrapper/init script (e.g. systemd unit) doing the right thing - that is, if it is systemd-based then do a full system restart for "poweron" or just restart the CFE service for "processor" reset. This should then pass the right option to the --reset flag when it comes back up.
However, when just running on a desktop/command line, none of that actually happens. The PSP will just do the steps according to the passed-in --reset option (which is important for testing), in that for PR mode it will not reinitialize the shared mem segments.
To summarize though, YES there should be something in a wrapper/startup script that ensures CFE gets started with --reset PO if there are problems.
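As an illustration of the decision such a wrapper would make (written in C here only to match the rest of the code base; a real wrapper would more likely be a shell script or systemd unit), something like the following could work. ChooseResetArg and both of its inputs are hypothetical; only the PO and PR values come from the discussion above.

```c
#include <stdbool.h>

/*
 * Pick the value to pass to the --reset option on the next launch.
 * poweron_requested:  the previous run asked for a poweron reset.
 * last_start_failed:  the previous launch did not come up cleanly
 *                     (how this is detected is left to the wrapper).
 * Both inputs are assumptions of this sketch.
 */
const char *ChooseResetArg(bool poweron_requested, bool last_start_failed)
{
    if (poweron_requested || last_start_failed)
    {
        return "PO"; /* full poweron reset: start from a clean state */
    }
    return "PR"; /* processor reset: preserved memory is kept */
}
```

Falling back to PO whenever the previous start failed is what keeps a bad preserved state from being retried indefinitely.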
But if that isn't there or isn't working, having the possibility of a reboot loop doesn't seem wise.
Works for me.
Describe the bug: After exceeding the maximum number of unplanned resets allowed per CFE_PLATFORM_ES_MAX_PROCESSOR_RESETS, the system attempts to perform a POR instead of a PROCESSOR reset. Unfortunately this orderly reset fails due to an apparent deadlock, and the system eventually times out and calls Abort.
Note that this does not occur when using CFE_ES_ResetCFE, only with CFE_PSP_Restart(CFE_PSP_RST_TYPE_PROCESSOR).
To Reproduce: Modify any app to call CFE_PSP_Restart(CFE_PSP_RST_TYPE_PROCESSOR) on command.
Expected behavior: A clean POR restart without the 10-second timeout and abort.
Code snips
System observed on:
Additional context: Stack trace from running threads at the time of the abort
Reporter Info: Lorn Miller, Red Canyon Engineering & Software