nasa / PSP

The Core Flight System (cFS) Platform Support Package (PSP)
Apache License 2.0

cES1503 requirement compliance (clear volatile file system) relies on actual power cycle or wrapper logic not included in bundle #211

Open dmknutsen opened 3 years ago

dmknutsen commented 3 years ago

Describe the bug

Current requirement verbiage: Upon a Power-On Reset, the cFE shall clear the Volatile File system.

When running in a context where a power on reset command does not actually cause a true power on reset (typical when running locally from a command line in a development context), the bundle does not autonomously cause a power cycle or clear the volatile file system, due to the risks of doing this on a development platform. A "wrapper" or some other external action is necessary in this context to be compliant with cES1503. Typical real deployments have the wrapper logic to cause a power on reset (which clears the Volatile File system).

Note this also means processor resets and power on resets when running from the command line don't actually restart the software. Typically there's a background service (systemd or similar) that would perform those actions.

To Reproduce

On a Linux system:

  1. Place a file in the volatile file system - /dev/shm/osal:RAM/
  2. Start the software with -RPO (Power-On Reset) option.
  3. Verify that the file still exists.

Expected behavior

Example wrapper logic should be included to demonstrate a pathway to compliance with this requirement.

Other potential solution - provide a setup (docker? VM?) in which this could be implemented/demonstrated.
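A minimal sketch of the kind of wrapper logic requested under "Expected behavior", assuming the volatile file system lives in directories under /dev/shm whose names start with `osal:` (as in the reproduction steps). The function names, the `command` argument, and the choice to empty the directories rather than remove them are illustrative assumptions, not part of the cFS bundle.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: volatile disks appear as /dev/shm/osal:<name>, matching the
# /dev/shm/osal:RAM/ path used in the reproduction steps.
VOLATILE_PARENT = Path("/dev/shm")
VOLATILE_PREFIX = "osal:"


def clear_volatile_fs(parent=VOLATILE_PARENT, prefix=VOLATILE_PREFIX):
    """Empty every directory under `parent` whose name starts with `prefix`.

    Returns the sorted names of the directories that were emptied.
    """
    parent = Path(parent)
    cleared = []
    if not parent.is_dir():
        return cleared
    for entry in sorted(parent.iterdir()):
        # Only real directories with the expected prefix; never follow symlinks.
        if entry.is_symlink() or not entry.is_dir():
            continue
        if not entry.name.startswith(prefix):
            continue
        for child in entry.iterdir():
            if child.is_dir() and not child.is_symlink():
                shutil.rmtree(child)
            else:
                child.unlink()
        cleared.append(entry.name)
    return cleared


def power_on_start(command, parent=VOLATILE_PARENT):
    """Clear the volatile file system, then start cFE with the -RPO option.

    `command` is the argument list that launches cFE (hypothetical here).
    Returns the exit status of the launched process.
    """
    clear_volatile_fs(parent)
    return subprocess.call(list(command) + ["-RPO"])
```

Keeping the clearing step in a wrapper like this, outside the cFE process, means a developer has to opt in to the deletion rather than having it happen on every start.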

System observed on:

OS: ubuntu-19.10
Versions: cfe: v6.7.0+dev295; osal: v5.0.0+dev247; psp: v1.4.14.0

Reporter Info

Dan Knutsen, NASA/Goddard

EDIT - Updated per CCB discussions by Jacob Hageman

jphickey commented 3 years ago

I would disagree that this is really an issue with CFE, though .... perhaps it can be fixed by changing the wording of the requirement or clarifying this a bit better?

The sequence described above is flawed -- a "power on reset" is a hardware function. On a platform like MCP750 it queries a register to get this info, but on pc-linux it believes whatever the user said on the command line.

The sequence in the description doesn't do an actual power-on reset. A real power on reset involves actually restarting the board/VM (i.e. a hardware operation), then the -RPO option only informs CFE that this was done. So in this case it is starting CFE using the PO reset option when no actual power on reset was done.

CFE on pc-linux allows/accepts that for debug purposes (it would be impractical to restart your development box every time you restart CFE) but it isn't really valid/correct.

To clarify - the "clearing" of the volatile FS isn't ever done by CFE - it is done by the hardware, because the content of the volatile FS isn't preserved across a power on reset, by definition. If you actually reset your board/VM, it will be cleared.

skliper commented 3 years ago

We should discuss. The expectation after sending a power on reset command is that the volatile file system is cleared the next time the software runs, whether or not there is a physical power cycle. That holds even if, for example, on Linux you then start the software with the processor reset flag. Same thing if you stop the software with a processor reset command and start it with the power on reset flag: the expectation is for the volatile file system to be cleared.

jphickey commented 3 years ago

Yes, and if it's implemented properly on a real FSW target, then when the power on reset command is issued, it will result in an actual power-on reset being performed. Not just restarting the CFE process - but a complete power on reset aka "cold" reboot.

On Linux, restarting only the CFE process is akin to a processor reset (aka "warm" boot). The Volatile FS is preserved as expected, because the kernel/board does not reboot, only the CFE process does. (Yes, although termed a "processor" reboot, the physical processor does not reboot in this case - I just consider that a remnant of applying single-process RTOS logic to a multi-process system like Linux. Some translations aren't perfect, but the intent/concept is correct, I think.)

Note that PC-Linux PSP has (or should have?) a proper exit code to tell a parent system integration process whether to invoke a PR or PO reset. So when the CFE process exits, the parent process, which I would expect to be tied into the Linux init system (e.g. systemd on modern distros), should look at this status:

Restarting CFE core with the "-RPO" flag when no actual power on reset was done is only going to happen in a debug/lab environment when running on a development board, or if the system integration layer is broken. If the CFE is properly connected/integrated into a Linux-based init system it should set this flag correctly.
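As a rough illustration of the parent "system integration process" described above, the sketch below maps the exit status of the CFE process to the next action. The exit-status values, the `supervise` function, and its callback arguments are all hypothetical; the actual status values used by the pc-linux PSP are not given in this thread and would need to be substituted.

```python
# Hypothetical exit-status values. The actual values used by the pc-linux PSP
# are not given in this thread; substitute the real ones.
EXIT_PROCESSOR_RESET = 10
EXIT_POWER_ON_RESET = 11


def supervise(run_cfe, reboot_platform, max_restarts=3):
    """Run cFE repeatedly, reacting to the exit status of each run.

    run_cfe(reset_type) starts cFE, telling it which reset ("PO" or "PR")
    already happened, and returns its exit status.
    reboot_platform() performs the real cold reboot of the board/VM.
    Returns "rebooted", "stopped", or "restart-limit".
    """
    reset_type = "PO"  # the first start after boot follows a true power-on
    for _ in range(max_restarts + 1):
        status = run_cfe(reset_type)
        if status == EXIT_PROCESSOR_RESET:
            # Warm restart: only the process restarts, volatile FS is kept.
            reset_type = "PR"
            continue
        if status == EXIT_POWER_ON_RESET:
            # Cold restart: the platform itself clears volatile storage.
            reboot_platform()
            return "rebooted"
        return "stopped"
    return "restart-limit"
```

In this arrangement the reset flag passed to CFE only reports what the supervisor already did, which is why "PO" is passed only on the first start after a real boot.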

My recommendation is just to clarify that the "clearing" of the volatile disk memory/FS is implemented by the hardware/platform. CFE just uses what it is given.

And I also consider the "system integration process" I'm referring to above to be out of scope for the framework. We provide an example for VxWorks (in psp/fsw/mcp750-vxworks/src/bsp-integration) but don't provide one for linux or RTEMS. I've written one for some projects but it by definition has to be customized for the specific distribution you are using and how it handles init/shutdown/reboot stuff. It has to exist in some form for a real flight-like Linux deployment, but not during development/debug.

skliper commented 3 years ago

It's not always implemented that way. I've had projects with the requirement that a power on reset command restarts the software in the state it would be in after a power cycle. It isn't always meant as "do a power cycle"; it's "reset into the condition you would be in after a power cycle". Otherwise you could just do the power cycle... this is sometimes used when you don't want to do a power cycle, but want the software to come up in a known state (not a processor reset with CDS retained, etc).

jphickey commented 3 years ago

So instead of a warm boot or cold boot - something like a tepid boot? :-)

You've got drivers and kernel modules that should all (in theory) get reset with a "power on reset". That is supposed to be the definition of a power on reset - or at least how I've always interpreted a "power on" reset.

If a project hasn't actually implemented "power on reset" that way, that's on them I guess ... but for the purposes of CFE requirements I think this middle-ground just complicates matters and introduces more variables. For instance, how would you be sure if you've cleared/reset everything that would normally be done on a power on reset, without actually doing the power on reset? What about resources with kernel persistence beyond just the RAMDISK? A real deployment probably uses resources such as ADCs and GPIO that the CFE core couldn't reset, because it doesn't directly know about them.

My recommendation is to keep it simple - power on reset should be a complete power on reset, defined as a full restart of the CPU+board.

If a project wants to do something else, that's fine - they can implement it in their BSP integration layer (the one I described above) and make it not actually do the reboot, but just clear out what they think needs to be cleared (e.g. the RAMDISK) and restart only the CFE process. I would discourage it, but it can certainly be done.

skliper commented 3 years ago

Either way, I agree what the "system integration process" does is out of scope... and I'm also aware it really wouldn't be cleaning up all the resources that a real power cycle would. But I would expect that when the cFE is restarted after a power on reset command is sent, or with the -RPO flag, the volatile memory system is in a pristine state (whatever it would look like after a power on). Couldn't the software be updated to meet that expectation? Doing anything less sounds like a requirements failure to me, and the requirement is very specific (it doesn't say clear all resources, it's specifically the volatile memory).

skliper commented 3 years ago

Or I could be easily convinced this is a PSP thing... not really a cFE requirement. Either way though I think clearing the RAMDISK as part of a power on reset command or startup is the expected behavior. Maybe @jwilmot or @acudmore can fill us in on this requirement?

jphickey commented 3 years ago

I think maybe we just need to clarify the intent/purpose of the -R/--reset command line switch on pc-linux?

In my interpretation, this is a mechanism for the BSP or "system manager process" to communicate to CFE core the type of reset that it already did. Not a method of telling CFE what to do. CFE startup might be "informed" of what to expect (i.e. whether to even try re-using a CDS/ramdisk/etc) but it doesn't take an active role. The system manager/systemd/etc does the work of reset and tells CFE what to expect.

skliper commented 3 years ago

> CFE startup might be "informed" of what to expect (i.e. whether to even try re-using a CDS/ramdisk/etc)

Right. When given the power on reset command line switch, the CFE should not try re-using CDS/ramdisk/etc. That's the failure here as I see it. Similarly, if a power on reset command is sent, I'd expect that even if CFE is given the processor reset command line switch the next time it is started, it shouldn't be able to find the CDS/ramdisk (the user requested power on, so no artifacts should be left over).

jphickey commented 3 years ago

Here's another take:

One system I've worked on stored its data in a "vault" of sorts which had EDAC codes for protection. There was no conventional non-volatile storage. When you booted, it would effectively unpack the application into the ramdisk, i.e. pre-populating the data, and THEN start CFE. So if CFE wiped this out during its boot, that would clobber all the apps/tables that had been put there.

I still maintain that the BSP/PSP should be responsible for clearing the data, not CFE. CFE should use what it is given by the PSP here.

jphickey commented 3 years ago

Also -- a big reason why I recommend keeping it the way it is on Linux, is because the RAMDISK is mapped directly into the filesystem. There is an inherent danger in doing something equivalent to rm -rf at boot time ... It is supposed to be at /dev/shm/OSAL:<something>... but what if there was a bug and you removed the entirety of /dev or $HOME or /? Bad stuff.

Sure, we could implement an rm -rf at CFE boot time - but is that really a good idea, if we don't really really need to do it? It's a dangerous operation. Much safer to punt this problem to the BSP/system integrator level, rather than have CFE core do this and risk blowing away users' home directories on their workstations.
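To make the risk concrete, here is a sketch of the kind of guard any recursive delete would need before touching the RAMDISK path. It assumes the RAMDISK is a direct child of /dev/shm with a name beginning with `osal:` (compared case-insensitively, since both `osal:` and `OSAL:` appear in this thread); the function name and parameters are hypothetical.

```python
from pathlib import Path


def is_safe_to_purge(candidate, allowed_root="/dev/shm", prefix="osal:"):
    """Return True only if `candidate` is a direct child of `allowed_root`
    whose name starts with `prefix` (compared case-insensitively).

    Both paths are resolved first, so "..", symlinks, and relative paths
    cannot be used to escape the allowed root.
    """
    root = Path(allowed_root).resolve()
    path = Path(candidate).resolve()
    return path.parent == root and path.name.lower().startswith(prefix.lower())
```

A check like this rejects `/`, the allowed root itself, and anything that resolves outside it, but it only narrows the failure modes; it does not remove the argument for keeping the purge out of CFE core.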

skliper commented 3 years ago

I do agree it's a PSP thing. If you are running cFS with privileges, I sure hope you aren't storing your bitcoin wallet on that machine...

jphickey commented 3 years ago

No privileges even required ... I usually run CFE as my regular user, so it's capable of deleting anything owned by me. I'm generally uncomfortable with anything doing recursive/wholesale delete operations.

If someone wants to do a purge like this on a PSP that is limited to an FSW target - that's fine - but doing it within CFE on general dev systems that are highly likely to be storing other data is just unnecessary.

So if the requirement can be pushed to the PSP - that's good (sounds like we agree on that).

astrogeco commented 3 years ago

CCB 2020-10-21

We can clarify this as missing behavior on Linux. Our PSP doesn't do it because of the expectation that Linux is used for development and clearing these files would be dangerous to do in that environment.

How to best communicate that this does not work on Linux? VDD? Event message?

This should be handled by a "wrapper" function that the user provides. How to best document or share this knowledge?

jphickey commented 3 years ago

I would not even say "Linux" - as it's entirely possible to make a Linux system that is compliant. It just isn't compliant if you are running CFE core from the command line on your desktop. You can also make a VxWorks or RTEMS system non-compliant if you don't invoke the power on reset routine where indicated on those platforms.

Rather than saying "Linux" - which is misleading - I'd rather say "development environment" or something of that nature.

skliper commented 3 years ago

Transferred to PSP and attempted to update as suggested by CCB. Open for comments/suggestions, and I suggest this caveat be included in test reports (and touched on in VDD). We haven't done test reports for Linux in the past... but I'd like to get there (along with RTEMS and the typical VxWorks test report).

jphickey commented 3 years ago

I would suggest if generating test reports against Linux targets, to do so by running CFE on a dev board like a Raspberry Pi or BBB, or within a VM which allows for a more flight-like setup.

For instance, one can use the Yocto build system to generate a Linux base image for one of their example qemu-based targets (or RPi), then deploy the CFE build to that system, and run the tests against that target.

Notably, this permits commands that require psp/system support, such as restart, to work correctly.

Running the cfe core executable directly from a command line on a dev box isn't really a valid-enough deployment scenario to test these more system-oriented commands.

skliper commented 3 years ago

Actually we just discussed this today in a test meeting, and it would be fairly easy to fix even in the development environment when using CTF, as long as a true power cycle isn't necessary (which it's not if the wrapper logic clears the volatile file system). I'd rather not add complexity to the CTF test development at this point, since we benefit from keeping it as simple as possible. Doing your suggestion from a real (or at least more realistic) testing perspective is a good goal to continue pursuing (along with RTEMS and VxWorks); it's just not a priority for the cert work. Either way, let's retain this issue, since it is a behavior on a dev box that will not meet this requirement and I think it's valuable to communicate this non-compliance.