assert() in main-line code

CDKnightNASA commented 4 years ago

Is your feature request related to a problem? Please describe. In unit tests, we use assert statements.

A lot of our code performs a number of safety checks (pointer is not null, index is less than array size, etc.) I suggest we develop something similar to the UT assert code, and use as much as possible in the main-line code base in lieu of our current safety checks as it will improve readability and support tooling of code analysis.

Describe the solution you'd like Assert library should generate events, return values, and will need to be able to unwind such things as semaphores.

Describe alternatives you've considered Implementing this will introduce risk that we miss a safety check or otherwise mis-translate the check into an assert. Need to develop the library and slowly migrate current safety checks to the library.

Additional context Add any other context about the feature request here.

Requester Info Christopher.D.Knight@nasa.gov

jphickey commented 4 years ago

I would contend that we expend too much effort trying to ensure that the user application isn't misbehaving. The reality is that we've published the API specifications and if a user violates those specifications (by bugs or otherwise) then it is IMO acceptable that the software crashes.

I actually like the idea that we put this error-checking argument validation stuff in a macro which could be turned off for "normal" builds, maybe enabled only during early development. Basically similar to what assert.h does but something that will play nicer cross platforms and with our UT environment.

CDKnightNASA commented 4 years ago

I agree with your comments, Joe. I think that a lot of our error checking code is akin to asserts, so having them being formal assert statements rather than "if (ptr == NULL) { SendEvent(MSG); status=ERROR}" it would be normalized and simplified to an "assert_true(ptr != NULL, MSG, ERROR)" or somesuch.

jphickey commented 4 years ago

Yes - I would take it a step further though and basically abort (crash) if an assert fails, rather than returning early with a code, for many reasons:

In a debug environment, you WANT to be able to get the feedback at the earliest point while you can still do a backtrace to determine what went wrong in post-mortem analysis.
The non-success response code is almost certainly not checked or handled by the app (after all, if it called the API with a bad value, this means its either coded incorrectly or in a bad state itself, it is not reasonable to expect it to "gracefully" handle this error).
Even in the unlikely event that the app does check the error code, the vast majority of the time, it will simply exit itself, further obfuscating the issue because now none of the important state info for debugging is left on the stack. This makes it impossible to do post-mortem analysis. You are then in a state where the developer has to reproduce the issue which is very time consuming and may not even be possible.
In a non-debug environment (flight-like) if the error occurs it means things are already broken. It is very unlikely that the FSW will be operating correctly after such an error occurs, and attempting to continue running will basically be unspecified behavior. So even if we "compile-out" the assert statements in the FSW (non-debug) mode it isn't really changing the problem or making it any worse than it already is. It will be necessary to do at least a processor reset to get things functional/deterministic again, and better to do that sooner rather than later.

skliper commented 4 years ago

I'm a bit concerned with an approach where the core software "crashes" based on bad parameters to an API or lacks checking of those parameters. That really is opposite to many of the design principals, processes, and requirements levied on us for developing certifiable software. Definitely report the error (get your early feedback), and I'd be fine with the option to abort/exit (could redefine the assert actions as needed), but I don't think forcing an exit/abort is a good idea.

jphickey commented 4 years ago

assert statements are optional.

When built in "release" mode they are compiled out. They only have an effect when building in "debug" mode. This is the standard behavior of the standard assert.h and there should be a similar behavior of whatever is implemented here.

The reality is there isn't much value in constantly expending CPU cycles to validate parameter inputs from app code that probably isn't checking/handling the error return code anyway. Again, if the app is buggy enough to pass in a NULL string, is it reasonable to think it will properly handle the OS_ERR_INVALID_POINTER error? I think not. It only obfuscates the issue because the call stack/backtrace is gone by the time things really go awry (which they will) and it cannot be debugged easily.

During early debug, when these problems are most prevalent, the preference should be to catch it early and get a backtrace. During deployment/flight, this should never happen, because the code should be debugged already, so it can be compiled out.

skliper commented 4 years ago

Removing input validation is in direct violation (as interpreted by GSFC) of SWE-134 part g and k from NPR 7150.2C as well as the GSFC software development and unit test standards required by SWE-061, SWE-062. Feel free to argue it with our software division chief engineer, but I will not accept software modification that reduce or eliminate the validation of parameter inputs for APIs or commands.

This really is a non-starter, unacceptable design approach, etc. I'm OK with improving the design related to the checks, which is what I understood the origin of this issue to be.

ghost commented 4 years ago

I have a couple of thoughts:

I was wondering if the standards required parameter validation, but @skliper just answered that question.
I can definitely see the benefit of turning on a debug mode so we can catch errors as early as possible during development. Ideally the reviewed and debugged apps will not pass invalid parameters, but it's still possible to have environmental effects (radiation, degrading hardware, etc) that end up corrupting parameters. In these cases, it is beneficial to have as many breadcrumbs leading back to the original corruption as possible. Still, @jphickey makes a good point of being able to catch the problem via immediate crash/exception vs. returning a parameter validation which could end up with the app exiting and losing the context of the error. Perhaps the work needs to go back into making sure there are meaningful actions taken by the apps in response to the parameter errors? Maybe we should make more use of the syslog to leave the necessary trail for debugging an error in flight?
Do we really gain that much performance by removing parameter validation? ( does not need to be addressed, if it's a non starter from the standards.. ).

The assert based macros could certainly make our parameter validation and error handling more consistent, and it could enable the debug mode that makes catching these development errors easier. Is at least that step compatible with the standards and goals stated in this thread?

CDKnightNASA commented 4 years ago

I am thinking of taking this step-wise...First replacing if(CHECKFAILS) { status=ERR; SendEvent(MSG)} with assert(CHECK, ERR, MSG) standardization, where the behavior is the same as current (sets status, generates event) and then consider whether asserts should fail more aggressively. Right now a lot of our code (like our UT code) has a lot of very similar check logic that makes it verbose and difficult to read and impossible to change the error-handling logic globally.

jphickey commented 4 years ago

I think the point is to macro-ize the behavior associated with the parameter checking, so it can be can be handled consistently but also tuned to the appropriate mode of operation.

If removal of a parameter check violates some coding standard, then use a macro implementation that preserves the check. This would be precisely why we would not use the C library "assert.h" for this but rather have a local definition so more control can be retained. (apologies if the statement regarding "similar behavior" to C library implied compiling it out, I really meant that it it can be tuned as to whether it creates an abort/backtrace or not depending on whether you are in a debug build or not)

Note that parameter checking is hardly conclusive, particularly for pointers. The code usually checks for NULL but in the case of environmental effects causing corruption the likelihood of NULL is pretty low, it is much more likely to result in some other value which is simply pointing to bad memory that will cause a crash anyway.

Integer ranges can be validated, which is why I usually try to prefer this for APIs rather than free-range pointers.

Still - the fundamental issue remains - if some environmental degradation/radiation/etc has caused an app to pass a bad value to an API, why do we think the app is healthy enough to respond to and handle this error response in a sane manner? It is running on the same processor and shares the same memory. What are the chances that the degradation only affected one variable? What if it hit program space and not just the stack/data space item?

Has there ever been any case where an app actually is able to "recover" itself from such a condition as part of its own runtime (i.e. by getting an parameter validation error from a library and cleaning itself up?) The whole premise seems very weak. A processor restart is required in order to bring the system back to operation, I just don't see a way around this.

jwilmot commented 4 years ago

For consideration. From a general safety and robustness point of view, a user app, or command/telemetry interface should not be able to take the core (cFE, OSAL, or PSP) out or otherwise cause a reset. I would break interface parameter checking into broad three categories: 1) Messages interfaces (CMD/TLM) should always be parameter checked for range, and size. These are pretty straight forward. Don’t execute the invalid cmd or use the invalid data and send an event. The core and all apps should do this. 2) Internal nonpublic core APIs don’t need to be checked unless required by standards. Given that #1 has been implemented these can only occurred due to bugs which mean the core function is unavailable, the best recourse is to log (syslog?) and restart since the fault will likely have many side effects as the need function is unavailable. 3) External public core APIs should have more robust checks. Here the core is functioning but a user app has failed. The recourse is to return an error with any return parameters set to safe values (does not matter if the caller looks) log the error (syslog, event, both?) If the calling app was critical the loss of app function should be detected at the system level. This is also where the app function software safety hazard analysis plays in. Apps identified in that analysis should have additional requirements.

Notes: 1) All spaceflight systems have a level of data protection (CRC/EDAC) such that radiation faults are masked or handled in hardware/interrupt software. The core and apps should not be protecting against bit flips. 2) All spaceflight systems that share a memory space are developed to the same NPR Class (A,B,C,…) A bug in one can be a bug in all. That said, good practice is to follow 1, 2, and 3 above. 3) Systems that have different Class levels on the same platform will need to implement partitioning, which then looks like note 2 above within the partition The previous comments all have good points. We should try and have a consistent approach with the somewhat conflicting goals of increasing robustness, and minimizing complexity.

CDKnightNASA commented 4 years ago

All spaceflight systems have a level of data protection (CRC/EDAC) such that radiation faults are masked or handled in hardware/interrupt software.

I am assuming this is true of rad-hard systems. And I'm more inclined to assume SEU's are outside the scope of cFS to handle.

As an aside, when I was at JPL we were discussing how the COTS CPU/memory/microcontrollers fared in rad testing, and they said they did well--they are thinking partly because of the small die sizes and low power, so a hit doesn't cause a cascade (ich bin ein software geek, so I know little of the physics). I'm sure they don't have CRC/EDAC beyond what COTS hardware already includes. 'Of course, this is a high-risk experiment but will be interesting to see how it does. Flights will be brief (~90s) but the captured data will sit in the copter for a Sol so it can charge enough to transmit the data to the rover.

jphickey commented 4 years ago

I completely agree that SEU's/environmental degradation should be outside the scope of CFS.

Correction for memory errors has to exist at the hardware level within the memory controller so it can be synchronized with software access. Any spaceflight qualified hardware must have ECC/EDAC built into its memory subsystem.

The only time CFS plays a part is for apps like a memory scrubber - which IMO the primary purpose of this would be to ensure that the memory is accessed enough to ensure that a correctable error doesn't further degrade into an non-correctable error (many ECC/EDAC systems only perform correction when a memory location is accessed, so having a background task slowly "read" all memory can be beneficial). But the memory scrubber wouldn't actually do the correction - it cannot, and it cannot synchronize itself with other threads that might be running.

In short, if a degradation causes the error to be observed at the software level where CFE/apps are running, it's too late -- the system is too far degraded to recover at this point. We certainly cannot rely on the same software to fix itself.

skliper commented 4 years ago

We need a path forward on API asserts.

skliper commented 4 years ago

Discussed again, separate flag to be able to compile out or use (or action can be redefined) and not dependent on build type.

The ASSERTs are only used for confirming code that should never be broken (null pointers, etc). Things that should never happen.

Alternative - return error codes like we do now... and discussion was they are not exclusive, do both (error return and configurable ASSERT).

Where to put re-definable mission ASSERT logic? Thinking of "module" concept where it is built into the framework.

astrogeco commented 4 years ago

CCB 2020-08-05: See comment above for discussion notes and minutes

skliper commented 3 years ago

We've now got BUGCHECK, BUGREPORT, ARGCHECK, LENGTHCHECK, etc as provided by osapi-macros.h if we wanted to use them (just do OS_printf's... and support aborts for debugging). Another option along with CompileTimeAssert.

nasa / cFE

assert() in main-line code #520