Application performance monitoring and deadlines

PeterFidelman commented 3 years ago

Is your feature request related to a problem? Please describe.

When developing a cFS application, I want to detect and correct performance problems as early as possible.
When integrating a cFS system, I want to observe application timing so I can confirm whether system performance matches my expectations.

Describe the solution you'd like

I would like an easy way to track performance statistics for each cFS application: specifically, the last, average, and max observed run time.
For systems using SCH (Scheduler), I would also like enough information to trace in detail when each application starts and finishes so I can compare the system's actual timing to the timing that I specified.
I would like all this to be possible without adding special performance instrumentation to the cFS application.

Describe alternatives you've considered

Here are some ways that cFE (and cFS) can measure application performance today:

Approach	Limitation
ES (Executive Services) Task Execution Counter	Tracks # of passes through application's main loop. Does not track detailed timing information.
ES (Executive Services) Performance Log	Tracks detailed timing information, but must be manually maintained (started and stopped) by the application.
HS (Health & Safety) CPU Utilization Monitoring	Tracks overall CPU utilization. Not broken down per cFS application.
HK (Housekeeping) telemetry messages published by each application	An application's HK telemetry message can measure and report any timing information it wants, but this must be implemented by each application, so it's not very consistent!

As a rule, existing methods are limited in that either (1) they do not track detailed timing information, or (2) they require application authors to manually instrument their cFS app and thus are not supported for all apps.

Additional context

Any solution must take into account the fact that a typical cFS application spends a lot of time idle, waiting for Software Bus messages. This means that simply instrumenting the CFE_ES_RunLoop() function won't give an accurate sense of how much CPU time is being consumed by even a simple application such as the SAMPLE_APP.

while (CFE_ES_RunLoop(...) == true)
{
    status = CFE_SB_ReceiveBuffer(...); // Blocks (idle) until the wakeup message arrives
    // Perform work ...
}

There are many possible solutions. My suggestion is to make CFE_ES_RunLoop() fire either an Event or an SB message signaling that each application has reached the top of its main loop (i.e., finished executing). Because application execution is normally triggered by a wakeup message as well, comparing the timing of the two messages allows measurement of application execution time.

time --> 

wakeup message sent to app         X          X          X     
CFE_ES_RunLoop() called by app          Y          Y          Y

application execution time         ^----^     ^----^     ^----^
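
To make the idea concrete, here is a minimal sketch of a run-loop check that emits a marker each time an application reaches the top of its main loop. Everything in it is hypothetical: Framework_RunLoop, LoopContext, LoopMarker, and the sink callback are stand-ins for illustration and are not cFE APIs. In a real system the sink might publish an SB message or an event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker: "app X reached the top of its loop at time T". */
typedef struct
{
    uint32_t app_id;
    uint64_t time_us;
} LoopMarker;

/* Hypothetical sink; a real one might send an SB message or an event. */
typedef void (*MarkerSink)(const LoopMarker *marker, void *ctx);

/* Hypothetical per-application state owned by the framework, not the app. */
typedef struct
{
    uint32_t   app_id;
    MarkerSink sink;
    void      *sink_ctx;
    bool       keep_running;
} LoopContext;

/* Stand-in for the framework's run-loop check. The application calls this
 * exactly as before; the marker is emitted without any change to app code. */
bool Framework_RunLoop(LoopContext *ctx, uint64_t now_us)
{
    assert(ctx != NULL);
    if (ctx->sink != NULL)
    {
        LoopMarker marker = { ctx->app_id, now_us };
        ctx->sink(&marker, ctx->sink_ctx);
    }
    return ctx->keep_running;
}

/* Example sink for illustration: remembers the last marker and a count. */
typedef struct
{
    LoopMarker last;
    uint32_t   count;
} MarkerLog;

void MarkerLog_Record(const LoopMarker *marker, void *ctx)
{
    MarkerLog *log = ctx;
    assert(marker != NULL && log != NULL);
    log->last = *marker;
    log->count++;
}
```

Because the marker is produced inside the run-loop check itself, the application's main loop stays unchanged, which matches the goal of not requiring special instrumentation in each app.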

A "statistics tracking" application could subscribe to both messages, compare their timings, and calculate/report any statistics desired, such as last, average, and max observed run time. Outsourcing calculations to an app means they can be easily customized or disabled per mission without modifying cFE.
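
As a rough illustration of what such a statistics tracking application could compute, the sketch below pairs each wakeup timestamp with the next RunLoop timestamp and maintains last, average, and max run time. The type AppTimingStats and the Stats_* functions are hypothetical names, and timestamps are assumed to be microseconds from a monotonic clock; none of this is existing cFE or cFS code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-application timing record kept by the statistics app. */
typedef struct
{
    uint64_t wakeup_us; /* time of the most recent wakeup message */
    uint64_t last_us;   /* most recent observed run time */
    uint64_t max_us;    /* longest observed run time */
    uint64_t total_us;  /* sum of all run times, used for the average */
    uint32_t count;     /* number of completed wakeup/RunLoop pairs */
    bool     running;   /* true between a wakeup and the next RunLoop marker */
} AppTimingStats;

/* Call when the wakeup message for this app is observed. */
void Stats_OnWakeup(AppTimingStats *s, uint64_t now_us)
{
    assert(s != NULL);
    s->wakeup_us = now_us;
    s->running   = true;
}

/* Call when the app's RunLoop marker is observed. */
void Stats_OnRunLoop(AppTimingStats *s, uint64_t now_us)
{
    assert(s != NULL);
    if (!s->running)
    {
        return; /* e.g. the very first RunLoop call, before any wakeup */
    }
    s->running = false;
    s->last_us = now_us - s->wakeup_us;
    if (s->last_us > s->max_us)
    {
        s->max_us = s->last_us;
    }
    s->total_us += s->last_us;
    s->count++;
}

/* Average run time, or 0 if no complete cycle has been observed yet. */
uint64_t Stats_AverageUs(const AppTimingStats *s)
{
    assert(s != NULL);
    return (s->count == 0) ? 0 : (s->total_us / s->count);
}
```

A mission that wants different statistics (min, histograms, frame-relative start/stop times) would only need to change this tracking code, not cFE.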

Side benefit: deadlines

I am fond of this particular implementation because it easily enables another feature: application deadlines. A deadline is an execution time bound triggering a configurable action. It can also be thought of as a "software watchdog". Deadlines are important because they allow unexpectedly long-running applications to be rapidly detected and can help mitigate the timing impact of such applications on the rest of the system.

Today, the closest analogous feature is HS (Health & Safety) Application Monitoring of the ES Task Execution Counter. This only detects applications that get "stuck" for a long time. Also, HS only monitors counters for liveness and does not check that they are incrementing at the expected rate.

Here is my suggested way to implement deadlines. The Scheduler (SCH) application assigns each scheduled app a deadline of configurable length L. If SCH sends the application a wakeup message at time T, it will expect to receive the application's RunLoop() message by time T+L. When the deadline is reached, if the application is not done, SCH fires a schedule overrun event. The event can be caught and used by HS (Health & Safety) or some other application.

time --> 

wakeup message sent to app         X              X        
CFE_ES_RunLoop() called by app          Y                       Y  

application execution time         ^----^         ^-------------^ 
deadline                           ^........^     ^........^ 
                                       L              L    | 
                                                           |
                                                           v
                                   (ok)         schedule overrun event
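
The deadline logic described above could be sketched roughly as follows. This is only an illustration under stated assumptions: AppDeadline and the Deadline_* functions are hypothetical names, times are assumed to be microseconds from a monotonic clock, and the scheduler is assumed to call Deadline_Check() periodically (for example once per minor frame). It is not existing SCH code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-application deadline state kept by the scheduler. */
typedef struct
{
    uint64_t length_us; /* L: configured deadline length */
    uint64_t due_us;    /* T + L for the current cycle */
    bool     pending;   /* wakeup sent, RunLoop marker not yet received */
    bool     reported;  /* overrun already reported for this cycle */
} AppDeadline;

/* Wakeup sent at time T: arm the deadline at T + L. */
void Deadline_OnWakeup(AppDeadline *d, uint64_t now_us)
{
    assert(d != NULL);
    d->due_us   = now_us + d->length_us;
    d->pending  = true;
    d->reported = false;
}

/* RunLoop marker received: the application finished this cycle. */
void Deadline_OnRunLoop(AppDeadline *d)
{
    assert(d != NULL);
    d->pending = false;
}

/* Periodic check. Returns true exactly once per overrun, at which point
 * the caller would fire the schedule overrun event. */
bool Deadline_Check(AppDeadline *d, uint64_t now_us)
{
    assert(d != NULL);
    if (d->pending && !d->reported && now_us >= d->due_us)
    {
        d->reported = true;
        return true;
    }
    return false;
}
```

Reporting each overrun only once per cycle is a design choice made here to avoid flooding the event stream; a mission could just as well repeat the event or escalate after several consecutive overruns.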

Note: I've presented a lot of detail here. I'm not tied to any of the details. My goal is to present a starting point for further discussion of whether these features are useful, and for any resulting implementation to be consistent with cFS's architecture.

Requester Info

Peter Fidelman - Blue Origin

These ideas were originally presented during a talk at Flight Software Workshop 2021 (slides).

skliper commented 3 years ago

Great suggestion! I really like the capability to track in an app; it allows for easy extension/customization on top of a basic capability of last/max/min/average timing with reset/clear and deadline overrun notification. It's easy from the cFE side since it's just adding SB messages or an event. I've utilized similar patterns on previous projects (non-cFS) with great success (extremely useful for maintaining performance during development and tracking margins vs deadlines).

skliper commented 3 years ago

IIRC I also did frame relative start/stop time tracking (last, max, min, average). This was a very timing sensitive project w/ tight deadlines and impacts based on integration times in a detector so any shifts had real impacts on data.

jwilmot commented 3 years ago

Agreed, great enhancement. Should be able to do all underneath the cFE "hood" so apps are not aware and require no modifications.

jphickey commented 3 years ago

This seems like an extension of the ES performance log to me. It's already there, does (or could be used to do) most of what is described here. Perhaps just a couple things are missing:

auto start the perf log at boot and run indefinitely
allow per-task perf IDs; currently perf IDs are global and assumed to be used from only a single task, so they can't be put into something like CFE_ES_RunLoop() or CFE_SB_ReceiveBuffer() directly.
the "trigger" logic can be used to issue a callback if certain conditions are met, such as one thing starting before another finishes (e.g. deadline expired case).

PeterFidelman commented 3 years ago

All,

Thank you for the responses! I agree there are multiple ways of implementing this feature, each with its own benefits and drawbacks. @jwilmot's suggestion of making the performance monitoring work "underneath the hood", without special changes to applications, is exactly what I was going for. @jphickey's suggestion of using the performance log is also an interesting place to put the feature, so long as logging "indefinitely" isn't going to cause performance or timing issues of its own.

I'll watch this space to see if the issue gains traction. Maybe I'll get some time to work on a patch myself, but I'm not actually sure exactly when I need this feature. If that day comes, I will submit a patch. However, it's possible that some other project will "beat me to it"! I think it would be useful for anyone who wants to use cFS in a timing-critical use case.

nasa / cFE

Application performance monitoring and deadlines #1214