Fix #55, add timeout to SB receive

jphickey commented 3 years ago

Describe the contribution Apps should not pend forever on software bus receive, because they should also periodically check CFE_ES_RunLoop even if no commands were received.

Fixes #55

Testing performed Build and sanity check CFE, confirm sample_app runs correctly

Expected behavior changes Sample App will check CFE_ES_Runloop() (and therefore respond to a request from ES) even if it does not get any message on the software bus.

System(s) tested on Ubuntu 20.04

Additional context As a "best practice" all apps should do something like this, instead of pending forever.

Contributor Info - All information REQUIRED for consideration of pull request Joseph Hickey, Vantage Systems, Inc.

skliper commented 3 years ago

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

jphickey commented 3 years ago

Is this really "best practice" vs just another possible implementation? Could be more confusing in terms of performance monitoring, really just another thing to monitor/manage when the app should be getting messages at a regular rate already, if that regular wake-up isn't happening there's bigger issues than allowing the app to process the run loop. The timeout really is now asynchronous relative to regular scheduling so could set up weird races in certain conditions...

Disagree on that, it follows the normal schedule unless there is no activity for 1000ms, then it wakes up to check CFE_ES_RunLoop and then continues waiting. It will still follow the precise schedule from SCH if the app is scheduled in this method. There really isn't much downside to this pattern that I can see.

The main point (and why I called it "best practice") is that apps should be checking CFE_ES_Runloop periodically, no matter what other work they are doing - because this is the only way they can "notice" an admin request like CFE_ES_STOP_APP_CC etc.

Sure, one could ensure that the app is also scheduled by SCH to do the same so it never pends forever inside CFE_SB_ReceiveBuffer, but now if SCH has an issue then one will lose the ability to control the app. And also how can you restart SCH itself?

I suppose the comment about the possible extra ticks in the performance log is valid, but if this is a concern we can always move the Entry/Exit to be inside the if (status == CFE_SUCCESS) branch, so it only shows up in perf log when a real command was pending. But I don't think this is a big concern, because it is "honest" - compared to the alternative of not being able to shut down the app from ES command.

skliper commented 3 years ago

Still not buying the justification for additional complexity and promoting as the "best practice". I see it as app developers (or operational app management owner's) choice. If you want to pend, pend. If you want to pend w/ timeout, do that. If you want to just poll your pipe and process low priority work in the background, then do that. It seems like the trade is well known, I don't see a huge benefit to timeout in the sample app unless there's a clear operational need. The app is already periodically checking CFE_ES_Runloop based on the periodic SCH message, it's only off-nominal cases where it wouldn't... and users can select whatever behavior they want based on their app implementation for those off-nominal cases.

skliper commented 3 years ago

Or are we saying every app should nominally check CFE_ES_Runloop at 1 Hz (either by a scheduled message or a time out)? Seems like it's still developer's choice as to how that is satisfied... but if there's a rate at which it should be checked maybe it makes more sense in that context vs if it's a timeout or not?

jphickey commented 3 years ago

Here's the thing - it's important to call CFE_ES_RunLoop(), regardless of whatever else your app my be doing.

That is just the design of CFE - apps need to cooperate in order to respond to an admin request, and this function is the only way that info is communicated to the app. So apps need to make sure they cycle through their run loop fairly regularly.

As a general design principle, the thread that is responsible for calling CFE_ES_RunLoop shouldn't ever "pend forever" on anything. It could be a file read/write, a socket operation, anything.

Rule of thumb should be:

If you pend on anything in main loop, use a timeout
If you want to pend forever on anything, use a child task to do so

This seems pretty simple to me, and makes sure your apps behave even if other apps do not behave (i.e. self sufficient design). Correct behavior of one app in this regard should not depend on external stimuli. But maybe that's just me.

This seemed like a very simple way to ensure that we call CFE_ES_RunLoop and can control that app even if SCH goes out to lunch. I am surprised it has met pushback. I don't really care that much, if you don't like it, drop this PR.

jphickey commented 3 years ago

The absolute maximum period that an app can delay between calls to CFE_ES_RunLoop() is the time delay of the background task between the request and the forced shutdown, which is about 5 seconds in default config. But that doesn't leave any time for a graceful shutdown. So in practice you'd want to use a shorter value. I just tossed in 1 second for example, but it can be anything.

In most non-trivial apps there is more work for the main loop to do than just pend on the SB receive. Even the CI_LAB/TO_LAB do more work here. So a general pattern of putting a timeout here in the main loop, again I can't see how it really hurts anything, I'd rather have it polling extra times than not enough.

skliper commented 3 years ago

I don't have a problem with a timeout, certainly doesn't hurt. Although the only "correct behavior" the change seems to supports is the apps response to management requests in the absence of it's nominally scheduled message. If you can pend forever in a child then you are setting up a similar failure scenario which requires forceful cleanup, just at the child level. Why force an app to have a child if the app really should just pend? Again, I have no problem at all providing this as an example or even recommendation, but it's not the only way to do things nor do I see it as a requirement that all apps time out (if you allow children to pend which could require forceful cleanup then why not allow apps to also rely on the forceful cleanup if that's the desired behavior?)

jphickey commented 3 years ago

True it is not the only way to do things, but in general one shouldn't rely on a complex path of dependent logic (i.e. SCH running, sending to the command, successful subscription, etc) in order meet the requirement of checking CFE_ES_RunLoop on a regular basis.

There are lots of ways were relying on SCH (or anything else) to make a forever pend into not being forever can go wrong. The MsgId could be wrong, the subscription can fail for a number of reasons. True that most/any of those conditions are bugs - but does one want to be left in a state where the app has essentially gotten "hung" and we need to force-delete?

I'm not as worried about pending forever in a child task because as long as the main task reacts to the pending ES request, it can do whatever it needs to do to wake/clean up the child task. The point is that the app itself is in a better position to know if its dangerous (or not) to do a force delete than ES is, in the course of a normal shutdown.

This is why I referred to it as "best practice" earlier - not implying there aren't other ways of doing it, but a robust design pattern shouldn't depend on several external apps all being running and properly configured in order make basic, important control operations like CFE_ES_STOP_APP_CC/RESTART_APP_CC/RELOAD_APP_CC etc work as intended.

skliper commented 3 years ago

I do agree it makes sense for sample app based on how it's currently scheduled and how we use it. I also agree any app for which forced delete should be avoided needs to check CFE_ES_RunLoop (regardless of external triggers) at a rate faster than the forced delete timeout. As long as the comments are updated to reflect that I'm fine with the change.

astrogeco commented 3 years ago

CCB:2021-03-31 APPROVED

is the documentation "too forceful"?

astrogeco commented 3 years ago

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

skliper commented 3 years ago

Update comments. "Best practice" should be to implement requirements and minimize complexity, not always timeout/never pend. There are plenty of cases where a pend is appropriate based on system context/design and comments shouldn't imply otherwise. Requiring a timeout is applicable to a system in which you need to do app management and graceful app cleanup in absence of the "trigger" (note likely still need to handle forceful shutdown/exit via exceptions/resets/power cycles/etc).

@zanzaben was this addressed?

Those were my comments. Not addressed yet, I'll provide suggested updated comments (or feel free if anyone else wants to take a shot at it).

astrogeco commented 3 years ago

So, to merge or not to merge?

skliper commented 3 years ago

Don't merge until resolved. Thx.

skliper commented 3 years ago

Marking to ignore until we get the requested change.

astrogeco commented 3 years ago

CCB:2021-04-28 continue postponing

nasa / sample_app

Fix #55, add timeout to SB receive #144