nasa / cFE

The Core Flight System (cFS) Core Flight Executive (cFE)
Apache License 2.0
413 stars 204 forks source link

Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG #2107

Open igorluppi opened 2 years ago

igorluppi commented 2 years ago

Describe the bug An APP cannot be correctly stopped by CFE_ES if the CFE_ES_StopAppCmd is sent in the same moment that the APP has a mutex locked. Although the app is terminated (with errors), when we try to restart it, it doesn't work.

To Reproduce Steps to reproduce the behavior:

  1. Create a mutex in SAMPLE_APP using OS_MutSemCreate
  2. Take the mutex in the SAMPLE_APP RunLoop using OS_MutSemTake
  3. Perform CFE_ES CMD CFE_ES_StopAppCmd using "SAMPLE_APP" as target
  4. See error CFE_ES_CleanupObjectCallback: Call to OSAL Delete Object (ID:327691) failed. RC=-6

Expected behavior

  1. CFE_ES should be able to unlock the mutexes in a target APP in order to close all its resources. 1.1. In case CFE_ES behavior can't do that, how can we handle apps that have mutexes and may be closed in the exactly same moment that its mutex is locked ("Taken") ?

System observed on:

Additional context An example: image

Reporter Info Igor Luppi

skliper commented 2 years ago

Possibly related: https://github.com/nasa/cFE/issues/701, https://github.com/nasa/osal/issues/470 and the fix in https://github.com/nasa/osal/pull/472. Although this change was for binary semaphores, may need a similar "fix" for a OS_MutSemTake.

igorluppi commented 2 years ago

The real problem is trying to create the mutex again when the APP is restarted. In function OS_MutSemCreate: return_code = OS_ObjectIdAllocateNew(LOCAL_OBJID_TYPE, sem_name, &local_id, &record); The return_code is not OS_SUCCESS, its value is OS_ERR_NAME_TAKEN, so basically as the APP did not close properly, the mutex still exists in cFE Table. Any way can we get around that?

skliper commented 2 years ago

There is OS_MutSemGetIdByName... although that doesn't solve the underlying delete issue.

skliper commented 2 years ago

Really the app should clean up after exiting it's run loop that at minimum releases the mutex. See HS as an example (not of mutext cleanup, bun unregistering callbacks and closing child tasks). The CFE_ES_StopAppCmd should cause the CFE_ES_RunLoop to return false and allow the app to clean up.

Runloop: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L120 Cleanup call: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L184 Cleanup implementation: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_custom.c#L138-L155

This approach is likely cleaner/safer than the possible side effects of OSAL trying to clean up and delete a locked mutex...

Basically you should never be able to return from the main app routine without unlocking the mutex. That's just bad manners.

igorluppi commented 2 years ago