Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG

igorluppi commented 2 years ago

Describe the bug An APP cannot be correctly stopped by CFE_ES if the CFE_ES_StopAppCmd is sent in the same moment that the APP has a mutex locked. Although the app is terminated (with errors), when we try to restart it, it doesn't work.

To Reproduce Steps to reproduce the behavior:

Create a mutex in SAMPLE_APP using OS_MutSemCreate
Take the mutex in the SAMPLE_APP RunLoop using OS_MutSemTake
Perform CFE_ES CMD CFE_ES_StopAppCmd using "SAMPLE_APP" as target
See error CFE_ES_CleanupObjectCallback: Call to OSAL Delete Object (ID:327691) failed. RC=-6

Expected behavior

CFE_ES should be able to unlock the mutexes in a target APP in order to close all its resources. 1.1. In case CFE_ES behavior can't do that, how can we handle apps that have mutexes and may be closed in the exactly same moment that its mutex is locked ("Taken") ?

System observed on:

Ubuntu 18.04 LTS
Version: cFS Caelum Release Candidate 4 (Also tested with Aquila and cFS Bootes Release Candidate 2)

Additional context An example:

Reporter Info Igor Luppi

skliper commented 2 years ago

Possibly related: https://github.com/nasa/cFE/issues/701, https://github.com/nasa/osal/issues/470 and the fix in https://github.com/nasa/osal/pull/472. Although this change was for binary semaphores, may need a similar "fix" for a OS_MutSemTake.

igorluppi commented 2 years ago

The real problem is trying to create the mutex again when the APP is restarted. In function OS_MutSemCreate: return_code = OS_ObjectIdAllocateNew(LOCAL_OBJID_TYPE, sem_name, &local_id, &record); The return_code is not OS_SUCCESS, its value is OS_ERR_NAME_TAKEN, so basically as the APP did not close properly, the mutex still exists in cFE Table. Any way can we get around that?

The mutex name exists in cFE table, but we don't have the id_num to delete it or recover it. Do we?

skliper commented 2 years ago

There is OS_MutSemGetIdByName... although that doesn't solve the underlying delete issue.

skliper commented 2 years ago

Really the app should clean up after exiting it's run loop that at minimum releases the mutex. See HS as an example (not of mutext cleanup, bun unregistering callbacks and closing child tasks). The CFE_ES_StopAppCmd should cause the CFE_ES_RunLoop to return false and allow the app to clean up.

Runloop: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L120 Cleanup call: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_app.c#L184 Cleanup implementation: https://github.com/nasa/HS/blob/c5ef5acf65ca346270f721e22620e9321a930075/fsw/src/hs_custom.c#L138-L155

This approach is likely cleaner/safer than the possible side effects of OSAL trying to clean up and delete a locked mutex...

Basically you should never be able to return from the main app routine without unlocking the mutex. That's just bad manners.

igorluppi commented 2 years ago

Any way can we get around that? Answering my own question, using count semaphore with sem_initial_value = 1 works. It's a mutex but no error is raised on the release of the mutex if the releasing task is not its owner, as the true pthread_mutex implementation does. Moreover, the binary semaphore not only locks in the same situation, but also deadlock cFE service as well, avoid that. I was not able to finish the software with SIGINT (ctrl-c) after trying to stop an APP with a locked binary semaphore.

nasa / cFE

Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG #2107