nasa / cFE

The Core Flight System (cFS) Core Flight Executive (cFE)
Apache License 2.0

Mutex deadlock can cause cFE to hang indefinitely during shutdown (Linux/POSIX) #2433

Open LornDMiller opened 1 year ago

LornDMiller commented 1 year ago

Describe the bug

This may be more widespread than simply Software Bus, but Software Bus is where I've seen it the most.

The Software Bus function CFE_SB_BroadcastBufferToRoute locks global data via CFE_SB_LockSharedData, then may call OS_QueuePut, then eventually unlocks the global data via CFE_SB_UnlockSharedData. In OSAL, the OS_QueuePut call finds its way into mq_timedsend, and the locking and unlocking of global data is via pthread_mutex_lock and pthread_mutex_unlock.

During shutdown, the various apps are all cancelled. Again in OSAL, this resolves to pthread_cancel calls. This leads to apps terminating when they reach "cancellation points."

mq_timedsend is a cancellation point. pthread_mutex_lock is not a cancellation point.

When a task is in the process of sending a message and is cancelled, it may be terminated while it holds the SB Shared Data mutex. Occasionally this coincides with another task that is pending or is about to pend on that mutex. Any remaining tasks that pend on that mutex are then deadlocked. I have not yet identified why the abort at the end of CFE_PSP_Restart is not called, but the system hangs indefinitely.
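The failure mode can be reproduced outside cFE with plain pthreads. The following is a minimal sketch, not cFE or OSAL code: `cancel_demo_t`, `sender_task`, and `cancel_demo_run` are hypothetical names, and `sleep` stands in for `mq_timedsend` as the cancellation point. It assumes a default (non-robust) mutex, as in the unmodified OSAL implementation described above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

/* Hypothetical stand-in for the SB shared-data lock plus a start-up handshake. */
typedef struct
{
    pthread_mutex_t lock;    /* plays the role of the SB shared-data mutex */
    sem_t           holding; /* posted once the task owns the lock */
} cancel_demo_t;

/* Models a task that locks shared data and then blocks in a cancellation point. */
static void *sender_task(void *arg)
{
    cancel_demo_t *demo = arg;

    pthread_mutex_lock(&demo->lock);
    sem_post(&demo->holding);
    for (;;)
    {
        sleep(1); /* cancellation point; stands in for mq_timedsend */
    }
}

/* Returns 0 if the cancelled task left the mutex locked, -1 otherwise. */
int cancel_demo_run(void)
{
    cancel_demo_t demo;
    pthread_t     tid;
    void         *exit_value = NULL;
    int           trylock_rc;

    if (pthread_mutex_init(&demo.lock, NULL) != 0 || sem_init(&demo.holding, 0, 0) != 0)
    {
        return -1;
    }
    if (pthread_create(&tid, NULL, sender_task, &demo) != 0)
    {
        return -1;
    }

    /* Wait until the task owns the lock, then cancel it as shutdown would. */
    while (sem_wait(&demo.holding) == -1 && errno == EINTR)
    {
    }
    pthread_cancel(tid);
    pthread_join(tid, &exit_value);

    /* The owner is gone, but nothing released the mutex: trylock reports EBUSY.
     * A blocking pthread_mutex_lock here would never return. */
    trylock_rc = pthread_mutex_trylock(&demo.lock);

    sem_destroy(&demo.holding);
    /* demo.lock is deliberately not destroyed: it is still locked. */
    return (exit_value == PTHREAD_CANCELED && trylock_rc == EBUSY) ? 0 : -1;
}
```

Calling `cancel_demo_run()` from a test harness should return 0 on a POSIX system. The sketch uses `pthread_mutex_trylock` only so that it can report the stuck mutex instead of hanging.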

My current work-around is to modify OSAL's os-impl-mutex.c. In OS_MutSemCreate_Impl, I use pthread_mutexattr_setrobust to make all mutexes robust just before the call to pthread_mutex_init. In OS_MutSemTake_Impl, I check the return code for EOWNERDEAD and, if that is returned, call pthread_mutex_consistent to restore the mutex.
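A minimal standalone sketch of that work-around follows. It is not the actual OSAL patch: `robust_mutex_create` and `robust_mutex_take` are hypothetical helpers that only mirror what the modified OS_MutSemCreate_Impl and OS_MutSemTake_Impl are described as doing, and `sleep` again stands in for `mq_timedsend`.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

typedef struct
{
    pthread_mutex_t lock;    /* robust mutex under test */
    sem_t           holding; /* posted once the task owns the lock */
} robust_demo_t;

/* Hypothetical helper: create a mutex with the robust attribute set. */
static int robust_mutex_create(pthread_mutex_t *mutex)
{
    pthread_mutexattr_t attr;
    int                 rc = pthread_mutexattr_init(&attr);

    if (rc != 0)
    {
        return rc;
    }
    rc = pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    if (rc == 0)
    {
        rc = pthread_mutex_init(mutex, &attr);
    }
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* Hypothetical helper: lock, recovering if the previous owner died. */
static int robust_mutex_take(pthread_mutex_t *mutex, int *recovered)
{
    int rc = pthread_mutex_lock(mutex);

    *recovered = 0;
    if (rc == EOWNERDEAD)
    {
        /* The caller now owns the lock; mark it usable again. */
        *recovered = 1;
        rc = pthread_mutex_consistent(mutex);
    }
    return rc;
}

/* Locks the mutex, then blocks in a cancellation point until cancelled. */
static void *doomed_task(void *arg)
{
    robust_demo_t *demo = arg;

    pthread_mutex_lock(&demo->lock);
    sem_post(&demo->holding);
    for (;;)
    {
        sleep(1); /* cancellation point; stands in for mq_timedsend */
    }
}

/* Returns 0 if the lock was recovered after its owner was cancelled. */
int robust_demo_run(void)
{
    robust_demo_t demo;
    pthread_t     tid;
    int           recovered = 0;
    int           rc;

    if (robust_mutex_create(&demo.lock) != 0 || sem_init(&demo.holding, 0, 0) != 0)
    {
        return -1;
    }
    if (pthread_create(&tid, NULL, doomed_task, &demo) != 0)
    {
        return -1;
    }
    while (sem_wait(&demo.holding) == -1 && errno == EINTR)
    {
    }
    pthread_cancel(tid);
    pthread_join(tid, NULL);

    rc = robust_mutex_take(&demo.lock, &recovered);
    if (rc == 0)
    {
        pthread_mutex_unlock(&demo.lock);
        pthread_mutex_destroy(&demo.lock);
    }
    sem_destroy(&demo.holding);
    return (rc == 0 && recovered == 1) ? 0 : -1;
}
```

`robust_demo_run()` should return 0 on a system that supports robust mutexes. Note that `pthread_mutex_consistent` only restores the mutex itself; whatever data the dead owner was modifying may still be half-updated, which is likely acceptable during shutdown but would need more care elsewhere.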

To Reproduce

Steps to reproduce the behavior: This may be difficult to reproduce without a lot of Software Bus traffic. With enough software bus traffic, simply restarting the system should be sufficient to trigger this eventually. Unfortunately this is a race condition that is not easily triggered.

Expected behavior

On shutdown, all tasks terminate.

Code snips

cFE/modules/sb/fsw/src/cfe_sb_api.c line 1548 is the call to CFE_SB_LockSharedData that could trigger the deadlock.

cFE/modules/sb/fsw/src/cfe_sb_api.c line 1605 (the call to OS_QueuePut) is a cancellation point reached while that same function holds the global data mutex.

osal/src/os/posix/src/os-impl-queues.c line 305 is the actual call to mq_timedsend.

osal/src/os/posix/src/os-impl-mutex.c line 179 is the pthread_mutex_lock that ultimately blocks and deadlocks a subsequent task.

System observed on: Ubuntu 22.04. Analysis indicates any POSIX system would be vulnerable. I have not evaluated vulnerability for other platforms.

Additional context

This may require coordination with the OSAL project. This ticket may be more appropriate for the OSAL team.

Reporter Info

Lorn Miller, Red Canyon Engineering & Software

irowebbn commented 11 months ago

I have encountered a similar problem where the software bus appears to have a race condition. Here are the system log messages it emits:

CFE_SB_UnlockSharedData: SharedData Mutex Give Err Stat=-6,App=1114127,Func=CFE_SB_ReceiveBuffer,Line=1892
CFE_SB_UnlockSharedData: SharedData Mutex Give Err Stat=-6,App=1114127,Func=CFE_SB_ReceiveBuffer,Line=2005