nasa / osal

The Core Flight System (cFS) Operating System Abstraction Layer (OSAL)
Apache License 2.0
546 stars 213 forks source link

bin sem test #1261

Closed thesamprice closed 2 years ago

thesamprice commented 2 years ago

Describe the bug Working through the unit tests for a qt port. I hit a stumbling block on bin-sem-test.c function BinSemCheck.

When a task is deleted it may have registered callbacks with resources. In case of qt the task with a binary semaphore (QSemaphore) in a waiting state that gets deleted. Ends up stalling the system when that QSemaphore is written to.

From what i gather free-rtos has a similar issue.

https://www.freertos.org/FreeRTOS_Support_Forum_Archive/August_2016/freertos_Releasing_Resources_on_Task_Exit_Delete_b113e7a9j.html

Releasing Resources on Task Exit/Delete Posted by rtel on August 12, 2016 If a semaphore is held by a task, and that task gets deleted, then there is nothing in the code that will automatically release the semaphore.

Looking at the source code I don’t think there is an easy way around this. The mutex can be reset by passing its handle into xQueueReset(), but then to make it a mutex rather than a queue you would need to call prvInitialiseMutex() too – and that function is not publicly accessible.

Perhaps, if you are 100% sure there are no other tasks blocked on the mutex, it could be deleted then re-created?


void BinSemCheck(void)
{
    uint32            status;
    OS_bin_sem_prop_t bin_sem_prop;

    memset(&bin_sem_prop, 0, sizeof(bin_sem_prop));

    /* Delete the task, which should be pending in OS_BinSemTake() */
    status = OS_TaskDelete(task_1_id);
    UtAssert_True(status == OS_SUCCESS, "OS_TaskDelete Rc=%d", (int)status);

    status = OS_TimerDelete(timer_id);
    UtAssert_True(status == OS_SUCCESS, "OS_TimerDelete Rc=%d", (int)status);

    OS_TaskDelay(100);

    /* Confirm that the semaphore itself is still operational after task deletion */
    status = OS_BinSemGive(bin_sem_id); /* task 1 was pending on this, releasing this semaphore triggers task1 to wake up, but stalls out as it has been deleted */
    UtAssert_True(status == OS_SUCCESS, "BinSem give Rc=%d", (int)status);

A ctrl+c shows backtrace as follows.

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff207bf4ba libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff207f02ab libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_wait + 76
    frame #2: 0x00007fff207ee192 libsystem_pthread.dylib`_pthread_mutex_firstfit_lock_slow + 204
    frame #3: 0x0000000100185e6c QtCore`QWaitCondition::wakeAll() + 18
    frame #4: 0x0000000100183064 QtCore`QSemaphore::release(int) + 48
    frame #5: 0x000000010001236e bin-sem-test`::OS_BinSemGive_Impl(token=0x00007ffeefbfe918) at os-impl-binsem.cxx:247:19
    frame #6: 0x0000000100007f06 bin-sem-test`OS_BinSemGive(sem_id=262145) at osapi-binsem.c:168:23
    frame #7: 0x0000000100004e50 bin-sem-test`BinSemCheck at bin-sem-test.c:175:14

System observed on:

Additional context I subclassed my qt thread from the qt class, and have been using the terminate call. I'll try a different method to see if that allows qt to cleanup semaphores from deleted tasks.

Reporter Info Full name and company/organization if applicable

jphickey commented 2 years ago

Yes, there have been other observed corner cases when deleting a task. In the pthread implementation, this uses condition variables, and pthread_cond_wait is a cancellation point. There is a handler installed via "pthread_cleanup_push" that releases the mutex in the event that the task is deleted while pending on the sem. Sounds like in your Qt port this isn't working the same way.

Make sure you are using something equivalent to "pthread_cleanup_push" ... this is actually part of the POSIX spec that says the mutex is re-locked if a task is canceled during the call to pthread_cond_wait, that this is intended behavior that the mutex will be left in a locked state after task deletion. One must use a cleanup handler to unlock it.

That being said, this is just one example. There are other cases where it is unclear if/when a resource is held by a task and whether that resource (count sem, mutex, etc) should be returned/reset when the task is deleted. There just isn't one right answer, and any answer is subject to possible race conditions.

It may be that this test is checking something that won't work on Qt port if it doesn't have a proper underlying mechanism to handle it. Might have to disable this test with that caveat.

In short, suffice to say that deleting tasks is hard ... better to design the software to have the task self-exit when done, or better yet just "park" idle worker tasks and not delete them. Most FSW does not delete tasks for this reason, so it might not be that big of a deal to ignore this failure.

thesamprice commented 2 years ago

@jphickey Thanks for the info.

For disabling the test I added a test_1_running volatile flag, and the following logic. Is there an appropriate config file to specify if the OS can handle doing cleanup on task deletions? and then add appropriate logic if the OS cleans up on task deletes. A desired name for this?

Something like this lets my test work.

    /* Delete the task, which should be pending in OS_BinSemTake() */
    if(OS_TASK_DELETE_SELFCLEANS == 0){
        task_1_running = 0;
        OS_TaskDelay(100); /* Allow task_1 to gracefully stop */
   }

    status = OS_TaskDelete(task_1_id);
    UtAssert_True(status == OS_SUCCESS, "OS_TaskDelete Rc=%d", (int)status);

    status = OS_TimerDelete(timer_id);
    UtAssert_True(status == OS_SUCCESS, "OS_TimerDelete Rc=%d", (int)status);

    OS_TaskDelay(100);

    status = OS_BinSemGetInfo(bin_sem_id, &bin_sem_prop);
    UtAssert_True(status == OS_SUCCESS, "BinSem value=%d Rc=%d", (int)bin_sem_prop.value, (int)status);

    /* if failures occur, do not loop endlessly */
    while (task_1_failures < 20 && task_1_running == 1)
    {

        status = OS_BinSemTake(bin_sem_id);
        if (status != OS_SUCCESS)
        {
jphickey commented 2 years ago

Not sure, this seems to be an oddball case because the 3 supported OS's (POSIX, RTEMS, VxWorks) all pass this test. If the Qt cannot pass it, then any such workaround to the functional test should probably be limited to that port. In a way this also helps to document what is not working correctly in the port.

If Qt does not offer something akin to pthread_cleanup_push, then maybe disable cancellation during the mutex operation and re-enable cancellation once it has released the mutex normally? That would at least ensure it doesn't get canceled in the middle.

thesamprice commented 2 years ago

Under the hood Qt calls pthread_cleanup_push when starting a thread, but I dont see this callback when taking a semaphore.

Ill go ahead and just add it to the osal_config.ini / osal_config.h

QT posix code calls pthread_cleanup_push when starting a thread

osal code calls pthread_cleanup_push when locking a semaphore.

https://github.com/nasa/osal/blob/88f72d788d33a12e7bf9d9f01c77346438730ffa/src/os/posix/src/os-impl-binsem.c#L392