nasa / osal

The Core Flight System (cFS) Operating System Abstraction Layer (OSAL)
Apache License 2.0
545 stars 213 forks source link

Race condition in "select" test #707

Closed jphickey closed 3 years ago

jphickey commented 3 years ago

Describe the bug Running the OSAL select test, I ran into a deadlock situation where the "multi" test got stuck and never finished.

To Reproduce Hit or miss... Run test repeatedly on a system with other loads (e.g. parallel builds)

Expected behavior Test should complete

Code snips Checking the test status/backtrace it looks like two tasks (main + "Server_Fn") are waiting on the binary sem. In particular the Server_Fn is stuck here:

https://github.com/nasa/osal/blob/d698a4d7ddc66826a4c7c287468b5f1aa6ca372f/src/tests/select-test/select-test.c#L162

While the main task is waiting in the teardown code (the TestSelectMultipleRead has completed, and it has invoked Teardown_Multi which in turn invokes Teardown_Single here):

https://github.com/nasa/osal/blob/d698a4d7ddc66826a4c7c287468b5f1aa6ca372f/src/tests/select-test/select-test.c#L273

System observed on: Ubuntu 20.04

Additional context This is likely related to the use of OS_BinSemFlush. We should probably deprecate this function, as I cannot see how this can ever be used safely without it being a race condition. VxWorks offers it which (I think) is why OSAL also offers it, but its a fundamentally broken concept.

I can confirm that looking at the traceback in gdb, the flush_count is indeed already 1 - meaning the flush had already happened by the time the Server_Fn entered the bin sem take routine.

Reporter Info Joseph Hickey, Vantage Systems, Inc.

jphickey commented 3 years ago

Discovered another race condition in here - this uses the same TCP port number as the "network-api-test" does. This means when running ctest with a "-j8" option (parallel) that it might run at the same time as network-api-test, which causes it to fail.

Still not clear on why this wasn't part of the network-api-test as I originally suggested - which would have avoided this - but nonetheless it has been implemented separately so this becomes a race condition issue. Should use a different port number for this test so it doesn't interfere with the other test.

zanzaben commented 3 years ago

It was not put into the network api test because not all the select tests use network features. As for selecting a new port number, is there a list of all the ports that are already being used or do I just have to search through the tests to find them.