Closed jphickey closed 3 years ago
Discovered another race condition in here - this uses the same TCP port number as the "network-api-test" does. This means when running ctest with a "-j8" option (parallel) that it might run at the same time as network-api-test, which causes it to fail.
Still not clear on why this wasn't part of the network-api-test as I originally suggested - which would have avoided this - but nonetheless it has been implemented separately so this becomes a race condition issue. Should use a different port number for this test so it doesn't interfere with the other test.
It was not put into the network api test because not all the select tests use network features. As for selecting a new port number, is there a list of all the ports that are already being used or do I just have to search through the tests to find them.
Describe the bug Running the OSAL select test, I ran into a deadlock situation where the "multi" test got stuck and never finished.
To Reproduce Hit or miss... Run test repeatedly on a system with other loads (e.g. parallel builds)
Expected behavior Test should complete
Code snips Checking the test status/backtrace it looks like two tasks (main + "Server_Fn") are waiting on the binary sem. In particular the Server_Fn is stuck here:
https://github.com/nasa/osal/blob/d698a4d7ddc66826a4c7c287468b5f1aa6ca372f/src/tests/select-test/select-test.c#L162
While the main task is waiting in the teardown code (the
TestSelectMultipleRead
has completed, and it has invokedTeardown_Multi
which in turn invokesTeardown_Single
here):https://github.com/nasa/osal/blob/d698a4d7ddc66826a4c7c287468b5f1aa6ca372f/src/tests/select-test/select-test.c#L273
System observed on: Ubuntu 20.04
Additional context This is likely related to the use of
OS_BinSemFlush
. We should probably deprecate this function, as I cannot see how this can ever be used safely without it being a race condition. VxWorks offers it which (I think) is why OSAL also offers it, but its a fundamentally broken concept.I can confirm that looking at the traceback in gdb, the flush_count is indeed already 1 - meaning the flush had already happened by the time the Server_Fn entered the bin sem take routine.
Reporter Info Joseph Hickey, Vantage Systems, Inc.