Closed JohanBertrand closed 1 month ago
I studied up on TCP and this is what I was able to deduce:
I think that means the following should be done to the UTs:
Since the error originates from the find port helper, it might be sufficient to just apply number 2 to the port bound there.
This work won't hurt, but may be masking the underlying issue. Thoughts?
Thanks for your patience while I study up!
Thanks for the details. I just gave it a try. I set the reuse_address parameters to false (their default values) and set SO_LINGER on the port helper function, but I still get the same errors:
/home/fprime/Drv/TcpClient/test/ut/TcpClientTester.cpp:40: Failure
Expected equality of these values:
serverStat
Which is: -9
SOCK_SUCCESS
Which is: 0
TCP server startup error: Address already in use
I pushed the changes so that you can give it a try on your side. I agree with your analysis, but I don't understand why we're having such issues even with SO_LINGER activated and set to 0.
I think that we should go ahead with the reuse_address option, either as a function parameter or as something fixed in the code.
I'm not sure if there is a real drawback of having it activated by default.
The issue I see is that setting SO_LINGER to 0 has almost no effect. This is because that value is the timeout, and a timeout of zero will exit immediately, yielding the same problem. I have pushed a fix that sets it to 30.
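For readers following along, here is a minimal sketch of how the option discussed above might be set and read back on a POSIX socket. The helper names `set_linger` and `get_linger_seconds` are hypothetical and are not taken from the F Prime sources; the actual change in the pushed commit may look different.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper (not part of F Prime): enable SO_LINGER on a socket
// with the given timeout in seconds. Returns true when setsockopt succeeds.
bool set_linger(int fd, int seconds) {
    struct linger opt;
    opt.l_onoff = 1;         // turn lingering on
    opt.l_linger = seconds;  // timeout value applied when the socket is closed
    return ::setsockopt(fd, SOL_SOCKET, SO_LINGER, &opt, sizeof(opt)) == 0;
}

// Hypothetical helper: read the linger timeout back.
// Returns -1 when the call fails or lingering is disabled.
int get_linger_seconds(int fd) {
    struct linger opt;
    socklen_t len = sizeof(opt);
    if (::getsockopt(fd, SOL_SOCKET, SO_LINGER, &opt, &len) != 0 || opt.l_onoff == 0) {
        return -1;
    }
    return opt.l_linger;
}
```

Calling `set_linger(fd, 30)` on a freshly created socket corresponds to the timeout of 30 mentioned above, assuming the option is applied to the socket opened by the port helper.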
I have exactly the same error messages on my side with the timeout of 30.
@JohanBertrand can you remind me again what your testing environment is? Is it a docker container? Any special networking? Host machine type?
I have yet to reproduce the issue in a way that makes the failures appear reliably, but I want to get something that reproduces it.
@JohanBertrand As a reminder, this is an open forum. So keep the information succinct.
I am currently running an ubuntu:20.04 docker instance on WSL2, with the script mentioned in https://github.com/nasa/fprime/issues/2706
I don't have any special network settings on the docker
Usually, if you run the test about 20 times, you should get at least one test failing
First of all, thank you. That was key information, and I can now reliably reproduce the problem. I had a thought on how to fix the issue altogether.
In short:
I'll take a crack at this tomorrow. In the meantime, we should determine if we want to keep reuse around (assuming this pans out).
@JohanBertrand I think I finally figured out some answers! Why doesn't the original code work? Why do the things that "should" fix the problem not fix the problem? Why does logic seem to fail when discussing these issues? Why did port-selector need to recurse to avoid "root only" ports in a non-root program?
Here is the answer: notice the asymmetry https://github.com/nasa/fprime/blob/0fe467c6e21aad321a0f1c737576733aac582443/Drv/Ip/test/ut/PortSelector.cpp#L23
when compared against: https://github.com/nasa/fprime/blob/0fe467c6e21aad321a0f1c737576733aac582443/Drv/Ip/test/ut/PortSelector.cpp#L41
There is no network byte swap when reading back the port. This means that we might select a "good" port, and then byte-swap it into a bad port, root port, reserved port, etc.
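To make the effect concrete, here is a small sketch of the asymmetry. The function names are hypothetical and this is not the PortSelector.cpp code; it only assumes that the port is stored in network byte order (via `htons`) and must be converted back with `ntohs` before being used as a host-order number.

```cpp
#include <arpa/inet.h>
#include <cstdint>

// Hypothetical illustration: the port is converted to network byte order when
// stored, but read back WITHOUT the matching ntohs(). On a little-endian host
// the two bytes come back swapped.
uint16_t port_read_without_swap(uint16_t host_port) {
    uint16_t wire = htons(host_port);  // value as held in sockaddr_in::sin_port
    return wire;                       // bug: network-order value used as host-order
}

// Symmetric version: every htons() is paired with an ntohs() on the way back.
uint16_t port_read_with_swap(uint16_t host_port) {
    uint16_t wire = htons(host_port);
    return ntohs(wire);
}
```

On a little-endian machine, a port such as 53251 (0xD003) read back without the swap becomes 976 (0x03D0), which is below 1024 and therefore a privileged port. On a big-endian machine both functions return the same value, so the bug would stay hidden there.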
I have addressed this and added the capability for TcpServer and UdpRecv to use "0" (OS chooses a port) in this PR: https://github.com/nasa/fprime/pull/2739.
Would you mind reviewing?
Awesome! Thank you for looking into that and for providing those details.
I'm going to look into the fix in the PR.
FYI, we might still need to have a recursive call to the get_free_port function to be able to find two ports which are different for the UDP test (see the changes in this PR).
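As a rough sketch of that idea, the snippet below binds to port 0 so the OS picks a port, reads it back with the byte swap applied, and retries a bounded number of times until a second, different port is found. `pick_free_port` and `pick_two_ports` are hypothetical stand-ins, not the get_free_port helper from the PR, and a loop is used here in place of recursion.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the test helper: bind to port 0 on loopback so the
// OS chooses a port, read it back in host byte order, then close the socket.
// Returns 0 on failure.
uint16_t pick_free_port(int sock_type) {
    int fd = ::socket(AF_INET, sock_type, 0);
    if (fd < 0) {
        return 0;
    }
    sockaddr_in addr;
    std::memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);  // 0 asks the OS to choose
    socklen_t len = sizeof(addr);
    uint16_t port = 0;
    if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        ::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) == 0) {
        port = ntohs(addr.sin_port);  // swap back to host order
    }
    ::close(fd);
    return port;
}

// Ask for a second port until it differs from the first (bounded retries).
bool pick_two_ports(int sock_type, uint16_t& first, uint16_t& second) {
    first = pick_free_port(sock_type);
    if (first == 0) {
        return false;
    }
    for (int attempt = 0; attempt < 16; ++attempt) {
        second = pick_free_port(sock_type);
        if (second != 0 && second != first) {
            return true;
        }
    }
    return false;
}
```

For the UDP test this would be called as `pick_two_ports(SOCK_DGRAM, a, b)`. Because each socket is closed before the port is used, another process could still claim the port in between, so this only reduces the chance of a collision.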
@JohanBertrand since we found the root cause, I am going to close this PR. If you feel there is still need for reconnect behavior, we can track that as a dedicated issue.
Change Description
Fix an issue where the TCP client/server would fail to connect to a socket, generating an error and causing the UT to fail.
Partially solves https://github.com/nasa/fprime/issues/2706, as this does not solve the warnings/silent failures
Rationale
The TCP unit tests should pass in a reliable manner. This also can affect the flight code if the "bind to 0" strategy is used to select a port.