nasa / fprime

F´ - A flight software and embedded systems framework
https://fprime.jpl.nasa.gov
Apache License 2.0
9.95k stars 1.28k forks source link

Fix failures of tcp UT #2709

Closed JohanBertrand closed 1 month ago

JohanBertrand commented 2 months ago
Related Issue(s) https://github.com/nasa/fprime/issues/2706
Has Unit Tests (y/n) Y
Documentation Included (y/n) Y

Change Description

Fix issue where the TCP client/server would fail to connect to a socket, generating an error and making the UT to fail.

Partially solves https://github.com/nasa/fprime/issues/2706, as this does not solve the warnings/silent failures

Rationale

The TCP unit tests should pass in a reliable manner. This also can affect the flight code if the "bind to 0" strategy is used to select a port.

LeStarch commented 1 month ago

I studied up on TCP and this is what I was able to deduce:

  1. TIME_WAITs happen on the initiator of the close/shutdown call. Thus clients should always initiate the shutdown
  2. close and/or shutdown is asynchronous unless the SO_LINGER option is set.

I think that means the following should be done to the UTs:

  1. Ensure the client always closes first and
  2. Make sure to use SO_LINGER on UT sockets. This will ensure that things are closed fully before reuse.

Since the error originates from the find port helper, it might be sufficient to just apply number 2 to the port bound there.

This work won't hurt, but may be masking the underlying issue. Thoughts?

Thanks for your patience while I study up!

JohanBertrand commented 1 month ago

Thanks for the details. I just gave it a try. I set the reuse_address parameters to false (their default values), and set SO_LINGER on the port helper function, but I still get the same errors:

/home/fprime/Drv/TcpClient/test/ut/TcpClientTester.cpp:40: Failure
Expected equality of these values:
  serverStat
    Which is: -9
  SOCK_SUCCESS
    Which is: 0
TCP server startup error: Address already in use

I pushed the changes so that you can give it a try on your side. I agree with your analysis, but I don't understand why we're having such issues even with SO_LINGER activated and set to 0.

I think that we should go ahead with the reuse_address option, either as a function parameter, or as something fixed in the code. I'm not sure if there is a real drawback of having it activated by default.

LeStarch commented 1 month ago

The issue I see is that setting SO_LINGER to 0 has almost no effect. This is because that value is the timeout, and a timeout of zero will exit immediately yielding the same problem. I have pushed a fix that sets it to 30.

JohanBertrand commented 1 month ago

I have exactly the same error messages on my side with the timeout of 30

LeStarch commented 1 month ago

@JohanBertrand can you remind me again what your testing environment is? Is it a docker container? Any special networking? Host machine type?

I have yet to be able to reproduce the issue to have the failures appear reliably....but I want to get something to reproduce it.

LeStarch commented 1 month ago

@JohanBertrand As a reminder, this is an open forum. So keep the information succinct.

  1. native, VM, docker
  2. Host OS (mac/darwin/win), Guest OS (linux/other)
  3. Any special networking config
JohanBertrand commented 1 month ago

I am currently running an ubuntu:20.04 docker instance on WSL2, with the script mentioned in https://github.com/nasa/fprime/issues/2706

I don't have any special network settings on the docker

Usually, if you run the test about 20 times, you should get at least one test failing

LeStarch commented 1 month ago

First of all, thank you. That was key-information and I can now reliably reproduce the problem. I had a thought on how to fix the issue altogether.

In short:

  1. Allow TcpServer to accept port "0" (choose a port)
  2. Allow TcpServer to return the port hosted at
  3. Update the test to skip port-selector altogether, and just choose any port and reuse that in the clients.

I'll take a crack at this tomorrow. In the meantime, we should determine if we want to keep reuse around (assuming this pans out).

LeStarch commented 1 month ago

@JohanBertrand I think I finally figured out some answers! Why doesn't the original code work? Why do the things that "should" fix the problem not fix the problem? Why does logic seem to fail when discussing theses issues? Why did port-selector need to recurse to avoid "root only" ports in a non-root program?

Here is the answer: notice the asymmetry https://github.com/nasa/fprime/blob/0fe467c6e21aad321a0f1c737576733aac582443/Drv/Ip/test/ut/PortSelector.cpp#L23

when compared against: https://github.com/nasa/fprime/blob/0fe467c6e21aad321a0f1c737576733aac582443/Drv/Ip/test/ut/PortSelector.cpp#L41

There is no network byte swap when reading back the port. This means that we might select a "good" port, and then byte-swap it into a bad port, root port, reserved port, etc.

  1. Why doesn't the original code work? The detected good port is corrupted before use, loosing its guarantee of availability.
  2. Why do things that should fix the issue fail? We were using SO_LINGER and other mechanisms on a port that was the pre-corrruption.
  3. Why does logic fail? Because logic was based on a bad understanding of which port was really used.
  4. Why could we get root only ports? Because post-swap could enter reserved space

I have addressed this, and added the capability for TcpServer and UdpRecv to use "0" (Os chooses a port) in this PR: https://github.com/nasa/fprime/pull/2739.

Would you mind reviewing?

JohanBertrand commented 1 month ago

Awesome! Thanks you for looking into that and for providing those details.

I'm going to look into the fix in the PR.

FYI, we might still need to have a recursive call to the get_free_port function to be able to find two ports which are different for the UDP test (See the changes in this PR)

LeStarch commented 1 month ago

@JohanBertrand since we found the root cause, I am going to close this PR. If you feel there is still need for reconnect behavior, we can track that as a dedicated issue.