haskell-distributed / network-transport-tcp

TCP Realisation of Network.Transport
http://haskell-distributed.github.com
BSD 3-Clause "New" or "Revised" License

apiCloseEndPoint blocks until no longer receiving #49

Closed avieth closed 7 years ago

avieth commented 7 years ago

When testing components built atop network-transport-tcp, it's sometimes desirable to have each test case create an end point at the beginning of the test and close it at the end, with all test cases sharing the same transport. If we run such a test repeatedly, as is typical when quick-checking, for instance, then we risk undefined behaviour: when the next test case runs, the end point from the previous test case, although closed, may still be receiving on one of its sockets. The sockets have all been closed and their file descriptors freed up, so a stray recv can land on a descriptor that has since been reused. This can be catastrophic: the stray thread can steal recvs from the next end point's sockets, completely breaking the protocol and, for instance, causing the end point to receive an incorrect length prefix.

An example that I observed

When the next end point opens a heavyweight connection, the stray thread may read the end point identifier (int32), causing the new end point to read the length prefix of the end point address as the end point id, and the first 4 bytes of the end point address as the length of the end point address itself. That will either cause it to get the wrong end point address, or as I observed, to hang indefinitely because those first 4 bytes are a very large number.

Solution

I think this patch solves the issue, but this is about concurrency so I'm always cautious. Please take a look. I have a test case like what I described above: client and server create new end points at each iteration, exchange a send, then close up. Without this patch, it'll typically freeze (but not always!). With this patch, it seems never to do so, but the tests run quite a bit more slowly, and that makes sense: closeEndPoint now has to wait for its sockets' handler threads to finish.
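For illustration, here is a minimal sketch of that kind of loop against the plain Network.Transport API (it is not the actual test from this PR; the name stressEndPoints and the details are made up): each iteration creates fresh end points on the shared transport, exchanges one send, and closes both end points again.

import Control.Monad (replicateM_)
import qualified Data.ByteString.Char8 as BS
import Network.Transport

-- Repeatedly create a pair of end points, exchange one message, close both.
stressEndPoints :: Transport -> Int -> IO ()
stressEndPoints transport iterations =
  replicateM_ iterations $ do
    Right server <- newEndPoint transport
    Right client <- newEndPoint transport
    Right conn   <- connect client (address server) ReliableOrdered defaultConnectHints
    Right ()     <- send conn [BS.pack "ping"]
    ConnectionOpened _ _ _ <- receive server
    Received _ _           <- receive server
    close conn
    -- Without the patch, closeEndPoint may return while a thread is still
    -- receiving on the underlying socket, poisoning the next iteration.
    closeEndPoint client
    closeEndPoint server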

To start, take a look at forkServer. It's been changed so that you can give a choice of handlers: a Right if you want forkServer to take care of closing the socket when your handler is done, and a Left for the old behaviour. Both are there because the test cases rely on the old behaviour.

The forkServer used by createTransport will use Right. This means that forkServer will fork a new thread for each connection, and run handleConnectionRequest, which no longer forks handleIncomingMessages. Instead, it finishes when handleIncomingMessages finishes. That's the action which repeatedly recvs on the socket, so it's good to know that the socket will be closed only when that action has finished (and no more recvs will happen).
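As a rough sketch of that flow (assumed names, heavily simplified from the actual patch): the accept loop forks one thread per connection, and the socket is closed only once the handler, and hence the receive loop inside it, has returned.

import Control.Concurrent (forkIO)
import Control.Exception (finally)
import Control.Monad (forever)
import Network.Socket (Socket, accept, close)

-- One thread per accepted connection; the socket is closed only after the
-- handler (which runs the receive loop to completion) has finished.
acceptLoop :: Socket -> (Socket -> IO ()) -> IO ()
acceptLoop listener handleConnectionRequest = forever $ do
  (sock, _peer) <- accept listener
  _ <- forkIO (handleConnectionRequest sock `finally` close sock)
  return ()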

The handler is also given an IO () which will return immediately after the socket is indeed closed. This is put into the ValidRemoteEndPointState and apiCloseEndPoint makes use of it to ensure that it returns only when every remote end point's socket has been closed.

The other point at which a socket is created, socketToEndPoint, implements the same scheme: if a socket is created then there's also an IO () which returns when that socket has been closed (when handleIncomingMessages has finished).
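A hedged sketch of that signal (assumed helper name; the real patch threads the action through ValidRemoteEndPointState): the IO () can simply be a readMVar on an MVar that is filled right after the socket is closed.

import Control.Concurrent.MVar (newEmptyMVar, putMVar, readMVar)
import Control.Exception (finally)
import Network.Socket (Socket, close)

-- Run a handler, giving it an action that returns only once the socket has
-- really been closed; apiCloseEndPoint can wait on that action for every
-- remote end point before it returns.
withClosedSignal :: Socket -> (IO () -> IO ()) -> IO ()
withClosedSignal sock handler = do
  closed <- newEmptyMVar
  handler (readMVar closed)
    `finally` (close sock >> putMVar closed ())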

facundominguez commented 7 years ago

The commit message does a good job of explaining the problem. I think it will help the review if we also have an explanation of the solution.

avieth commented 7 years ago

The commit message does a good job of explaining the problem. I think it will help the review if we also have an explanation of the solution.

Fair point :) Tests are failing so I'm working on that. Will update the description once I'm finished.

avieth commented 7 years ago

OK tests are passing, description updated.

qnikst commented 7 years ago

What are the tests that rely on the old behavior? If I understand correctly, in the old behavior we can access a socket from several threads after close, and we can also close the socket from different places. If that is the case, I'd rather keep only the new behavior, because the old one is really broken, and not just for the tests.

avieth commented 7 years ago

What are the tests that rely on the old behavior? If I understand correctly, in the old behavior we can access a socket from several threads after close, and we can also close the socket from different places. If that is the case, I'd rather keep only the new behavior, because the old one is really broken, and not just for the tests.

I think it's just testUnidirectionalError. Here's a comment from it:

We accept connections, but when an exception occurs we don't do anything (in particular, we don't close the socket). This is important because when we shutdown one direction of the socket a recv here will fail, but we don't want to close that socket at that point (which would shutdown the socket in the other direction).

qnikst commented 7 years ago

Oh... I think I'll leave the decision to @facundominguez. But I'm not sure I'm happy with the old situation. Even if we do not close the socket (as in the Right branch), the n-t-tcp reader threads should be careful enough not to read from sockets that don't belong to them.

avieth commented 7 years ago

Bonus!

This patch eliminates one very easy denial-of-service attack.

-- Attacker.hs
import Network.Socket

main = do
  sock <- socket AF_INET Stream defaultProtocol
  connect sock (SockAddrInet 7777 (tupleToHostAddress (127, 0, 0, 1)))

  -- At this point, the network transport running at 127.0.0.1:7777 will not
  -- accept any connections from any other clients. Its server thread is
  -- waiting for this socket to give more data before it forks.
  putStrLn "Press any key to stop the attack"
  getChar

  close sock

But it also reveals another denial-of-service attack: repeatedly connect and hold the socket open until the service runs out of file handles and therefore closes down (accept throws an exception), killing service to all other clients.
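A hypothetical sketch of that second attack (again assuming a transport listening on 127.0.0.1:7777; in practice the attacker would also raise its own file-descriptor limit or use several processes):

-- Attacker2.hs (hypothetical)
import Control.Monad (forM)
import Network.Socket

main :: IO ()
main = do
  -- Open lots of connections and hold them, consuming one file descriptor
  -- on the server per connection until accept starts failing.
  socks <- forM [1 .. 10000 :: Int] $ \_ -> do
    sock <- socket AF_INET Stream defaultProtocol
    connect sock (SockAddrInet 7777 (tupleToHostAddress (127, 0, 0, 1)))
    return sock
  putStrLn "Press any key to stop the attack"
  _ <- getChar
  mapM_ close socks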

facundominguez commented 7 years ago

It is not clear that forkServer needs to change. The new behavior can be encoded as

forkServer' ... requestHandler = forkServer ... $ \sock ->
  forkIO $ do
    mv <- newEmptyMVar
    requestHandler (readMVar mv) sock
      `finally` (close sock >> putMVar mv ())

or similar, please forgive that I do not copy all the details.

This way the old test is unaffected.

avieth commented 7 years ago

It is not clear that forkServer needs to change.

Sure, we could leave it as is and add the new functionality inside the requestHandler. We'd have to bracket the whole thing I believe, rather than just use a finally, as you pointed out in another comment.

But I think a change to forkServer is a good idea. If we leave it as is, we always have tryCloseSocket running whenever there's an error in the request handler. But our request handler will close the socket using its own bracket. There's no sense closing it twice. So take tryCloseSocket out of forkServer? That seems rather unsafe. In my opinion it just makes the most sense that forkServer, the code which creates the socket, ought to be responsible for closing it.
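In other words (a minimal sketch with assumed names, not the actual patch), if forkServer both creates and closes the socket, the pairing can live in a single bracket, so the socket cannot leak regardless of what the handler does:

import Control.Exception (bracket)
import Network.Socket (Socket, accept, close)

-- The code that acquires the socket is also the code that releases it.
serveAccepted :: Socket -> (Socket -> IO ()) -> IO ()
serveAccepted listener handler =
  bracket (fmap fst (accept listener)) close handler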

facundominguez commented 7 years ago

Fair points. Another alternative is to keep only the new behavior and modify the test so the thread handling requests doesn't terminate until the test finishes.

avieth commented 7 years ago

Fair points. Another alternative is to keep only the new behavior and modify the test so the thread handling requests doesn't terminate until the test finishes.

I updated testUnidirectionalError so that it doesn't need the old behaviour. Also patched the accept part of forkServer to mask async exceptions until the request handler is forked, and always close the socket if (somehow) there's a synchronous exception before that happens.
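For reference, a hedged sketch of that masking (assumed names, not the actual patch): async exceptions are masked from accept until the handler thread has been forked, and the socket is closed if a synchronous exception intervenes before then.

import Control.Concurrent (forkIOWithUnmask)
import Control.Exception (finally, mask_, onException)
import Network.Socket (Socket, accept, close)

-- Accept one connection with async exceptions masked until the handler
-- thread exists; if a synchronous exception prevents the fork, close the
-- socket ourselves.
acceptMasked :: Socket -> (Socket -> IO ()) -> IO ()
acceptMasked listener handler = mask_ $ do
  (sock, _peer) <- accept listener
  _ <- forkIOWithUnmask (\unmask -> unmask (handler sock `finally` close sock))
         `onException` close sock
  return ()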

EDIT: but tests fail on 2 of 3 GHC versions... will look into this.

facundominguez commented 7 years ago

CI is stubbornly failing with GHC 7.6.3.

This is starting to look good. At least I cannot find more problems by eye inspection.

avieth commented 7 years ago

CI is stubbornly failing with GHC 7.6.3.

Weird. The output is not very helpful...

This is starting to look good. At least I cannot find more problems by eye inspection.

FWIW the main piece of this pull request (840e4ba0c87a765b60513a126d208071f61eb73c) is already put to use in a project and works well.

facundominguez commented 7 years ago

I have made cabal test stream the output in master. That should at least show which test blocks.

avieth commented 7 years ago

OK, tests are passing. testReconnect was liable to time out on 7.6.3, but I think I've made it more robust... perhaps there are some subtle differences in the MVar implementation between 7.6.x and 7.8.x?

facundominguez commented 7 years ago

Merging! Thx @avieth.