haskell-distributed / distributed-process-client-server

Cloud Haskell - gen_server implementation
http://haskell-distributed.github.io
BSD 3-Clause "New" or "Revised" License

Someone on IRC claims we are leaking file descriptors #7

Closed: hyperthunk closed this issue 7 years ago

hyperthunk commented 7 years ago

I think this requires calling reconnect and it should be in here already, but I'll double check. Via @agentm on IRC afaik.

hyperthunk commented 7 years ago

Hmm, seems likely to be a network-transport-tcp thing actually. I'm pretty certain this library isn't doing anything wrong. Will discuss with @agentm what they're seeing and close or update accordingly.

agentm commented 7 years ago

To be fair, I suspect that I am at fault for the leaking descriptors and am trying to figure out how to resolve it.

In our case, we have a server process proxying access to a d-p-client-server process.

I understand the philosophy of retaining connections, but, in this client-server case, the lifetime for connect/disconnect is clear: the websocket disconnects, so the proxied connection should be disconnected as well. Currently, that's not happening.

If I add the reconnect call to the server-side handler for logging out (via handleCall, à la this), then the server process dies and gets unregistered. I suppose I should be catching the exception, but I'm not sure where it makes sense or how to recover on the server side, since I do want the client side to receive the killed exception. Is this a case for monitoring? Is there an example I could follow?

On a side note, I assumed that by using the same remote NodeId to connect (returned from whereisRemote), the same socket would be re-used for all connections but that appears not to be the case. I'm not sure what I am missing there.

hyperthunk commented 7 years ago

Hi @agentm. First of all, thanks for using the library - sorry I've been AWOL and not maintaining it; I'm getting back on it now!

In our case, we have a server process proxying access to a d-p-client-server process.

There's no wrapping going on there, you spawn a new managed server and it runs on the DB node, that looks fine.

I understand the philosophy of retaining connections, but, in this client-server case, the lifetime for connect/disconnect is clear: the websocket disconnects, so the proxied connection should be disconnected as well. Currently, that's not happening.

I'm a little unclear on how this happens. Where is the websocket code that does this? All I can see is this client module.

If I add the reconnect call to the server-side handler for logging out (via handleCall ala this), then the server process dies and gets unregistered.

I'm a bit unclear on this. You wouldn't need to call reconnect here. The business of reconnect has to do with when the network connection between two nodes drops and you try sending afterwards, and looking at your code I'm not convinced that's what we're seeing.

The general paradigm for managed servers is that the server is long-lived and handles multiple clients, which are differentiated internally by tagging client requests with a unique identifier and relying on Cloud Haskell's ordering guarantees between two isolated processes. The server process should keep running until you shut it down by sending a Shutdown signal (for which there is a structured API that supports shutdown/cleanup logic, etc.).
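
To make that concrete, here's a minimal sketch of that long-lived shape (purely illustrative - the CounterSketch module, the CountUp request type and counterServer are all made up, and it assumes the current ManagedProcess API):

{-# LANGUAGE DeriveGeneric #-}
module CounterSketch where

import Control.Distributed.Process
import Control.Distributed.Process.ManagedProcess
import Control.Distributed.Process.Extras.Time (Delay (Infinity))
import Data.Binary (Binary)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)

-- a made-up request type; any Serializable value will do
data CountUp = CountUp deriving (Typeable, Generic)
instance Binary CountUp

-- One long-lived server, many clients: every handleCall reply is routed back
-- to whichever client issued that request, and the state (here just an Int)
-- survives across requests until the server is shut down.
counterServer :: Int -> Process ()
counterServer initial =
  serve initial (\n -> return (InitOk n Infinity)) definition
  where
    definition = defaultProcess
      { apiHandlers = [ handleCall (\n CountUp -> reply (n + 1) (n + 1)) ] }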

I suppose I should be catching the exception, but I'm not sure where it makes sense or how to recover on the server-side since I do want the client-side to receive the killed exception. snip

Is this a case for monitoring? Is there an example I could follow?

Failures on the server will be automatically detected by clients that are using the call API. If an un-handled exception occurs on the server and no handler is installed to deal with it, the server process will stop with the exception data stored in its ExitReason. On the client side, call is implemented (roughly) in terms of monitor >>= send >>= receive{Timeout}, which means that errors (or, in this case, an exit signal) on the server are returned to the waiting client via the monitor. The safeCall API that you're using wraps this up and returns an Either, so you should be fine.
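
To make that concrete, a hypothetical client-side sketch (the MyRequest/MyReply types and askServer are stand-ins for whatever your handlers accept and return, not anything from your code):

{-# LANGUAGE DeriveGeneric #-}
module SafeCallSketch where

import Control.Distributed.Process
import Control.Distributed.Process.Extras (ExitReason)
import Control.Distributed.Process.ManagedProcess.Client (safeCall)
import Data.Binary (Binary)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)

-- stand-in request/reply types
data MyRequest = MyRequest      deriving (Typeable, Generic, Show)
data MyReply   = MyReply String deriving (Typeable, Generic, Show)
instance Binary MyRequest
instance Binary MyReply

-- If the server dies while we're waiting, the monitor notification surfaces
-- here as a Left ExitReason instead of hanging or crashing the caller.
askServer :: ProcessId -> Process ()
askServer sid = do
  result <- safeCall sid MyRequest :: Process (Either ExitReason MyReply)
  case result of
    Right rep    -> say ("got a reply: " ++ show rep)
    Left  reason -> say ("server exited: " ++ show reason)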

hyperthunk commented 7 years ago

On a side note, I assumed that by using the same remote NodeId to connect (returned from whereisRemote), the same socket would be re-used for all connections but that appears not to be the case. I'm not sure what I am missing there.

In Cloud Haskell there are heavy weight (TCP/IP) connections, and lightweight connections that multiplex over the TCP/IP connection. All interactions between two processes on separate nodes use the network-transport layer to establish a lightweight connection - a full connection over which to multiplex is created if not already present - and this connection is reused whenever possible. Only network failures between the two nodes should cause the issue to which reconnect is linked, which is that once a connection between endpoints has failed, you cannot send again unless you re-establish the connection. The purpose of reconnect is to acknowledge that you understand that no ordering guarantees can be maintained in the face of such connection failures.
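
As a sketch of where reconnect sits in that story (hypothetical pid and payload, nothing from your code):

import Control.Distributed.Process

-- pid is some remote process we were talking to before the inter-node
-- connection dropped
resumeAfterFailure :: ProcessId -> String -> Process ()
resumeAfterFailure pid msg = do
  reconnect pid   -- acknowledge that in-flight messages may be lost and that
                  -- ordering guarantees no longer hold across the failure
  send pid msg    -- only then start sending over the fresh connection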

hyperthunk commented 7 years ago

There's no wrapping going on there, you spawn a new managed server and it runs on the DB node, that looks fine.

Oh wait, I'm quite wrong, I can see what you mean by wrapping now....

Although this shouldn't be a problem - viz. one managed process calling another - I wouldn't structure your process tree like that. Let me look through the code a bit more and I'll try to advise.

hyperthunk commented 7 years ago

Right... It's a little unclear what you're doing, but if I've understood it correctly, then I think I've found the problem in the CH code:

timeoutOrDie :: Timeout -> IO a -> Process (Either ServerError a)
timeoutOrDie micros act = do
  if micros == 0 then
    liftIO act >>= \x -> pure (Right x)
    else do
    mRes <- liftIO (timeout micros act)
    case mRes of
      Just res -> pure (Right res)
      Nothing -> pure (Left RequestTimeoutError)

That timeout is potentially very wrong. When you call these various functions in your Client.hs code, you're spawning a process to handle the communication with the remote server, but doing so from the IO monad. You're then potentially terminating that process, which could leave all kinds of orphaned threads and objects lying around. This isn't how the API is supposed to be used, and failures will silently disappear.

A quick fix would be to use callTimeout from ManagedProcess.Client, which has the right semantics and runs in the Process monad. Personally, though, I'd be more inclined to use callChan and the CH primitive receiveChanTimeout on the returned ReceivePort, since that's a more idiomatic approach (for Cloud Haskell generally) and typed channels are designed to be throw-away.
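
A hedged sketch of both variants (TimeoutSketch, askWithTimeout and askViaChannel are made up, as are the String request / Int reply types; the callChan variant also needs a matching handleRpcChan handler on the server side):

module TimeoutSketch where

import Control.Distributed.Process
import Control.Distributed.Process.Extras.Time (seconds)
import Control.Distributed.Process.ManagedProcess.Client (callChan, callTimeout)

-- Variant 1: callTimeout has call semantics but gives up after the interval,
-- returning Nothing rather than blocking forever. Everything stays in the
-- Process monad, so no worker is spawned from IO and later killed.
askWithTimeout :: ProcessId -> String -> Process (Maybe Int)
askWithTimeout sid req = callTimeout sid req (seconds 5)

-- Variant 2: callChan hands back a throw-away typed channel, and the plain
-- CH primitive receiveChanTimeout (timeout in microseconds) bounds the wait.
askViaChannel :: ProcessId -> String -> Process (Maybe Int)
askViaChannel sid req = do
  rp <- callChan sid req
  receiveChanTimeout 5000000 rp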

That's the workaround at least. I should point out here that it's still not clear whether or not leaking Cloud Haskell resources has anything to do with your running out of file descriptors.

Now, what I /would/ do if I were you is redesign the process tree a little. First of all, it seems to me that your main (outer) server is mainly there to maintain a list of currently active sessions. That's great, so maybe consider abstracting your sessions out into their own processes. If your sessions need to wrap a handle to the underlying database/file-system/etc. then you can use STM to share data between a parent and its child processes (as long as the children are spawned locally to the parent process).
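
A hypothetical sketch of that sharing, with a bare TVar counter standing in for whatever handle your sessions actually need (SharedStateSketch, parentWithSessions and sessionWorker are all made up):

module SharedStateSketch where

import Control.Concurrent.STM (TVar, atomically, modifyTVar', newTVarIO, readTVarIO)
import Control.Distributed.Process
import Control.Monad (forever)

-- The parent creates the handle and every locally spawned child closes over
-- it. This only works for children started with spawnLocal on the same node -
-- a TVar cannot be serialised and shipped to a remote node.
parentWithSessions :: Process ()
parentWithSessions = do
  shared <- liftIO (newTVarIO (0 :: Int))
  _ <- spawnLocal (sessionWorker shared)
  _ <- spawnLocal (sessionWorker shared)
  -- the parent can inspect the shared handle whenever it likes
  n <- liftIO (readTVarIO shared)
  say ("sessions handled so far: " ++ show n)

sessionWorker :: TVar Int -> Process ()
sessionWorker shared = forever $ do
  () <- expect                                    -- wait for a (unit) work item
  liftIO (atomically (modifyTVar' shared (+ 1)))  -- record it in the shared handle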

There is a great API available for doing this, in the distributed-process-async package. For starters, your timeout code would be greatly simplified:

-- (task wraps a Process action as an AsyncTask; stash here is a helper -
-- e.g. from the package's test code - that records the result for inspection)
testAsyncWaitTimeout result = do
    hAsync <- async $ task (expect :: Process ())  -- start the action asynchronously
    waitTimeout 100000 hAsync >>= stash result     -- wait for it, bounded by a timeout
    cancelWait hAsync >> return ()                 -- then cancel it and wait for it to die

If you're going to use async in a managed process then I suggest you combine this with client channels for ease of use. There's a good example in the Task API, and the API documentation is pretty clean and easy to follow. Note that async uses STM internally.

But I wouldn't even bother with this tbh. Instead of the session management server performing I/O and/or STM actions, you should break out the logic for an individual session into its own process and have the parent either return an opaque handle to the session (which the websocket server can store in memory for its persistent connection) or you can use the parent process as a dispatcher if you prefer. If child processes need to coordinate around a shared resource then you either isolate that in its own process (be it the parent server process, or some other process) or use a shared STM handle that all the children inherit. There are lots of examples of this that I can provide.

If you choose to isolate sessions in their own child processes, then I'd suggest supervising them. The easiest way to do this would be to spawn a supervision tree linked to the session management server, and add child handles for each new session (you can wrap managed-process startup functions using the ToChildStart typeclass). Even if you don't automatically restart failed sessions using the supervisor, you can segregate the processes responsible for various tasks in the system (transaction management, I/O, etc) in this way and have sub-trees of supervised processes managed centrally. Of course, monitoring supervised children is very easy too.

If you're feeling brave, you might also want to take a look at the resource pool implementation here. In particular it handles resource acquisition and both deliberate and accidental client disconnects, all using the managed process API. The latter (monitor notifications of disconnects) are handled by the backing pool (i.e., the implementation the server defers to at runtime; see the example code here). Please note that the pool implementations aren't fit for production use yet - there are some bugs and failing tests - so please don't use these or copy them verbatim; I'm just pointing out techniques you might want to be aware of when constructing complex managed processes using -client-server.

If the -task or -supervisor libraries are failing in CI, it's probably because the Travis config is very out of date. I will try to go through and fix these ASAP.

Do please shout with any further questions - I'm most happy to help, and of course will help hunt for leaking file handles as best I can too.

agentm commented 7 years ago

Wow, thanks for pointing out the mistaken assumptions in my code and the additional details. I will definitely want to fix the timeout.

I spent a few hours digging into the file handle issue specifically and was able to narrow it down to my misunderstanding of "ownership" of the transport. When a new local node is created, it adds an endpoint to the transport; closeLocalNode closes that endpoint but doesn't close the transport, thus hanging onto the socket. In hindsight, it's clear that many nodes can share a single transport.

I am able to resolve the leaking handles by retaining a reference to the transport and closing it separately.
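
For anyone hitting the same thing, a minimal sketch of that ownership split (runNodeOn is made up, and creating the Transport is elided because network-transport-tcp's createTransport signature varies between versions):

import Control.Distributed.Process.Node (closeLocalNode, initRemoteTable, newLocalNode)
import Network.Transport (Transport, closeTransport)

runNodeOn :: Transport -> IO ()
runNodeOn transport = do
  node <- newLocalNode transport initRemoteTable
  -- ... fork processes on the node and do some work ...
  closeLocalNode node       -- tears down the node and its endpoint only
  closeTransport transport  -- the listening socket is released here; nodes
                            -- merely borrow the transport, they don't own it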

hyperthunk commented 7 years ago

Ah perfect, that makes a good bit of sense - I'm glad you found it!

Feel free to shout with any further / future issues or questions!