Make connect completely asynchronous on non-blocking sockets

mboes commented 8 years ago

If I understand the current connect code correctly, it calls connect(2), then checks whether the socket is non-blocking. If it is, it waits and then calls connect again to detect any error condition properly. The fact that connect blocks, potentially for an arbitrarily long time, even on sockets explicitly marked as non-blocking, is problematic, in particular because it can take a long time for the connection attempt to fail when the remote host is no longer reachable.

The docstring mentions http://cr.yp.to/docs/connect.html regarding potential pitfalls of connect(2) on non-blocking sockets. But I think that document is outdated by now - it's largely concerned with systems that are by now extinct (or on which GHC does not run anyway). In particular, practically all modern systems have getsockopt() now, so it should be possible to use that to handle all conditions safely. Google's gRPC does things this way, and is fairly portable. What's good enough for them should be good enough for socket. :)

lpeterse commented 8 years ago

Hi,

first of all thanks for the contribution.

In medias res: All sockets are set to non-blocking on creation, but to a library user they appear to behave as blocking sockets. All other library operations (like send or receive and the like) are also blocking. The noteworthy point is that the blocking is achieved through GHC's RTS and its eventing system instead of blocking on the syscall (as a traditional blocking socket would do). The thread that issues the connect call gets blocked by yielding control to the RTS and is resumed as soon as the RTS receives an event for the file descriptor. However, the program does not yield and lose control to the operating system. The RTS scheduler can continue to employ the OS thread for other things as long as the socket is not yet ready with whatever action was requested. In case the non-threaded runtime is used, this is crucial for keeping the program responsive at all (the network library hangs during connect calls on Windows!).

Alas, I'm not really happy with the current implementation.

It's not about the potentially long timeout. If my assumptions are correct, the underlying c_connect call always returns immediately - it's only the library's connect that does not return until the connection state has been confirmed. The RTS will then continue executing other threads (if any).

As Haskell threads are cheap one could supervise the connect call with another explicit timeout that eventually cancels the connection attempt.

withTimeout :: Int -> IO a -> IO a
withTimeout = ...

foo = do
  withTimeout 60 $ connect socket address
  ..

The problem is rather that after a failed connect, the current mechanism (which issues a second connect) actually succeeds in starting a second connection attempt (the library incorrectly throws eTimedOut in this case).

The question is: How to determine the success or failure of a connection attempt on a non-blocking socket? My goal was that the outcome shall be known when leaving the connect operation. The alternative would have been that the socket state is unknown after the connect, so a user might erroneously assume that the socket is connected, only to have it fail (much) later.

Did I miss something? How would you say the connect should behave? What would be least surprising? I hope I got your point.

lpeterse commented 8 years ago

I just stumbled upon a suggestion by @atlaua in haskell/network issue #130:

An alternative implementation of isConnected should be possible by calling getpeername() and checking for an ENOTCONN error.

It is already mentioned in http://cr.yp.to/docs/connect.html. I don't remember why I decided to implement the inferior second-connect alternative.

lpeterse commented 8 years ago

For the record, here is what the Linux manpage says for connect:

  EINPROGRESS
       The socket is nonblocking and the connection cannot be
         completed immediately.  It is possible to select(2) or poll(2)
         for completion by selecting the socket for writing.  After
         select(2) indicates writability, use getsockopt(2) to read the
         SO_ERROR option at level SOL_SOCKET to determine whether
         connect() completed successfully (SO_ERROR is zero) or
         unsuccessfully (SO_ERROR is one of the usual error codes
         listed here, explaining the reason for the failure).
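The last step of that procedure, interpreting the value read from SO_ERROR, could be sketched like this using only Foreign.C.Error from base. The function name soErrorToResult is hypothetical, and the sketch assumes the value was already obtained via getsockopt after the socket became writable.

```haskell
import Foreign.C.Error (Errno (..), eCONNREFUSED, eOK, errnoToIOError)
import Foreign.C.Types (CInt)

-- Hypothetical helper: turn the SO_ERROR value into success or an IOError.
-- Zero means the connect completed; anything else is a regular errno code.
soErrorToResult :: CInt -> Either IOError ()
soErrorToResult code
  | err == eOK = Right ()
  | otherwise  = Left (errnoToIOError "connect" err Nothing Nothing)
  where
    err = Errno code

main :: IO ()
main = do
  -- A zero SO_ERROR is treated as a successful connect.
  print (soErrorToResult 0)
  -- ECONNREFUSED is used here as an example of a failed attempt.
  let Errno refused = eCONNREFUSED
  print (soErrorToResult refused)
```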

I'll investigate whether this is portable to Windows.

lpeterse commented 8 years ago

Grr, works on Linux, but not (yet) on Windows:

System
  Socket
    connect
      connect to closed port on inetLoopback: FAIL
        connection should have failed
      connect to closed port on inetNone:     FAIL
        Exception: eAddressNotAvailable

The second test yields a different exception than expected and the first test claims the connection has been established which is definitely not true.

Will investigate further...

lpeterse commented 8 years ago

Ok. Here is what I found out so far:

      Just wait -> do
        -- This either waits or does nothing.
        wait
        -- Use `getsockopt` to get the actual socket errno.
        Error se <- getSocketOption (Socket mfd)
        -- Throw exception when status code is != 0.
        when (se /= eOk) (throwIO se)

On Windows the wait does not really wait until the socket signals writability, but just introduces a fixed delay. It seems like getsockopt(fd, SOL_SOCKET, SO_ERROR, &err) returns 0 itself and reports 0 as the error value while the socket connection is still pending. If I introduce a delay of 1 second before calling it, I get the expected ECONNREFUSED.

lpeterse commented 8 years ago

On Windows the only mechanism I can think of to wait for the connection to either fail or succeed is indeed select:

https://msdn.microsoft.com/de-de/library/windows/desktop/ms740141(v=vs.85).aspx

lpeterse commented 8 years ago

For Windows, I developed a solution that uses select and an exponential backoff waiting mechanism. I also added and validated tests for this.
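The general shape of an exponential backoff wait can be sketched as follows. This is not the code from the library: pollWithBackoff, its parameters, and the fake readiness check are all assumptions for illustration, and the real solution would call select where this sketch calls a generic check action.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (atomicModifyIORef', newIORef)

-- Hypothetical helper: poll 'check' until it yields a result, doubling the
-- delay (in microseconds) after every unsuccessful attempt, up to 'maxDelay'.
pollWithBackoff :: Int -> Int -> IO (Maybe a) -> IO a
pollWithBackoff initialDelay maxDelay check = go initialDelay
  where
    go delay = do
      result <- check
      case result of
        Just x  -> pure x
        Nothing -> do
          threadDelay delay
          go (min maxDelay (delay * 2))

main :: IO ()
main = do
  -- Stand-in for a select call: reports "ready" on the fourth check.
  counter <- newIORef (0 :: Int)
  let fakeCheck = do
        n <- atomicModifyIORef' counter (\c -> (c + 1, c + 1))
        pure (if n >= 4 then Just n else Nothing)
  attempts <- pollWithBackoff 1000 100000 fakeCheck
  putStrLn ("ready after " ++ show attempts ++ " checks")
```

Starting with a short delay keeps fast connects (such as a refused connection on loopback) cheap, while the cap bounds how late a slow result can be noticed.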

lpeterse / haskell-socket

Make connect completely asynchronous on non-blocking sockets #15