Closed mboes closed 8 years ago
Hi,
first of all thanks for the contribution.
In medias res: All sockets are set to non-blocking on creation, but to a library user they appear to behave like blocking sockets. All other library operations (such as send and receive) are blocking as well. The noteworthy point is that the blocking is achieved through GHC's RTS and its event system instead of blocking in the syscall (as a traditional blocking socket would do). The thread that issues the connect call blocks by yielding control to the RTS and is resumed as soon as the RTS receives an event for the file descriptor. However, the program does not yield and lose control to the operating system. The RTS scheduler can continue to use the OS thread for other work as long as the socket is not yet ready for whatever action was requested. With the non-threaded runtime this is crucial for keeping the program responsive at all (the network library hangs during connect calls on Windows!).
Alas, I'm not really happy with the current implementation.
It's not about the potentially long timeout. If my assumptions are correct, the underlying c_connect call always returns immediately; it's only that connect will not return until the connection state has been confirmed. In the meantime the RTS continues executing other threads (if any).
As Haskell threads are cheap, one could supervise the connect call with another explicit timeout that eventually cancels the connection attempt.
    withTimeout :: Int -> IO a -> IO a
    withTimeout = ...

    foo = do
      withTimeout 60 $ connect socket address
      ..
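A minimal sketch of such a supervising timeout, using only `System.Timeout.timeout` from base. The `ConnectTimeout` exception and the choice of seconds as the unit are assumptions made for illustration, not part of the library; `demo` uses `threadDelay` as a stand-in for a slow connect.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (Exception, throwIO, try)
import System.Timeout (timeout)

-- Hypothetical exception type (not part of the socket library).
data ConnectTimeout = ConnectTimeout
  deriving (Show)

instance Exception ConnectTimeout

-- | Run an action and abort it with 'ConnectTimeout' if it has not
-- finished after the given number of seconds (assumed unit).
withTimeout :: Int -> IO a -> IO a
withTimeout seconds action = do
  -- 'timeout' takes microseconds and interrupts the action with an
  -- asynchronous exception once the limit is reached.
  result <- timeout (seconds * 1000000) action
  case result of
    Just a  -> pure a
    Nothing -> throwIO ConnectTimeout

-- | Stand-in for a slow connect: sleeps for 3 seconds under a 1 second limit.
demo :: IO (Either ConnectTimeout ())
demo = try (withTimeout 1 (threadDelay 3000000))
```

Because the blocked thread is only parked in the RTS, the asynchronous exception can interrupt it promptly. The caller would still have to close the socket afterwards, since its state is unknown once the attempt has been cancelled.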
The problem is rather that after a failed connect, the current mechanism (which uses a second connect) actually succeeds in starting a second connection attempt (the library incorrectly throws eTimedOut in this case).
The question is: how can the success or failure of a connection attempt on a non-blocking socket be determined? My goal was that the outcome is known when leaving the connect operation. The alternative would have been that the socket state is unknown after the connect, so that a user might erroneously assume that the socket is connected and it then fails (far) later.
Did I miss something? How would you say the connect should behave? What would be least surprising? I hope I got your point.
I just stumbled upon a suggestion by @atlaua in haskell/network issue #130:
An alternative implementation of isConnected should be possible by calling getpeername() and checking for an ENOTCONN error.
It is already mentioned in http://cr.yp.to/docs/connect.html. I don't remember why I decided to implement the inferior second-connect alternative.
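The getpeername() idea could look roughly like the following sketch. It goes through the FFI directly rather than through any library API; the function name `isConnected` is taken from the quote, but the signature, the use of a raw `Fd`, and the 128-byte address buffer (the usual size of `struct sockaddr_storage`) are assumptions, and it targets POSIX systems only.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Error (eNOTCONN, getErrno, throwErrno)
import Foreign.C.Types (CChar, CInt (..), CUInt (..))
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Utils (with)
import Foreign.Ptr (Ptr)
import System.Posix.Types (Fd (..))

-- int getpeername(int sockfd, struct sockaddr *addr, socklen_t *addrlen);
foreign import ccall unsafe "getpeername"
  c_getpeername :: CInt -> Ptr CChar -> Ptr CUInt -> IO CInt

-- | True if the socket has a peer, False if getpeername() reports
-- ENOTCONN. Any other failure is rethrown as an IOError.
isConnected :: Fd -> IO Bool
isConnected (Fd fd) =
  allocaBytes 128 $ \addrPtr ->        -- assumed sizeof(struct sockaddr_storage)
    with (128 :: CUInt) $ \lenPtr -> do
      rc <- c_getpeername fd addrPtr lenPtr
      if rc == 0
        then pure True
        else do
          errno <- getErrno
          if errno == eNOTCONN
            then pure False
            else throwErrno "getpeername"
```

Note that ENOTCONN alone does not say why the socket is not connected, so after a failed attempt the actual error would still have to be obtained some other way.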
For the record, here is what the Linux manpage says for connect:
EINPROGRESS The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
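The procedure described in the manpage can be sketched with the FFI and base alone. This is an illustration under stated assumptions, not the library's code: the names `pendingConnectError` and `awaitConnect` are hypothetical, the constants are the values of SOL_SOCKET and SO_ERROR on common Linux ABIs (they differ on other systems), and `threadWaitWrite` stands in for select(2)/poll(2) by letting the RTS wait for writability.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Control.Concurrent (threadWaitWrite)
import Foreign.C.Error (throwErrnoIfMinus1_)
import Foreign.C.Types (CInt (..), CUInt (..))
import Foreign.Marshal.Alloc (alloca)
import Foreign.Marshal.Utils (with)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, sizeOf)
import System.Posix.Types (Fd (..))

-- int getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen);
foreign import ccall unsafe "getsockopt"
  c_getsockopt :: CInt -> CInt -> CInt -> Ptr CInt -> Ptr CUInt -> IO CInt

-- Assumed constants: values on common Linux ABIs only.
solSocket, soError :: CInt
solSocket = 1
soError   = 4

-- | Read (and thereby clear) the pending error of a socket.
-- 0 means no error; otherwise the result is an errno value
-- such as ECONNREFUSED.
pendingConnectError :: Fd -> IO CInt
pendingConnectError (Fd fd) =
  alloca $ \errPtr ->
    with (fromIntegral (sizeOf (0 :: CInt))) $ \lenPtr -> do
      throwErrnoIfMinus1_ "getsockopt" $
        c_getsockopt fd solSocket soError errPtr lenPtr
      peek errPtr

-- | After connect() returned EINPROGRESS: block the Haskell thread in
-- the RTS until the socket is writable, then fetch the outcome.
awaitConnect :: Fd -> IO CInt
awaitConnect fd = do
  threadWaitWrite fd
  pendingConnectError fd
```

A caller would turn a non-zero result into the corresponding exception, so the outcome is known when connect returns and no second connect(2) call is needed.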
I'll investigate whether this is portable to Windows.
Grr, works on Linux, but not (yet) on Windows:
System
  Socket
    connect
      connect to closed port on inetLoopback: FAIL
        connection should have failed
      connect to closed port on inetNone: FAIL
        Exception: eAddressNotAvailable
The second test yields a different exception than expected, and the first test claims the connection has been established, which is definitely not true.
Will investigate further...
Ok. Here is what I found out so far:
    Just wait -> do
      -- This either waits or does nothing.
      wait
      -- Use `getsockopt` to get the actual socket errno.
      Error se <- getSocketOption (Socket mfd)
      -- Throw exception when status code is != 0.
      when (se /= eOk) (throwIO se)
On Windows the wait does not really wait until the socket signals writability, but just introduces a fixed delay. It seems that getsockopt(fd, SOL_SOCKET, SO_ERROR, &err) itself returns 0 and also reports 0 as the error value while the connection is still pending. If I introduce a delay of 1 second before calling it, I get the expected ECONNREFUSED.
On Windows the only mechanism I can think of to wait for the connection to either fail or succeed is indeed select:
https://msdn.microsoft.com/de-de/library/windows/desktop/ms740141(v=vs.85).aspx
For Windows, I developed a solution that uses select and an exponential backoff waiting mechanism. I also added and validated tests for this.
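The actual code of that solution is not shown in this thread. As a rough illustration of the waiting mechanism only, polling with exponential backoff can be sketched as below; `pollWithBackoff` and its parameters are hypothetical names, and the `check` action stands in for a zero-timeout select call on the socket.

```haskell
import Control.Concurrent (threadDelay)
import Data.IORef (atomicModifyIORef', newIORef)

-- | Repeatedly run a non-blocking check until it yields a result.
-- Sleeps between attempts, doubling the delay (in microseconds)
-- each time up to the given maximum.
pollWithBackoff :: Int -> Int -> IO (Maybe a) -> IO a
pollWithBackoff initialDelay maxDelay check = go initialDelay
  where
    go delay = do
      result <- check
      case result of
        Just a  -> pure a
        Nothing -> do
          -- threadDelay parks only this Haskell thread; the RTS
          -- keeps running other threads in the meantime.
          threadDelay delay
          go (min maxDelay (delay * 2))

-- | Usage example: a check that reports readiness on its fourth call.
example :: IO Int
example = do
  calls <- newIORef (0 :: Int)
  pollWithBackoff 1000 8000 $ do
    n <- atomicModifyIORef' calls (\c -> (c + 1, c + 1))
    pure (if n >= 4 then Just n else Nothing)
```

The trade-off is latency against wasted wake-ups: a connection that resolves quickly is noticed after a short delay, while a slow one costs only a handful of polls.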
If I understand the current connect code correctly, it calls connect(2), then checks whether the socket was non-blocking. If it is, then it waits, and then it connects again to detect any error condition properly. The fact that connect is blocking, potentially for an arbitrarily long time, even for sockets explicitly marked as non-blocking, is problematic, in particular because it can take a long time for the connection attempt to fail when the remote host is no longer reachable.
The docstring mentions http://cr.yp.to/docs/connect.html regarding potential pitfalls of connect(2) on non-blocking sockets. But I think that document is outdated by now - it's largely concerned with systems that are by now extinct (or on which GHC does not run anyway). In particular, practically all modern systems have getsockopt() now, so it should be possible to use that in order to handle all conditions safely. Google's gRPC does things this way, and is fairly portable. What's good enough for them should be good enough for socket. :)