Open essdotteedot opened 6 years ago
I looked in the Lwt source code, and a bit in OCaml, and I don't immediately see the problem. Since I don't have easy access to a MinGW install to try things out, here is what I suggest:
Lwt_io.open_connection mainly calls Lwt_unix.socket and Lwt_unix.connect (links are to the source). Lwt_unix.socket is a pretty thin wrapper around Unix.socket, and I don't see anything noteworthy there. Lwt_unix.connect is ultimately a wrapper around Unix.connect. However, it seems to call Unix.connect in a retry loop (whose "body" is spread across several functions here). I wonder if this loop could be causing the hang.
I'm not sure about the extra connections. My suggestion for that is to write a similar short program using only Unix from the standard library, and see what the behavior is and whether extra connections appear, if you haven't already done so. Then, we can be sure it's an Lwt problem. We may have to cc fdopen or other people with proper Windows experience to debug this.
Programs written using Unix don't seem to have the same problems on Windows.
(* Program 1: connect through Unix.open_connection *)
let () =
  let _ = Unix.open_connection (Unix.ADDR_INET (Unix.inet_addr_of_string "127.0.0.1", 46000)) in
  ()

(* Program 2: create the socket by hand, then connect (blocking) *)
let () =
  let sock_addr = Unix.ADDR_INET (Unix.inet_addr_of_string "127.0.0.1", 46000) in
  let sock = Unix.socket (Unix.domain_of_sockaddr sock_addr) Unix.SOCK_STREAM 0 in
  Unix.connect sock sock_addr
Both of the above programs error out with Fatal error: exception Unix.Unix_error(63, "connect", ""). No extra connections are created.
The client/server programs written using Unix don't create extra connections either.
(* Client: connects to port 46000 and writes "Hello" lines forever *)
let () =
  let rec handler out_ch =
    output_string stdout "Hello\n" ;
    output_string out_ch "Hello\n" ;
    handler out_ch
  in
  let _, out_ch = Unix.open_connection (Unix.ADDR_INET (Unix.inet_addr_of_string "127.0.0.1", 46000)) in
  handler out_ch

(* Server: accepts one connection on port 46000 and echoes each line back *)
let () =
  let rec handler_fn in_ch out_ch =
    let s = input_line in_ch in
    output_string stdout ("Got " ^ s ^ " from client.") ;
    output_string out_ch s ;
    handler_fn in_ch out_ch
  in
  let sock_addr = Unix.ADDR_INET (Unix.inet_addr_any, 46000) in
  let sock = Unix.socket (Unix.domain_of_sockaddr sock_addr) Unix.SOCK_STREAM 0 in
  Unix.bind sock sock_addr ;
  Unix.listen sock 3 ;
  let (s, _) = Unix.accept sock in
  let in_ch = Unix.in_channel_of_descr s
  and out_ch = Unix.out_channel_of_descr s in
  handler_fn in_ch out_ch
Starting server then client results in the following netstat output :
Active Connections

  Proto  Local Address     Foreign Address   State        PID
  ...
  TCP    0.0.0.0:46000     0.0.0.0:0         LISTENING    6172
 [server2.exe]
  TCP    127.0.0.1:46000   127.0.0.1:53256   ESTABLISHED  6172
 [server2.exe]
  TCP    127.0.0.1:53256   127.0.0.1:46000   ESTABLISHED  4624
 [client2.exe]
  ...
I can investigate further by looking at the lwt codebase.
That would probably be the most helpful. Thanks if you do.
In addition to the code I linked above, Lwt_engine is involved. It implements the I/O loops that are "pumped" by Lwt_main. You're most likely using the select engine, so only that code should be relevant (but do check which engine is actually being used).
If you haven't already done so, I suggest getting the Lwt source and inserting prints to figure out what is really being called and when among all these functions. Using prerr_endline and/or Printf.eprintf "...\n%!" ... should be enough.
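As a minimal sketch of that kind of print debugging, a small helper can tag each message with the function it came from. The names trace_line and trace are hypothetical and not part of Lwt; the example call sites are only illustrative.

```ocaml
(* Hypothetical helpers, not part of Lwt. *)

(* Build a trace line tagged with the place it was emitted from. *)
let trace_line where msg = Printf.sprintf "[%s] %s" where msg

(* printf-style tracing to stderr; prerr_endline flushes after each line,
   so output is not lost if the program hangs afterwards. *)
let trace where fmt =
  Printf.ksprintf (fun s -> prerr_endline (trace_line where s)) fmt

let () =
  trace "Lwt_unix.connect" "attempting connect";
  trace "select engine" "waiting on %d writable fd(s)" 1
```

Calls like these can be dropped into the functions under suspicion and removed again once the call order is clear.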
Let me know if you have any questions that I can readily answer (i.e. without actually being on a MinGW system :/).
I can confirm that the select engine is being used. The two socket connections that are always made look to be a Windows implementation detail (I don't think this is pertinent to the issue).
The issue seems to be related to API differences between Linux and Windows when it comes to connect on nonblocking sockets. When Lwt_io.open_connection is called without supplying a file descriptor, the first thing it does is create a socket by calling Lwt_unix.socket, which sets the socket to be nonblocking. The manpage for connect indicates that:
EINPROGRESS The socket is nonblocking and the connection cannot be completed immediately. It is possible to select(2) or poll(2) for completion by selecting the socket for writing. After select(2) indicates writability, use getsockopt(2) to read the SO_ERROR option at level SOL_SOCKET to determine whether connect() completed successfully (SO_ERROR is zero) or unsuccessfully (SO_ERROR is one of the usual error codes listed here, explaining the reason for the failure).
This lines up with the implementation of Lwt_unix.connect for Linux. For Windows the documentation is slightly different :
For connection-oriented, nonblocking sockets, it is often not possible to complete the connection immediately. In such a case, this function returns the error WSAEWOULDBLOCK. However, the operation proceeds. When the success or failure outcome becomes known, it may be reported in one of two ways, depending on how the client registers for notification. If the client uses the select function, success is reported in the writefds set and failure is reported in the exceptfds set.
In the lwt implementation, the connection is first attempted, then the EWOULDBLOCK is caught and a wait for writability is registered with the select engine. Finally, the select engine does a blocking (timeout of -1.0) Unix.select, passing in the socket for the open connection attempt in the writable set and an empty error set. At this point the select just waits forever, because on Windows a failed connect is reported in the error set and never in the writable set.
The problem with lwt's nonblocking connect on Windows can be demonstrated with the following Unix based program :
let () =
  let sock_addr = Unix.ADDR_INET (Unix.inet_addr_of_string "127.0.0.1", 46000) in
  let sock = Unix.socket (Unix.domain_of_sockaddr sock_addr) Unix.SOCK_STREAM 0 in
  Unix.set_nonblock sock ;
  try
    Unix.connect sock sock_addr
  with
  | Unix.Unix_error (Unix.EWOULDBLOCK, _, _) ->
    (* Wait for writability only; the error set is empty *)
    let _ = Unix.select [] [sock] [] (-1.0) in ()
  | _ -> assert false
The above program just waits forever on the select call. If the program is modified so that the socket is passed into the error set as well on the select call then it does not hang and exits.
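The modified version can be sketched as a small function that puts the socket in both the write set and the error set, then reads SO_ERROR to find out how the attempt ended. The name connect_nonblocking is hypothetical, and the sketch assumes OCaml 4.03 or later (for the result type) and that Unix.getsockopt_error reports the failure on Windows the same way it does on Linux, which I have not verified.

```ocaml
(* Hypothetical helper: nonblocking connect that cannot hang on a
   refused connection. Returns Ok () or Error with the Unix error. *)
let connect_nonblocking sock addr =
  Unix.set_nonblock sock;
  match Unix.connect sock addr with
  | () -> Ok ()
  | exception Unix.Unix_error ((Unix.EWOULDBLOCK | Unix.EINPROGRESS), _, _) ->
    (* Linux signals both outcomes through writability; Windows signals
       failure through the error set. Watching both covers either case. *)
    let _ = Unix.select [] [sock] [sock] (-1.0) in
    (* Assumption: SO_ERROR holds the connect result on both platforms. *)
    (match Unix.getsockopt_error sock with
     | None -> Ok ()
     | Some e -> Error e)
  | exception Unix.Unix_error (e, _, _) -> Error e

let () =
  let addr = Unix.ADDR_INET (Unix.inet_addr_of_string "127.0.0.1", 46000) in
  let sock = Unix.socket (Unix.domain_of_sockaddr addr) Unix.SOCK_STREAM 0 in
  (match connect_nonblocking sock addr with
   | Ok () -> prerr_endline "connected"
   | Error e -> prerr_endline ("connect failed: " ^ Unix.error_message e));
  Unix.close sock
```

With nothing listening on port 46000, this should report a failure instead of blocking in select.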
Excellent investigation, thank you.
It looks like the select engine and the code that interacts with it in Lwt_unix would need non-trivial modification to fix this the "right way." There might be a more readily-coded dirty fix, but I don't see it.
A really dirty fix might be to put the socket into blocking mode for connect, and run the attempt in a worker thread. This will also have performance implications due to the round-trip to the worker thread pool.
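A rough sketch of that workaround from the caller's side, assuming the lwt.unix package with thread support is available. It uses Lwt_preemptive.detach to run a blocking Unix.connect on a worker thread; connect_via_worker is a hypothetical name, and this is not how Lwt_unix.connect is implemented.

```ocaml
(* Hypothetical workaround, written against the public Lwt API. *)
let connect_via_worker (addr : Unix.sockaddr) : Lwt_unix.file_descr Lwt.t =
  let sock = Unix.socket (Unix.domain_of_sockaddr addr) Unix.SOCK_STREAM 0 in
  Lwt.catch
    (fun () ->
      (* The socket is still in blocking mode, so Unix.connect blocks
         the worker thread rather than the Lwt main loop. *)
      Lwt.bind
        (Lwt_preemptive.detach (fun () -> Unix.connect sock addr) ())
        (fun () ->
          (* Hand the connected socket to Lwt, switching it to
             nonblocking mode for subsequent I/O. *)
          Lwt.return (Lwt_unix.of_unix_file_descr ~blocking:false sock)))
    (fun exn ->
      (* Do not leak the descriptor if the connect fails. *)
      Unix.close sock;
      Lwt.fail exn)
```

A failed connect then surfaces as a rejected promise carrying the Unix.Unix_error, at the cost of the thread-pool round-trip mentioned above.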
@essdotteedot You may benefit by replacing lwt.unix by uwt, which is based on libuv and probably has most issues like this one resolved. I'm under the impression that uwt works well in production on Windows.
(...and we would like to replace lwt.unix by uwt in this repo, #328).
Thank you for the comments. I was not aware of uwt, hopefully it solves my Windows issues.
The following program hangs on Windows :
On Linux this works as expected (the program errors out with Unix.Unix_error (Unix.ECONNREFUSED, "connect", "")).
Doing a netstat -a -b -n -o on Windows I see the following:
These two additional connections happen even when a successful connection is established. Consider the following programs :
client.ml
server.ml
Running server then client the netstat output is as follows :
I've tested against lwt 3.2.1 and the safer-semantics branch, with pre-compiled OCaml versions 4.02.3+mingw64c and 4.05.0+mingw64c. The pre-compiled OCaml versions are sourced from https://github.com/fdopen/opam-repository-mingw.