linuxkit / virtsock

Go bindings for virtio and Hyper-V sockets
Other
88 stars 30 forks source link

go: client on Linux get EOF on read #55

Open rn opened 6 years ago

rn commented 6 years ago

When running the sock_stress tests with the server on the host (sock_stress -v 1 -s hvsock) and the client in the Linux VM sock_stress -v 1 -c hvsock://parent the client frequently gets errors on receive, like this

2018/02/07 13:30:41 [00086] Failed to receive: EOF
2018/02/07 13:30:41 [00086] Failed to receive: EOF
2018/02/07 13:30:41 [00086] TX/RX:      49349 bytes in    36.6093 ms
2018/02/07 13:30:41 [00086] Checksums don't match
2018/02/07 13:30:41 [00087] Failed to receive: EOF
2018/02/07 13:30:41 [00087] Failed to receive: EOF

The server does not report any errors and the test works fine in the other direction.

It seems timing sensitive as some connections work fine, even with a similar amount of data transmitted.

The server (on the host) is basically performing:

con := Accept()
io.Copy(con, con)
con.Close()

The client in one go-routine is sending random data and when all data is sent calls (CloseWrite()). In another go routine is it repeatably calling io.ReadAll() of fixed sized chunks (min of buffer or data left to read). It's the io.ReadAll() which sees the EOF before all data is read.

rn commented 6 years ago

With #57 we see a bit more:

$ sock_stress -c hvsock://parent -e -v 1 -i 5 -B 8192 -b 8192
Client connecting to a42e7cda-d03f-480c-9cc2-a4de20abb878:3049197c-facb-11e6-bd58-64006a7986d3
2018/02/08 14:02:44 [00000] TX/RX:      53495 bytes in     4.6309 ms
2018/02/08 14:02:44 [00001] TX/RX:      47466 bytes in     3.8491 ms
2018/02/08 14:02:44 [00002] TX/RX:      29659 bytes in     1.6459 ms
2018/02/08 14:02:44 [00003] Failed to receive after 32768 of 64872 bytes: Error after reading 0 of 8192 bytes: EOF

So basically, half way through receiving 64872B from the host the read gets an EOF. The test run was donw with 8192B buffers but we see similar behaviour with the default 16kB buffers, but there it is more likely to be the last batch which fails. The individual batch failing always seems to fail with 0B read.

rn commented 6 years ago

I get similar error when using a Win 10 Pro (Fall Creators Update) connecting to a Win 10 Pro host where the server runs:

PS> .\sock_stress.exe -v 1 -c hvsock://parent -i 10 -e
Client connecting to a42e7cda-d03f-480c-9cc2-a4de20abb878:3049197c-facb-11e6-bd58-64006a7986d3
2018/02/08 15:32:40 [00000] TX/RX:      17568 bytes in     3.9851 ms
2018/02/08 15:32:40 [00001] Failed to receive after 32768 of 61753 bytes: Error after reading 0 of 16384 bytes: The network connection was aborted by the local system.
rn commented 6 years ago

With the C implementation in c/hvecho.c we see the same issue. hvecho has a server and client mode. The server:

for (;;) {
    res = recv(buf);
    if (res == 0)
        break;
    send(buf);
}
send("Bye");
close();

So basically, the server echo's data received back until the recv() returns 0 indicating shutdown() and then sends a last message before closing.

The client does:

send(msg);
recv(msg);
shutdown(SD_SEND);
recv(msg);

ie after the it receives the echo calls shutdown() and then expects a bye message from the server.

With the server inside a Linux VM and the client on the Windows host, this works fine. Client output(Windows):

Connect to: 73cc71ad-491c-4c0a-af4e-042691ac1bf9:3049197c-facb-11e6-bd58-64006a7986d3
Bytes Sent: 14
Bytes received: 14
->this is a test
Shutdown
Wait for bye
Bytes received: 8
->Bye!

Server output (Linux):

Connect from: 0xffffffff.0x80000008
Bytes received: 14
Bytes sent: 14
Peer closed
Bye Bytes sent: 8

Running the server on Windows and the client on Linux we see: Client (Linux):

Connect to: 0x00000002.0x3049197c
Bytes Sent: 14
Bytes received: 14
->this is a test
Shutdown
Wait for bye
Connection closed

and on the server (windows):

Connect from: 599fed98-901f-4146-9467-6caf94c824a0:00000403-facb-11e6-bd58-64006a7986d3
Bytes received: 14
Bytes sent: 14
Peer closed
Bye Bytes sent: 8

Note that sometimes the Linux client does get the bye message but sometimes it doesn't. So we see the same timing issue we see in the go test program.

rn commented 6 years ago

I also tried to put a syscall.FlushFileBuffers(v.fd) before calling syscall.close() and it marginally improved thinks in that the client now sometimes seems to receive some more data, ie we now occasionally hit: Unexpected EOF after reading 4088 of 8192 bytes...

rn commented 6 years ago

I also tried turning the WSASend() call into a non-overlapping call by syscall.WSASend(v.fd, &b, 1, &bytes, f, nil, nil) (see here) but this has the same issue ie not all data sent by Windows is received by the client if close() is called on the windows host