golang / go

The Go programming language
https://go.dev

net: read from TCP socket under load loses data on darwin/amd64 with read deadline #70395

Open glycerine opened 4 hours ago

glycerine commented 4 hours ago

Go version

go 1.23.3 on darwin/amd64

Output of go env in your module/workspace:

$ go env
GO111MODULE='auto'
GOARCH='amd64'
GOBIN=''
GOCACHE='/Users/jaten/Library/Caches/go-build'
GOENV='/Users/jaten/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/jaten/go/pkg/mod'
GONOPROXY='github.com/glycerine/*'
GONOSUMDB='github.com/glycerine/*'
GOOS='darwin'
GOPATH='/Users/jaten/go'
GOPRIVATE='github.com/glycerine/*'
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/darwin_amd64'
GOVCS=''
GOVERSION='go1.23.3'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/jaten/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD=''
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/b6/q5ms8t4x5cjcpcpyjrb1h91c0000gn/T/go-build2671965393=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

On macOS Sonoma 14.0, amd64 (Intel) architecture, I observe occasional TCP data loss when reading from a net.Conn with a read deadline.

I recorded packets and can see their delivery to the client (at the OS level) in Wireshark, but the Go client would not receive that data. The expected data was the response to an RPC call, so my client would, at times, time out waiting for the call response. This caused me to investigate. I've attempted to boil it down to the minimal reproducer shown below. The loss was only observable under load; I needed to run at least 12 clients on my 4-core Mac to start getting losses. A larger value of GOMAXPROCS (say 10 * cores) makes it happen faster.
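For reference, assuming the program below is saved as darwin_lossy_tcp_read_timeout.go (the filename that appears in the stack traces), one way to run it with an elevated GOMAXPROCS is:

    $ GOMAXPROCS=40 go run darwin_lossy_tcp_read_timeout.go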

If need be, I can also provide a reproduction in the original RPC library at a particular historical commit, but the commands to reproduce and the interpretation are more complex. I'm hoping the program below will suffice; let me know if it does not.

I did not observe the same behavior on Linux, and I did not detect it under quic-go, so it seems to be specific to macOS and TCP sockets.

package main

import (
    "encoding/binary"
    "fmt"
    "net"
    "runtime"
    "strings"
    "time"
)

func service(conn net.Conn) {

    buff := make([]byte, 364)
    var i uint64
    var by [8]byte

    for i = 0; i < 1e8; i++ {

        by = toBytes(i)
        copy(buff[:8], by[:])

        // send sequential numbers to client, starting at 0.
        if _, err := conn.Write(buff); err != nil {
            panic(err)
        }
        runtime.Gosched()
    }
}

func listener(ln net.Listener) {
    for {
        c, err := ln.Accept()
        if err != nil {
            panic(err)
        }
        go service(c)
    }
}

func toBytes(n uint64) (r [8]byte) {
    binary.BigEndian.PutUint64(r[:], n)
    return
}

func fromBytes(by []byte) uint64 {
    return binary.BigEndian.Uint64(by)
}

func client(id int) {
    conn, err := net.Dial("tcp", "127.0.0.1:5555")
    if err != nil {
        panic(err)
    }

    buff := make([]byte, 364)
    var i uint64
    timeout := time.Millisecond

    t0 := time.Now()
    defer func() {
        fmt.Printf("elapsed since starting: %v\n", time.Since(t0))
    }()
    for i = 0; i < 1e8; i++ {

        // The repeated-read loop. Here we read with timeouts,
        // as if we also needed to periodically check for other
        // events, e.g. shutdown, pause, etc.
        //
        // When we do so, we observe data loss under load.
        // Originally the library that revealed this issue used
        // 100 msec timeouts and 50 separate client processes.
        // In this reproducer, where everything runs in one process
        // for ease of repro, we cranked the timeout down to 1 msec
        // to make the problem happen faster and more reliably.
        //
        // On darwin, at go1.23.3, the read timeouts occasionally
        // cause us to lose TCP data: roughly 1 in every 500K (maybe)
        // reads when under load. We could see the TCP data packets being
        // delivered to the client in Wireshark, but the Go client would
        // never get delivery of the response to its RPC call.
        // macOS Sonoma 14.0 on darwin/amd64.
        //
        // I was expecting to see a skipped number rather than zero (to match
        // the original library's experience of never getting an RPC
        // reply), but in this compressed reproducer, getting a zero
        // instead of the next expected number still qualifies as a bug.
        // This may be as close as we can easily get with this little
        // code (the original library is much bigger).
        for {
            err := readFull(conn, buff, &timeout)
            if err != nil {
                //fmt.Printf("err = '%v'\n", err)
                r := err.Error()
                if strings.Contains(r, "i/o timeout") || strings.Contains(r, "deadline exceeded") {
                    continue // normal, expected, timeout
                }
                panic(err)
            }
            // INVAR: err == nil, so we can exit the repeated-read loop.
            break
        }
        j := fromBytes(buff[:8])
        if i != j {
            panic(fmt.Sprintf("expected next number: %v, but we got %v", i, j))
        }
        if i%100000 == 0 {
            fmt.Printf("at i = %v\n", i)
        }
    }
}

var zeroTime = time.Time{}

// readFull reads exactly len(buf) bytes from conn, applying the optional read deadline before reading.
func readFull(conn net.Conn, buf []byte, timeout *time.Duration) error {

    if timeout != nil && *timeout > 0 {
        conn.SetReadDeadline(time.Now().Add(*timeout))
    } else {
        // do not let previous deadlines contaminate this one.
        conn.SetReadDeadline(zeroTime)
    }

    need := len(buf)
    total := 0
    for total < len(buf) {
        n, err := conn.Read(buf[total:])
        total += n
        if total == need {
            // we have the full message; any error alongside it (probably just EOF) is surfaced via panic.
            if err != nil {
                panic(err)
            }
            return nil
        }
        if err != nil {
            return err
        }
    }
    return nil
}

func startClients() {
    for i := 0; i < 50; i++ {
        go client(i)

        time.Sleep(time.Millisecond * 2)
    }
}

func main() {
    ln, err := net.Listen("tcp", "127.0.0.1:5555")
    if err != nil {
        panic(err)
    }
    go listener(ln)

    startClients()
    select {}
}

/*
sample output 1
...
at i = 200000
elapsed since starting: 15.691497196s
panic: expected next number: 203963, but we got 0

goroutine 111 [running]:
main.client(0x0?)
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:98 +0x2fa
created by main.startClients in goroutine 1
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:139 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:28:33

sample output 2
...
at i = 400000
elapsed since starting: 35.007915405s
panic: expected next number: 420721, but we got 0

goroutine 68 [running]:
main.client(0x0?)
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:98 +0x2fa
created by main.startClients in goroutine 1
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:139 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:29:57

sample output 3

at i = 100000
elapsed since starting: 12.453088565s
panic: expected next number: 148173, but we got 0

goroutine 125 [running]:
main.client(0x0?)
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:105 +0x2fa
created by main.startClients in goroutine 1
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:146 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:37:24

sample output 4

...
at i = 800000
elapsed since starting: 1m10.034763283s
panic: expected next number: 829960, but we got 0

goroutine 130 [running]:
main.client(0x0?)
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:112 +0x2fa
created by main.startClients in goroutine 1
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:153 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:41:49

*/
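An aside on the timeout check in the client's repeated-read loop above: instead of matching on the error string, the same classification could be done with errors.Is and errors.As. A minimal sketch, with isTimeout as a hypothetical helper (requires "errors", "net", and "os" in the import block):

// isTimeout reports whether err came from an expired read deadline,
// without matching on the error string. Since Go 1.15 the net package's
// deadline errors satisfy errors.Is(err, os.ErrDeadlineExceeded);
// net.Error.Timeout() covers other timeout-flavored errors.
func isTimeout(err error) bool {
    if errors.Is(err, os.ErrDeadlineExceeded) {
        return true
    }
    var nerr net.Error
    return errors.As(err, &nerr) && nerr.Timeout()
}

With that helper, the check in the client loop would simply be: if isTimeout(err) { continue }.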

What did you see happen?

See above. It takes a variable amount of time to manifest: sometimes 5 seconds, sometimes over a minute. Sample output 4 from above:

...
at i = 800000
elapsed since starting: 1m10.034763283s
panic: expected next number: 829960, but we got 0

goroutine 130 [running]:
main.client(0x0?)
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:112 +0x2fa
created by main.startClients in goroutine 1
    /Users/jaten/trash/darwin_lossy_tcp_read_timeout.go:153 +0x3d
exit status 2

Compilation exited abnormally with code 1 at Sat Nov 16 21:41:49

What did you expect to see?

A continuous read of the expected sequential integers (in the first 8 bytes of each message), returned without error, and never a 0.


glycerine commented 3 hours ago

I'll record notes on how to reproduce with my RPC library, rpc25519, while they are still fresh in mind, in case the above small reproducer does not suffice.

repo: https://github.com/glycerine/rpc25519

Check out tag v1.1.42-alpha-repro-issue-70395 (this should give a branch called repro-70395).

steps:

  1. cd to the rpc25519 directory. We will skip using the Makefile so you don't need to install the greenpack serialization library.

  2. mkdir -p certs; go build -o cli ./cmd/cli; go build -o srv ./cmd/srv

  3. In first terminal, run the server:

    $ srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999
  4. In 2nd terminal, run 12 clients:

    $ for i in `seq 1 12`; do  cli -tcp -quiet -n 600000  -s 127.0.0.1:9999  & sleep 0.001; done
  5. Watch the first terminal for the first number after the a=c( part to go negative. When it does, it indicates that the client timed out after 10 seconds waiting for an RPC response. You can press ctrl-c to kill the server, or watch as other clients time out. The negative numbers are simply the negative of the count of successful RPC calls a client made before we saw a hang and then a 10 second timeout.

The number outside the square brackets gives the number of successful RPC calls before the hang and client death. The number inside the square brackets is the goroutine number for correlation with stack traces. They are sorted in ascending order by call count, so the negatives (starved out clients) will be first.

sample output

jaten@Js-MacBook-Pro ~/go/src/github.com/glycerine/rpc25519 ((HEAD detached at ae7b060)) $ srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999
2024/11/17 04:22:16 server.go:101: rpc25519.server Start() returned serverAddr = '127.0.0.1:9999'

srv.go:343 2024-11-16 22:22:19.438 -0600 CST sendLoopGoroNum = [35] for pairID = '1'
(2.983257461s since start). count of jobs done by each client:
 a=c(1[38])

srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [37] for pairID = '2'

srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [39] for pairID = '3'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [22] for pairID = '4'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [28] for pairID = '7'

srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [24] for pairID = '5'

srv.go:343 2024-11-16 22:22:19.442 -0600 CST sendLoopGoroNum = [30] for pairID = '8'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [32] for pairID = '9'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [26] for pairID = '6'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [100] for pairID = '11'

srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [98] for pairID = '10'

srv.go:343 2024-11-16 22:22:19.445 -0600 CST sendLoopGoroNum = [102] for pairID = '12'
(5.051670882s since start). count of jobs done by each client:
 a=c(3881[99], 7739[103], 7870[36], 8233[25], 8484[38], 8664[23], 8801[101], 9044[97], 9056[27], 9240[40], 9344[29], 9645[31])

(7.154977927s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 16697[25], 17487[38], 17538[103], 17988[23], 18005[31], 18132[101], 18379[97], 19029[40], 19790[29], 20127[27])

(9.196302984s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 26494[38], 27300[97], 27437[25], 28015[103], 28275[29], 28394[23], 28414[31], 28659[101], 29820[40], 30364[27])

(11.187412775s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 37285[31], 37428[38], 37494[29], 37679[23], 37841[97], 38111[103], 38477[25], 38601[101], 39869[27], 40387[40])

(13.229637568s since start). count of jobs done by each client:
 a=c(3881[99], 12948[36], 46382[31], 46715[23], 47350[97], 47594[29], 48305[25], 48696[38], 48785[103], 48968[101], 49757[27], 50620[40])

srv.go:454 2024-11-16 22:22:30.401 -0600 CST server sees io.EOF from receiveMessage
(15.226440599s since start). count of jobs done by each client:
 a=c(-3881[99], 12948[36], 56835[31], 57258[97], 57686[29], 57710[103], 57816[25], 57869[38], 58334[101], 58476[23], 59752[27], 61436[40])

srv.go:454 2024-11-16 22:22:32.774 -0600 CST server sees io.EOF from receiveMessage
(17.198700926s since start). count of jobs done by each client:
 a=c(-12948[36], -3881[99], 59320[103], 67595[23], 68176[31], 68258[97], 68926[38], 69577[29], 69616[101], 69854[25], 70306[27], 71544[40])

In the last log line above, two clients (serviced by server goroutines 36 and 99) have timed out so far. One client successfully completed 12948 roundtrips before hanging. The other client got in only 3881 roundtrips before hanging.

For others encountering this issue: my current workaround is to simply never set a timeout on a read. To terminate a blocked read, close the connection from another goroutine. Sub-optimal, but it works and alleviates the starvation of clients.
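A minimal sketch of that workaround, using assumed names rather than the actual rpc25519 code: read with no deadline at all, and have a single per-connection goroutine close the connection on shutdown, which unblocks any Read waiting on the socket (after Close, the Read returns an error satisfying errors.Is(err, net.ErrClosed) on Go 1.16+):

// closeOnShutdown arranges for conn to be closed when done is closed.
// Closing the connection is what wakes up a conn.Read that is blocked
// with no deadline. Hypothetical sketch, not the rpc25519 implementation.
func closeOnShutdown(conn net.Conn, done <-chan struct{}) {
    go func() {
        <-done
        conn.Close()
    }()
}

// readFullNoDeadline reads exactly len(buf) bytes with no read deadline.
// When the shutdown goroutine closes conn, the blocked Read returns an
// error (errors.Is(err, net.ErrClosed)), which the caller treats as a
// clean termination rather than a data error.
func readFullNoDeadline(conn net.Conn, buf []byte) error {
    total := 0
    for total < len(buf) {
        n, err := conn.Read(buf[total:])
        total += n
        if err != nil {
            return err
        }
    }
    return nil
}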