(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)
I'll record notes on how to reproduce with my RPC library, rpc25519, while they are still fresh in my mind, in case the above small reproducer does not suffice.
repo: https://github.com/glycerine/rpc25519
check out tag v1.1.42-alpha-repro-issue-70395 (this should give a branch called repro-70395)
steps:
cd to the rpc25519 directory. We will skip using the Makefile so you don't need to install the greenpack serialization library.
mkdir -p certs; go build -o cli ./cmd/cli; go build -o srv ./cmd/srv
In the first terminal, run the server:
$ ./srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999
In a second terminal, run 12 clients:
$ for i in `seq 1 12`; do ./cli -tcp -quiet -n 600000 -s 127.0.0.1:9999 & sleep 0.001; done
Watch the first terminal for the first number after the a=c( part to go negative. When it does, it indicates that the client timed out after 10 seconds waiting for an RPC response. You can press ctrl-c to kill the server, or watch as other clients time out. The negative numbers are simply the negative of the count of successful RPC calls made before we saw a hang and then the 10-second timeout.
The number outside the square brackets gives the number of successful RPC calls before the hang and client death. The number inside the square brackets is the goroutine number, for correlation with stack traces. The entries are sorted in ascending order by call count, so the negatives (starved-out clients) appear first.
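If it helps with the correlation, here is a small hypothetical helper (not part of the repo) that pulls the count/goroutine pairs out of one of these stats lines:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// pairRE matches one "count[goroutine]" entry, e.g. "-3881[99]".
var pairRE = regexp.MustCompile(`(-?\d+)\[(\d+)\]`)

// parseCounts extracts the call counts from a stats line such as
// "a=c(-3881[99], 12948[36], ...)", keyed by goroutine number.
// Negative counts mark clients that hung and then hit the timeout.
func parseCounts(line string) map[int]int {
	counts := make(map[int]int)
	for _, m := range pairRE.FindAllStringSubmatch(line, -1) {
		n, _ := strconv.Atoi(m[1])
		goro, _ := strconv.Atoi(m[2])
		counts[goro] = n
	}
	return counts
}

func main() {
	fmt.Println(parseCounts("a=c(-3881[99], 12948[36])"))
	// map[36:12948 99:-3881]
}
```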
sample output:
jaten@Js-MacBook-Pro ~/go/src/github.com/glycerine/rpc25519 ((HEAD detached at ae7b060)) $ srv -quiet -stats 100000 -tcp -max 40 -s 127.0.0.1:9999
2024/11/17 04:22:16 server.go:101: rpc25519.server Start() returned serverAddr = '127.0.0.1:9999'
srv.go:343 2024-11-16 22:22:19.438 -0600 CST sendLoopGoroNum = [35] for pairID = '1'
(2.983257461s since start). count of jobs done by each client:
a=c(1[38])
srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [37] for pairID = '2'
srv.go:343 2024-11-16 22:22:19.439 -0600 CST sendLoopGoroNum = [39] for pairID = '3'
srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [22] for pairID = '4'
srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [28] for pairID = '7'
srv.go:343 2024-11-16 22:22:19.441 -0600 CST sendLoopGoroNum = [24] for pairID = '5'
srv.go:343 2024-11-16 22:22:19.442 -0600 CST sendLoopGoroNum = [30] for pairID = '8'
srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [32] for pairID = '9'
srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [26] for pairID = '6'
srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [100] for pairID = '11'
srv.go:343 2024-11-16 22:22:19.443 -0600 CST sendLoopGoroNum = [98] for pairID = '10'
srv.go:343 2024-11-16 22:22:19.445 -0600 CST sendLoopGoroNum = [102] for pairID = '12'
(5.051670882s since start). count of jobs done by each client:
a=c(3881[99], 7739[103], 7870[36], 8233[25], 8484[38], 8664[23], 8801[101], 9044[97], 9056[27], 9240[40], 9344[29], 9645[31])
(7.154977927s since start). count of jobs done by each client:
a=c(3881[99], 12948[36], 16697[25], 17487[38], 17538[103], 17988[23], 18005[31], 18132[101], 18379[97], 19029[40], 19790[29], 20127[27])
(9.196302984s since start). count of jobs done by each client:
a=c(3881[99], 12948[36], 26494[38], 27300[97], 27437[25], 28015[103], 28275[29], 28394[23], 28414[31], 28659[101], 29820[40], 30364[27])
(11.187412775s since start). count of jobs done by each client:
a=c(3881[99], 12948[36], 37285[31], 37428[38], 37494[29], 37679[23], 37841[97], 38111[103], 38477[25], 38601[101], 39869[27], 40387[40])
(13.229637568s since start). count of jobs done by each client:
a=c(3881[99], 12948[36], 46382[31], 46715[23], 47350[97], 47594[29], 48305[25], 48696[38], 48785[103], 48968[101], 49757[27], 50620[40])
srv.go:454 2024-11-16 22:22:30.401 -0600 CST server sees io.EOF from receiveMessage
(15.226440599s since start). count of jobs done by each client:
a=c(-3881[99], 12948[36], 56835[31], 57258[97], 57686[29], 57710[103], 57816[25], 57869[38], 58334[101], 58476[23], 59752[27], 61436[40])
srv.go:454 2024-11-16 22:22:32.774 -0600 CST server sees io.EOF from receiveMessage
(17.198700926s since start). count of jobs done by each client:
a=c(-12948[36], -3881[99], 59320[103], 67595[23], 68176[31], 68258[97], 68926[38], 69577[29], 69616[101], 69854[25], 70306[27], 71544[40])
In the last log line above, two clients (serviced by server goroutines 36 and 99) have timed out so far. One client successfully completed 12948 round trips before hanging; the other got in only 3881 round trips before hanging.
For others encountering this issue: my current workaround is to simply never use a timeout on a read, and instead close the connection from another goroutine to terminate the read. Sub-optimal, but it works and alleviates the starvation of clients.
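A minimal sketch of that workaround (the readWithWatchdog name and the io.ReadFull framing are my illustration here, not the library's actual code):

```go
package workaround

import (
	"io"
	"net"
	"time"
)

// readWithWatchdog reads len(buf) bytes without ever calling
// SetReadDeadline. A timer closes the connection if the read has not
// completed in time, which forces the blocked Read to return an error.
func readWithWatchdog(conn net.Conn, buf []byte, timeout time.Duration) (int, error) {
	timer := time.AfterFunc(timeout, func() {
		conn.Close() // unblocks the pending Read
	})
	defer timer.Stop()
	return io.ReadFull(conn, buf)
}
```

Closing is terminal, so the caller must reconnect after a timeout, but it avoids SetReadDeadline entirely.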
Go version
go 1.23.3 on darwin/amd64
Output of go env in your module/workspace:

What did you do?
On macOS Sonoma 14.0, amd64 (Intel) architecture, I observe occasional TCP data loss when reading from a net.Conn with a read deadline.
I recorded packets and can see their delivery to the client (at the OS level) in Wireshark, but the Go client would not receive that data. The expected data was the response to an RPC call, so my client would, at times, time out waiting for the call response. This caused me to investigate. I've attempted to boil it down to a minimal reproducer, shown below. This was only observable under load; I needed to run at least 12 clients on my 4-core Mac to start getting losses. A larger value of GOMAXPROCS (say 10 * cores) makes it happen faster.
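This is not the exact reproducer, but a sketch of the read pattern involved (the function name and io.ReadFull framing are mine):

```go
package repro

import (
	"io"
	"log"
	"net"
	"time"
)

// readLoop sketches the pattern in question: a TCP net.Conn with a
// read deadline armed before every blocking Read. Under load on macOS,
// the Read can return an i/o timeout even though Wireshark shows the
// response bytes were delivered to the host.
func readLoop(conn net.Conn) {
	buf := make([]byte, 8)
	for {
		if err := conn.SetReadDeadline(time.Now().Add(10 * time.Second)); err != nil {
			log.Fatalf("set deadline: %v", err)
		}
		if _, err := io.ReadFull(conn, buf); err != nil {
			log.Fatalf("read: %v", err) // fires spuriously when the bug hits
		}
		// ... decode and handle the 8-byte response here ...
	}
}
```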
If need be, I can also provide a reproduction in the original RPC library at a particular historical point, but the commands to reproduce and the interpretation are more complex. I'm hoping the below will suffice; let me know if it does not.
I did not observe the same behavior on Linux, and I did not detect it under quic-go, so it seems to be specific to macOS and TCP sockets.
What did you see happen?
See above. It takes a variable amount of time to manifest: sometimes 5 seconds, sometimes over a minute. Example output 4 from above:
What did you expect to see?
Continuous reads of the integers, never a 0 in the first 8 bytes, and no error returned.
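To restate that expectation as code, a hypothetical check (the 8-byte big-endian framing is my assumption about the wire format):

```go
package repro

import (
	"encoding/binary"
	"fmt"
	"io"
	"net"
)

// expectNonZero reads the next 8 bytes and decodes them as a
// big-endian integer. Expected behavior: the read always returns
// without error, and the decoded value is never 0.
func expectNonZero(conn net.Conn) (uint64, error) {
	var buf [8]byte
	if _, err := io.ReadFull(conn, buf[:]); err != nil {
		return 0, fmt.Errorf("read: %w", err) // spurious timeout symptom
	}
	v := binary.BigEndian.Uint64(buf[:])
	if v == 0 {
		return 0, fmt.Errorf("got 0 in the first 8 bytes") // data-loss symptom
	}
	return v, nil
}
```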