Closed · twmb closed this issue 6 years ago
https://golang.org/pkg/net/#TCPConn.File says: “It is the caller's responsibility to close f when finished. Closing c does not affect f, and closing f does not affect c.”
Is the File call paired with a Close call in the real code you've observed the leak in? If so, a more realistic reproducer would be helpful. Otherwise, it's unfortunate that the Close requirement is only enforced by documentation, but there's not much we can do about it in the short term.
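For reference, here is a minimal sketch of the pattern the documentation asks for (my own illustration, not code from this issue): the *os.File returned by (*net.TCPConn).File() is an independent handle that the caller must close, separately from the connection itself.

package main

import (
	"fmt"
	"net"
)

func handle(c net.Conn) error {
	defer c.Close() // closing c does not affect f

	tc, ok := c.(*net.TCPConn)
	if !ok {
		return nil
	}
	f, err := tc.File() // f is a dup'd handle onto the same socket
	if err != nil {
		return err
	}
	defer f.Close() // closing f does not affect c; the caller must do it

	// ... use f.Fd() for whatever required the raw descriptor ...
	fmt.Println("fd:", f.Fd())
	return nil
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go net.Dial("tcp", ln.Addr().String())
	c, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	if err := handle(c); err != nil {
		panic(err)
	}
}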
The File code internally calls into os's NewFile, which sets a finalizer on the fd itself to close it when it is garbage collected. However, the tc.File() line in the code above can be changed to
f, err := tc.File() // <-- this is bad!
chk(err)
go func() {
	chk(f.Close())
}()
to confirm that failing to close the file is not the cause. For convenience, the following play link (which has the code) also reproduces the issue: https://play.golang.org/p/tnNDto5l8vf
You do realize you are calling Do(req) twice and you never close those responses.
It was a typo from repeatedly modifying the same source to get a "minimal" reproducer. Here is an updated file: https://play.golang.org/p/v2RRGMZJAY3. This fixed playground code closes the responses in a goroutine, the same way the prior code did.
I welcome pointers to any other mistakes I have made, but this is a minimal reproducer for a problem I have been trying to track down on and off for 2.5 years now. I have suspicions that the epoll implementation itself is a bit off, based on other stuck goroutines I am seeing (a process that has been up for 140 days has many goroutines stuck in runtime_pollWait), but I will try to create a minimal reproducer for that separately.
The problem is that the socket was added to epoll with EPOLLET (edge triggered), an assumption that no longer holds once the fd has been switched to blocking mode.
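For illustration only (this is not the runtime's actual code), the sketch below shows roughly what an edge-triggered epoll registration looks like via golang.org/x/sys/unix. The edge-triggered contract only works while the fd stays non-blocking and the reader drains until EAGAIN; a blocking read on such an fd can park forever.

package main

import (
	"fmt"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	// An epoll instance of our own, just to show the registration flags.
	epfd, err := unix.EpollCreate1(unix.EPOLL_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(epfd)

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	raw, err := ln.(*net.TCPListener).SyscallConn()
	if err != nil {
		panic(err)
	}
	err = raw.Control(func(fd uintptr) {
		// Edge-triggered: epoll reports readiness only on transitions, so the
		// fd must stay non-blocking and be read until EAGAIN after each wakeup.
		ev := unix.EpollEvent{
			Events: unix.EPOLLIN | unix.EPOLLRDHUP | unix.EPOLLET,
			Fd:     int32(fd),
		}
		if err := unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, int(fd), &ev); err != nil {
			panic(err)
		}
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("listener fd registered edge-triggered")
}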
If you use a sync.WaitGroup instead of a spin-loop counter (https://github.com/golang/go/issues/10958) and don't hard-code the port, you can make the program runnable in the Playground.
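As a minimal sketch of that substitution (the local httptest server and request loop here are illustrative stand-ins, not the reproducer's actual code):

package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

func main() {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	}))
	defer srv.Close()

	// Instead of bumping a shared counter in each goroutine and spinning until
	// it reaches N (busy loops behave badly in the Playground, see issue 10958),
	// let a sync.WaitGroup do the waiting.
	var wg sync.WaitGroup
	for n := 0; n < 100; n++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(srv.URL) // no hard-coded port: httptest picks one
			if err != nil {
				return
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}()
	}
	wg.Wait() // blocks until every request goroutine has finished
}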
After removing the duplicated Do call (well-spotted, @fraenkel!), I'm not seeing a leak:
https://play.golang.org/p/kWWS6-gOxw9
Does that program produce a different result on your machine?
If I remove the runtime.GC() calls, I do see a short-term goroutine leak, but it's the one described in https://github.com/golang/go/issues/21597, not the one reported above:
https://play.golang.org/p/WWD7CjCSFzS
Yes, I do still see the leak. Part of the reason for the code being structured so tightly was to attempt to force the races. I ran your code; nothing leaked for one run at 1,000 connections, but leaks did occur at 10,000.
goroutine profile: total 330
163 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
# 0x428c06 internal/poll.runtime_pollWait+0x56 /home/twmb/go/go/src/runtime/netpoll.go:173
# 0x48dcda internal/poll.(*pollDesc).wait+0x9a /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
# 0x48dd5c internal/poll.(*pollDesc).waitRead+0x3c /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
# 0x48ebbc internal/poll.(*FD).Read+0x17c /home/twmb/go/go/src/internal/poll/fd_unix.go:157
# 0x4f1b5e net.(*netFD).Read+0x4e /home/twmb/go/go/src/net/fd_unix.go:202
# 0x502c19 net.(*conn).Read+0x69 /home/twmb/go/go/src/net/net.go:176
# 0x65c517 net/http.(*connReader).Read+0xf7 /home/twmb/go/go/src/net/http/server.go:764
# 0x561aad bufio.(*Reader).fill+0x11d /home/twmb/go/go/src/bufio/bufio.go:100
# 0x56278b bufio.(*Reader).ReadSlice+0x2b /home/twmb/go/go/src/bufio/bufio.go:341
# 0x5629a3 bufio.(*Reader).ReadLine+0x33 /home/twmb/go/go/src/bufio/bufio.go:370
# 0x5ecadf net/textproto.(*Reader).readLineSlice+0x6f /home/twmb/go/go/src/net/textproto/reader.go:55
# 0x5ec8ea net/textproto.(*Reader).ReadLine+0x2a /home/twmb/go/go/src/net/textproto/reader.go:36
# 0x656ebb net/http.readRequest+0x8b /home/twmb/go/go/src/net/http/request.go:929
# 0x65d7be net/http.(*conn).readRequest+0x16e /home/twmb/go/go/src/net/http/server.go:944
# 0x6614bb net/http.(*conn).serve+0x4db /home/twmb/go/go/src/net/http/server.go:1768
163 @ 0x479ef5 0x477e6f 0x4767b9 0x48eb58 0x4f1b5f 0x502c1a 0x673db6 0x561aae 0x561bfa 0x674875 0x45b211
# 0x479ef4 syscall.Syscall+0x4 /home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
# 0x477e6e syscall.read+0x5e /home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
# 0x4767b8 syscall.Read+0x48 /home/twmb/go/go/src/syscall/syscall_unix.go:162
# 0x48eb57 internal/poll.(*FD).Read+0x117 /home/twmb/go/go/src/internal/poll/fd_unix.go:153
# 0x4f1b5e net.(*netFD).Read+0x4e /home/twmb/go/go/src/net/fd_unix.go:202
# 0x502c19 net.(*conn).Read+0x69 /home/twmb/go/go/src/net/net.go:176
# 0x673db5 net/http.(*persistConn).Read+0x135 /home/twmb/go/go/src/net/http/transport.go:1453
# 0x561aad bufio.(*Reader).fill+0x11d /home/twmb/go/go/src/bufio/bufio.go:100
# 0x561bf9 bufio.(*Reader).Peek+0x39 /home/twmb/go/go/src/bufio/bufio.go:132
# 0x674874 net/http.(*persistConn).readLoop+0x184 /home/twmb/go/go/src/net/http/transport.go:1601
Here is my current diff on your code (with the GC):
$ diff three.go his.go
47c47
< for n := 10000; n > 0; n-- {
---
> for n := 1000; n > 0; n-- {
74,75d73
< fmt.Println(l.Addr().String())
<
82d79
< select {}
What is most interesting to me is that re-running the code (with 10k connections) a few times in a row without the tc.File call occasionally causes the leak I've been trying to build a reproducer for as well (edit: this one may just be from KeepAlives):
goroutine profile: total 12
6 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
# 0x428c06 internal/poll.runtime_pollWait+0x56 /home/twmb/go/go/src/runtime/netpoll.go:173
# 0x48dcda internal/poll.(*pollDesc).wait+0x9a /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
# 0x48dd5c internal/poll.(*pollDesc).waitRead+0x3c /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
# 0x48ebbc internal/poll.(*FD).Read+0x17c /home/twmb/go/go/src/internal/poll/fd_unix.go:157
# 0x4f1b5e net.(*netFD).Read+0x4e /home/twmb/go/go/src/net/fd_unix.go:202
# 0x502c19 net.(*conn).Read+0x69 /home/twmb/go/go/src/net/net.go:176
# 0x65c517 net/http.(*connReader).Read+0xf7 /home/twmb/go/go/src/net/http/server.go:764
# 0x561aad bufio.(*Reader).fill+0x11d /home/twmb/go/go/src/bufio/bufio.go:100
# 0x56278b bufio.(*Reader).ReadSlice+0x2b /home/twmb/go/go/src/bufio/bufio.go:341
# 0x5629a3 bufio.(*Reader).ReadLine+0x33 /home/twmb/go/go/src/bufio/bufio.go:370
# 0x5ecadf net/textproto.(*Reader).readLineSlice+0x6f /home/twmb/go/go/src/net/textproto/reader.go:55
# 0x5ec8ea net/textproto.(*Reader).ReadLine+0x2a /home/twmb/go/go/src/net/textproto/reader.go:36
# 0x656ebb net/http.readRequest+0x8b /home/twmb/go/go/src/net/http/request.go:929
# 0x65d7be net/http.(*conn).readRequest+0x16e /home/twmb/go/go/src/net/http/server.go:944
# 0x6614bb net/http.(*conn).serve+0x4db /home/twmb/go/go/src/net/http/server.go:1768
It is much more subtle, but much more real; let me know if I should open a separate issue for that or add more details here.
I recommend running the code with 10k connections and maybe on worse hardware.
The only "leak" I see is with
net/http.(*persistConn).readLoop(0xc422be0c60)
/snap/go/1473/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
/snap/go/1473/src/net/http/transport.go:1237 +0x95a
The problem there is that it will never break out of the read, since the fd is blocking and all deadlines are useless.
I do not see any other leaks.
I do see the persistConn.readLoop stacks when the goroutines are first dumped, but if I wait and dump again, they are gone. The dialConn ones are still present.
I have just re-run this by copying the code out of the play link in https://github.com/golang/go/issues/24455#issuecomment-375019236 and applying my bump-to-10k and hang-with-select mods.
I ran this and then waited 5 minutes before checking the goroutines. At 5 minutes:
1020 @ 0x4795a5 0x4774ff 0x475e49 0x48e0e9 0x4fed2f 0x5118c8 0x67e348 0x56d1a5 0x56d2fa 0x67eea8 0x45ad31
# 0x4795a4 syscall.Syscall+0x4 /home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
# 0x4774fe syscall.read+0x5e /home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
# 0x475e48 syscall.Read+0x48 /home/twmb/go/go/src/syscall/syscall_unix.go:172
# 0x48e0e8 internal/poll.(*FD).Read+0x118 /home/twmb/go/go/src/internal/poll/fd_unix.go:159
# 0x4fed2e net.(*netFD).Read+0x4e /home/twmb/go/go/src/net/fd_unix.go:202
# 0x5118c7 net.(*conn).Read+0x67 /home/twmb/go/go/src/net/net.go:176
# 0x67e347 net/http.(*persistConn).Read+0x77 /home/twmb/go/go/src/net/http/transport.go:1453
# 0x56d1a4 bufio.(*Reader).fill+0x104 /home/twmb/go/go/src/bufio/bufio.go:100
# 0x56d2f9 bufio.(*Reader).Peek+0x39 /home/twmb/go/go/src/bufio/bufio.go:132
# 0x67eea7 net/http.(*persistConn).readLoop+0x187 /home/twmb/go/go/src/net/http/transport.go:1601
988 @ 0x42f085 0x42a4ec 0x429b86 0x48d23a 0x48d2bd 0x48e149 0x4fed2f 0x5118c8 0x666d7c 0x56d1a5 0x56de4c 0x56e064 0x5f7f50 0x5f7d4b 0x6616fc 0x668012 0x66bc1a 0x45ad31
# 0x429b85 internal/poll.runtime_pollWait+0x65 /home/twmb/go/go/src/runtime/netpoll.go:173
# 0x48d239 internal/poll.(*pollDesc).wait+0x99 /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
# 0x48d2bc internal/poll.(*pollDesc).waitRead+0x3c /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
# 0x48e148 internal/poll.(*FD).Read+0x178 /home/twmb/go/go/src/internal/poll/fd_unix.go:163
# 0x4fed2e net.(*netFD).Read+0x4e /home/twmb/go/go/src/net/fd_unix.go:202
# 0x5118c7 net.(*conn).Read+0x67 /home/twmb/go/go/src/net/net.go:176
# 0x666d7b net/http.(*connReader).Read+0xfb /home/twmb/go/go/src/net/http/server.go:765
# 0x56d1a4 bufio.(*Reader).fill+0x104 /home/twmb/go/go/src/bufio/bufio.go:100
# 0x56de4b bufio.(*Reader).ReadSlice+0x2b /home/twmb/go/go/src/bufio/bufio.go:341
# 0x56e063 bufio.(*Reader).ReadLine+0x33 /home/twmb/go/go/src/bufio/bufio.go:370
# 0x5f7f4f net/textproto.(*Reader).readLineSlice+0x6f /home/twmb/go/go/src/net/textproto/reader.go:55
# 0x5f7d4a net/textproto.(*Reader).ReadLine+0x2a /home/twmb/go/go/src/net/textproto/reader.go:36
# 0x6616fb net/http.readRequest+0x8b /home/twmb/go/go/src/net/http/request.go:929
# 0x668011 net/http.(*conn).readRequest+0x161 /home/twmb/go/go/src/net/http/server.go:945
# 0x66bc19 net/http.(*conn).serve+0x4b9 /home/twmb/go/go/src/net/http/server.go:1769
I would not be surprised if your hardware is better than mine and things simply do not go wrong there. What more can I do in this situation?
I can tell you that I ran strace with and without the File() call to see what the underlying difference was. I only needed fewer than 5 connections to show that the normal path would get a read failure with EAGAIN, which did not (and could not) occur once the connection was blocking.
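To see that same difference without strace, here is a hedged sketch (my own illustration, not code from the issue) that checks the O_NONBLOCK flag on the accepted socket before and after calling File(), using golang.org/x/sys/unix on Linux:

package main

import (
	"fmt"
	"net"
	"time"

	"golang.org/x/sys/unix"
)

// isNonblocking reports whether O_NONBLOCK is set on the connection's fd
// (an illustrative helper, Linux-only).
func isNonblocking(c *net.TCPConn) bool {
	raw, err := c.SyscallConn()
	if err != nil {
		panic(err)
	}
	var nonblock bool
	if err := raw.Control(func(fd uintptr) {
		flags, err := unix.FcntlInt(fd, unix.F_GETFL, 0)
		if err != nil {
			panic(err)
		}
		nonblock = flags&unix.O_NONBLOCK != 0
	}); err != nil {
		panic(err)
	}
	return nonblock
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()
	go func() {
		conn, err := net.Dial("tcp", ln.Addr().String())
		if err != nil {
			return
		}
		defer conn.Close()
		time.Sleep(time.Second) // keep the connection open for the check
	}()

	c, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	tc := c.(*net.TCPConn)

	fmt.Println("non-blocking before File():", isNonblocking(tc))
	f, err := tc.File()
	if err != nil {
		panic(err)
	}
	defer f.Close()
	// On Go releases before the #24942 fix, File() switched the original fd to
	// blocking mode, so the second print reports false; on current Go the fd
	// stays non-blocking.
	fmt.Println("non-blocking after File(): ", isNonblocking(tc))
}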
@fraenkel Do you have a fix in mind for the dialConn / readLoop hang, or does this need further investigation?
Potentially related to https://github.com/golang/go/issues/24483, but I have not seen the same race trace while repeatedly running this.
Also potentially related (outside of the File call in particular):
https://groups.google.com/forum/#!topic/golang-nuts/zqsC6xcnP24
https://github.com/golang/go/issues/16060
https://github.com/go-sql-driver/mysql/issues/285
https://github.com/jlaffaye/ftp/issues/112
@bcmills It would need further investigation, but this is a slippery slope. The poller assumes edge-triggered mode is enabled. This is very similar to #24331, which is attempting to add a new method to prevent the flip.
Fixed by #24942 (but maybe the underlying issue still exists with nonblocking connections)
Please answer these questions before submitting your issue. Thanks!
What version of Go are you using (go version)?
(Also tested on 1.7.6, 1.10)
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
What did you do?
File 1, the producer: https://play.golang.org/p/eNMNVTGutd9
File 2 (separate to ensure that the goroutines do not show up): https://play.golang.org/p/Mz1h9F3963o
What did you expect to see?
No goroutines hung.
What did you see instead?
These goroutines do not go away.
If tc.File() is commented out in the first play link, the goroutines do go away. I suspect there is a race between connections being set to blocking mode and removed from epoll management, and a goroutine falling into epoll to wait for a read. The examples use an http server / client, but I suspect this issue would also be visible with raw connections.
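For readers without the playground links handy, here is a hedged sketch of the general shape of the reproducer described above (my own reconstruction; the fileListener wrapper, handler, and connection count are illustrative, not the original code): an HTTP server whose listener calls (*net.TCPConn).File() on every accepted connection, plus a client that opens many connections.

package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"sync"
)

// fileListener wraps a listener and calls File() on each accepted TCPConn,
// which (before the #24942 fix) switched the original fd to blocking mode.
type fileListener struct{ net.Listener }

func (l fileListener) Accept() (net.Conn, error) {
	c, err := l.Listener.Accept()
	if err != nil {
		return nil, err
	}
	if tc, ok := c.(*net.TCPConn); ok {
		f, err := tc.File() // the call under suspicion
		if err != nil {
			return nil, err
		}
		f.Close() // close the dup; per the discussion above, the leak shows either way
	}
	return c, nil
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok")
	})}
	go srv.Serve(fileListener{ln})

	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get("http://" + ln.Addr().String())
			if err != nil {
				return
			}
			io.Copy(io.Discard, resp.Body)
			resp.Body.Close()
		}()
	}
	wg.Wait()
	fmt.Println("done; dump goroutines now and look for stuck readers")
}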