golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

net: goroutine leak when calling File on a TCPConn #24455

Closed: twmb closed this issue 6 years ago

twmb commented 6 years ago

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

$ go version
go version devel +2086f3503e Mon Mar 12 22:29:04 2018 +0000 linux/amd64

(Also tested on 1.7.6, 1.10)

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/twmb/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/twmb/go"
GORACE=""
GOROOT="/home/twmb/go/go"
GOTMPDIR=""
GOTOOLDIR="/home/twmb/go/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build789317015=/tmp/go-build -gno-record-gcc-switches"

What did you do?

File 1, the producer: https://play.golang.org/p/eNMNVTGutd9

File 2, the checker (run as a separate process so that its own goroutines do not show up in the profile): https://play.golang.org/p/Mz1h9F3963o
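
In case the play links expire, the producer's rough shape is sketched below. This is a reconstruction, not the playground code itself: the port, the connection count, and the exact placement of the tc.File() call are illustrative.

    package main

    import (
        "io"
        "io/ioutil"
        "net"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof on the default mux
        "runtime"
        "sync"
    )

    func chk(err error) {
        if err != nil {
            panic(err)
        }
    }

    func main() {
        // pprof endpoint, queried separately (file 2's role in the original).
        go http.ListenAndServe("localhost:9090", nil)

        l, err := net.Listen("tcp", "localhost:0")
        chk(err)
        go http.Serve(l, http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {}))

        // A transport whose dialer calls File() on every new connection.
        tr := &http.Transport{
            Dial: func(network, addr string) (net.Conn, error) {
                c, err := net.Dial(network, addr)
                if err != nil {
                    return nil, err
                }
                f, err := c.(*net.TCPConn).File() // the call under suspicion
                if err != nil {
                    return nil, err
                }
                go func() { chk(f.Close()) }()
                return c, nil
            },
        }
        client := &http.Client{Transport: tr}

        var wg sync.WaitGroup
        for n := 1000; n > 0; n-- {
            wg.Add(1)
            go func() {
                defer wg.Done()
                resp, err := client.Get("http://" + l.Addr().String())
                chk(err)
                _, _ = io.Copy(ioutil.Discard, resp.Body)
                chk(resp.Body.Close())
            }()
        }
        wg.Wait()
        runtime.GC() // run the fd finalizers promptly
        select {}    // stay alive so the goroutine profile can be fetched
    }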

What did you expect to see?

No goroutines hung.

What did you see instead?

$ curl 'localhost:9090/debug/pprof/goroutine?debug=1'
goroutine profile: total 3508
3504 @ 0x479e95 0x477e0f 0x476759 0x48f728 0x4f6def 0x507eaa 0x678ed6 0x566c1e 0x566d6a 0x679995 0x45b1b1
#   0x479e94    syscall.Syscall+0x4         /home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#   0x477e0e    syscall.read+0x5e           /home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#   0x476758    syscall.Read+0x48           /home/twmb/go/go/src/syscall/syscall_unix.go:162
#   0x48f727    internal/poll.(*FD).Read+0x117      /home/twmb/go/go/src/internal/poll/fd_unix.go:153
#   0x4f6dee    net.(*netFD).Read+0x4e          /home/twmb/go/go/src/net/fd_unix.go:202
#   0x507ea9    net.(*conn).Read+0x69           /home/twmb/go/go/src/net/net.go:176
#   0x678ed5    net/http.(*persistConn).Read+0x135  /home/twmb/go/go/src/net/http/transport.go:1453
#   0x566c1d    bufio.(*Reader).fill+0x11d      /home/twmb/go/go/src/bufio/bufio.go:100
#   0x566d69    bufio.(*Reader).Peek+0x39       /home/twmb/go/go/src/bufio/bufio.go:132
#   0x679994    net/http.(*persistConn).readLoop+0x184  /home/twmb/go/go/src/net/http/transport.go:1601

These goroutines do not go away.

If tc.File() is commented out in the first play link, the goroutines do go away. I suspect there is a race between the connection being switched into blocking mode (and removed from epoll management) and a goroutine falling into epoll to wait for a read.

The examples use an http server / client, but I suspect the issue is also visible with raw connections.
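
For background on how File() can affect the original connection at all: a dup'd fd shares its open file description with the socket, so the O_NONBLOCK status flag is shared too, and clearing it on the dup also clears it on the fd the runtime is polling. A Linux-only demonstration of just that mechanism (illustrative; not code from the play links):

    package main

    import (
        "fmt"
        "syscall"
    )

    // getfl returns the file status flags of fd via fcntl(F_GETFL).
    func getfl(fd int) int {
        flags, _, errno := syscall.Syscall(syscall.SYS_FCNTL, uintptr(fd), syscall.F_GETFL, 0)
        if errno != 0 {
            panic(errno)
        }
        return int(flags)
    }

    func main() {
        fd, err := syscall.Socket(syscall.AF_INET, syscall.SOCK_STREAM, 0)
        if err != nil {
            panic(err)
        }
        defer syscall.Close(fd)

        if err := syscall.SetNonblock(fd, true); err != nil { // how the runtime keeps sockets
            panic(err)
        }

        dup, err := syscall.Dup(fd) // what File() does internally
        if err != nil {
            panic(err)
        }
        defer syscall.Close(dup)

        if err := syscall.SetNonblock(dup, false); err != nil { // the blocking flip
            panic(err)
        }

        // Prints false: clearing O_NONBLOCK on the dup cleared it on the original too.
        fmt.Println("original still non-blocking?", getfl(fd)&syscall.O_NONBLOCK != 0)
    }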

bcmills commented 6 years ago

https://golang.org/pkg/net/#TCPConn.File says: “It is the caller's responsibility to close f when finished. Closing c does not affect f, and closing f does not affect c.”

Is the File call paired with a Close call in the real code where you've observed the leak? If so, a more realistic reproducer would be helpful. Otherwise, it's unfortunate that the Close requirement is enforced only by documentation, but there's not much we can do about that in the short term.
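
A sketch of code honoring that contract, pairing each File with its own Close (an illustrative helper, not code from the reproducer):

    package tcputil

    import (
        "net"
        "os"
    )

    // withFile hands the dup'd *os.File to fn and always closes it,
    // per the documented contract that the caller owns f.
    func withFile(tc *net.TCPConn, fn func(*os.File) error) error {
        f, err := tc.File()
        if err != nil {
            return err
        }
        defer f.Close() // closing tc does not close f, and vice versa
        return fn(f)
    }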

twmb commented 6 years ago

The File code internally calls os.NewFile, which sets a finalizer on the fd to close it when it is garbage collected. However, the tc.File() line in the code above can be changed to

    f, err := tc.File() // <-- this is bad!
    chk(err)
    go func() {
            chk(f.Close())
    }()

to rule out the unclosed file as the cause. For convenience, the following play link (which includes this change) also reproduces the issue: https://play.golang.org/p/tnNDto5l8vf
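
For context on the finalizer timing: finalizers, like the one os.NewFile attaches to close the fd, only run after a garbage-collection cycle, which is why runtime.GC() shows up in these reproducers. A tiny standalone demonstration using runtime.SetFinalizer directly:

    package main

    import (
        "fmt"
        "runtime"
        "time"
    )

    type handle struct{ id int }

    func main() {
        h := &handle{id: 1}
        runtime.SetFinalizer(h, func(h *handle) {
            fmt.Println("finalizer ran for handle", h.id) // stands in for closing the fd
        })
        h = nil // drop the last reference; nothing happens yet

        runtime.GC()                       // without this, a short program may exit before the finalizer runs
        time.Sleep(100 * time.Millisecond) // give the finalizer goroutine time to print
    }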

fraenkel commented 6 years ago

You do realize you are calling Do(req) twice and you never close those responses.

twmb commented 6 years ago

It was a typo from modifying the same source enough to get a "minimal" reproducer. Here is an updated file: https://play.golang.org/p/v2RRGMZJAY3. The fixed playground code closes the responses in a goroutine, the same way the prior code did.

I welcome pointers to any other mistakes I have made, but this is a minimal reproducer for something I have been trying to track down on and off for 2.5 years now. I suspect the epoll implementation itself is a bit off, based on other stuck goroutines I am seeing (a process that has been up for 140 days has many goroutines stuck in runtime_pollWait), but I will try to create a minimal reproducer for that separately.

fraenkel commented 6 years ago

The problem is that the socket was added to epoll with EPOLLET (edge-triggered), which assumes the fd stays non-blocking; after the File() call that is no longer true.

bcmills commented 6 years ago

If you use a sync.WaitGroup instead of a spin-loop counter (https://github.com/golang/go/issues/10958) and don't hard-code the port, you can make the program runnable in the Playground.
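
A minimal sketch of those two changes in isolation (illustrative; not the playground code itself):

    package main

    import (
        "fmt"
        "net"
        "sync"
    )

    func main() {
        // Port 0: the kernel assigns a free port instead of a hard-coded one.
        l, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            panic(err)
        }
        defer l.Close()
        fmt.Println("listening on", l.Addr())

        // A WaitGroup replaces spinning on a counter.
        var wg sync.WaitGroup
        for n := 10; n > 0; n-- {
            wg.Add(1)
            go func() {
                defer wg.Done()
                c, err := net.Dial("tcp", l.Addr().String())
                if err != nil {
                    panic(err)
                }
                c.Close()
            }()
        }
        wg.Wait() // returns once every dial has finished; no busy loop
    }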

After removing the duplicated Do call (well-spotted, @fraenkel!), I'm not seeing a leak: https://play.golang.org/p/kWWS6-gOxw9

Does that program produce a different result on your machine?

bcmills commented 6 years ago

If I remove the runtime.GC() calls, I do see a short-term goroutine leak, but it's the one described in https://github.com/golang/go/issues/21597, not the one reported above: https://play.golang.org/p/WWD7CjCSFzS

twmb commented 6 years ago

Yes, I do still see the leak. Part of the reason for the code being structured so tightly was to attempt to force the races. I ran your code; nothing leaked for one run at 1,000 connections, but leaks did occur at 10,000.

goroutine profile: total 330
163 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
#   0x428c06    internal/poll.runtime_pollWait+0x56     /home/twmb/go/go/src/runtime/netpoll.go:173
#   0x48dcda    internal/poll.(*pollDesc).wait+0x9a     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#   0x48dd5c    internal/poll.(*pollDesc).waitRead+0x3c     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#   0x48ebbc    internal/poll.(*FD).Read+0x17c          /home/twmb/go/go/src/internal/poll/fd_unix.go:157
#   0x4f1b5e    net.(*netFD).Read+0x4e              /home/twmb/go/go/src/net/fd_unix.go:202
#   0x502c19    net.(*conn).Read+0x69               /home/twmb/go/go/src/net/net.go:176
#   0x65c517    net/http.(*connReader).Read+0xf7        /home/twmb/go/go/src/net/http/server.go:764
#   0x561aad    bufio.(*Reader).fill+0x11d          /home/twmb/go/go/src/bufio/bufio.go:100
#   0x56278b    bufio.(*Reader).ReadSlice+0x2b          /home/twmb/go/go/src/bufio/bufio.go:341
#   0x5629a3    bufio.(*Reader).ReadLine+0x33           /home/twmb/go/go/src/bufio/bufio.go:370
#   0x5ecadf    net/textproto.(*Reader).readLineSlice+0x6f  /home/twmb/go/go/src/net/textproto/reader.go:55
#   0x5ec8ea    net/textproto.(*Reader).ReadLine+0x2a       /home/twmb/go/go/src/net/textproto/reader.go:36
#   0x656ebb    net/http.readRequest+0x8b           /home/twmb/go/go/src/net/http/request.go:929
#   0x65d7be    net/http.(*conn).readRequest+0x16e      /home/twmb/go/go/src/net/http/server.go:944
#   0x6614bb    net/http.(*conn).serve+0x4db            /home/twmb/go/go/src/net/http/server.go:1768

163 @ 0x479ef5 0x477e6f 0x4767b9 0x48eb58 0x4f1b5f 0x502c1a 0x673db6 0x561aae 0x561bfa 0x674875 0x45b211
#   0x479ef4    syscall.Syscall+0x4         /home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#   0x477e6e    syscall.read+0x5e           /home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#   0x4767b8    syscall.Read+0x48           /home/twmb/go/go/src/syscall/syscall_unix.go:162
#   0x48eb57    internal/poll.(*FD).Read+0x117      /home/twmb/go/go/src/internal/poll/fd_unix.go:153
#   0x4f1b5e    net.(*netFD).Read+0x4e          /home/twmb/go/go/src/net/fd_unix.go:202
#   0x502c19    net.(*conn).Read+0x69           /home/twmb/go/go/src/net/net.go:176
#   0x673db5    net/http.(*persistConn).Read+0x135  /home/twmb/go/go/src/net/http/transport.go:1453
#   0x561aad    bufio.(*Reader).fill+0x11d      /home/twmb/go/go/src/bufio/bufio.go:100
#   0x561bf9    bufio.(*Reader).Peek+0x39       /home/twmb/go/go/src/bufio/bufio.go:132
#   0x674874    net/http.(*persistConn).readLoop+0x184  /home/twmb/go/go/src/net/http/transport.go:1601

Here is my current diff on your code (with the GC):

$ diff three.go his.go 
47c47
<   for n := 10000; n > 0; n-- {
---
>   for n := 1000; n > 0; n-- {
74,75d73
<   fmt.Println(l.Addr().String())
< 
82d79
<   select {}

What is most exciting for me, actually, is that re-running the code (with 10k connections) a few times in a row without the tc.File call occasionally triggers the other leak I've been trying to build a reproducer for (edit: this one may just be from keep-alives):

goroutine profile: total 12
6 @ 0x42e26a 0x42958a 0x428c07 0x48dcdb 0x48dd5d 0x48ebbd 0x4f1b5f 0x502c1a 0x65c518 0x561aae 0x56278c 0x5629a4 0x5ecae0 0x5ec8eb 0x656ebc 0x65d7bf 0x6614bc 0x45b211
#   0x428c06    internal/poll.runtime_pollWait+0x56     /home/twmb/go/go/src/runtime/netpoll.go:173
#   0x48dcda    internal/poll.(*pollDesc).wait+0x9a     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#   0x48dd5c    internal/poll.(*pollDesc).waitRead+0x3c     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#   0x48ebbc    internal/poll.(*FD).Read+0x17c          /home/twmb/go/go/src/internal/poll/fd_unix.go:157
#   0x4f1b5e    net.(*netFD).Read+0x4e              /home/twmb/go/go/src/net/fd_unix.go:202
#   0x502c19    net.(*conn).Read+0x69               /home/twmb/go/go/src/net/net.go:176
#   0x65c517    net/http.(*connReader).Read+0xf7        /home/twmb/go/go/src/net/http/server.go:764
#   0x561aad    bufio.(*Reader).fill+0x11d          /home/twmb/go/go/src/bufio/bufio.go:100
#   0x56278b    bufio.(*Reader).ReadSlice+0x2b          /home/twmb/go/go/src/bufio/bufio.go:341
#   0x5629a3    bufio.(*Reader).ReadLine+0x33           /home/twmb/go/go/src/bufio/bufio.go:370
#   0x5ecadf    net/textproto.(*Reader).readLineSlice+0x6f  /home/twmb/go/go/src/net/textproto/reader.go:55
#   0x5ec8ea    net/textproto.(*Reader).ReadLine+0x2a       /home/twmb/go/go/src/net/textproto/reader.go:36
#   0x656ebb    net/http.readRequest+0x8b           /home/twmb/go/go/src/net/http/request.go:929
#   0x65d7be    net/http.(*conn).readRequest+0x16e      /home/twmb/go/go/src/net/http/server.go:944
#   0x6614bb    net/http.(*conn).serve+0x4db            /home/twmb/go/go/src/net/http/server.go:1768

It is much more subtle, but also much more real; let me know whether I should open a separate issue for it or add more details here.

I recommend running the code with 10k connections and maybe on worse hardware.

fraenkel commented 6 years ago

The only "leak" I see is with

net/http.(*persistConn).readLoop(0xc422be0c60)
    /snap/go/1473/src/net/http/transport.go:1601 +0x185
created by net/http.(*Transport).dialConn
    /snap/go/1473/src/net/http/transport.go:1237 +0x95a

The problem there is that it will never break out of the read: the fd is blocking, so all deadlines are useless.

I do not see any other leaks.

I do see the persistConn.readLoop stacks when the initial stacks are dumped, but if I wait and then dump again, they are gone. The dialConn ones are still present.
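
To illustrate the deadline point: deadlines are enforced by the netpoller, which only manages non-blocking fds, so they fire on an ordinary conn but have nothing to interrupt once a goroutine is parked inside a blocking read(2). A standalone sketch of the working case:

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    func main() {
        l, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            panic(err)
        }
        defer l.Close()

        go func() {
            c, err := net.Dial("tcp", l.Addr().String())
            if err != nil {
                panic(err)
            }
            defer c.Close()
            time.Sleep(2 * time.Second) // never write; force the server's read to time out
        }()

        c, err := l.Accept()
        if err != nil {
            panic(err)
        }
        defer c.Close()

        c.SetReadDeadline(time.Now().Add(500 * time.Millisecond))
        buf := make([]byte, 1)
        _, err = c.Read(buf)
        fmt.Println("read returned:", err) // i/o timeout, delivered by the poller
    }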

twmb commented 6 years ago

I have just re-ran this by copying the code out of the play link in https://github.com/golang/go/issues/24455#issuecomment-375019236 and applied my bump-to-10k and hang-with-select mods.

I ran this and then waited 5 minutes before checking the goroutines. At 5 minutes:

1020 @ 0x4795a5 0x4774ff 0x475e49 0x48e0e9 0x4fed2f 0x5118c8 0x67e348 0x56d1a5 0x56d2fa 0x67eea8 0x45ad31
#   0x4795a4    syscall.Syscall+0x4         /home/twmb/go/go/src/syscall/asm_linux_amd64.s:18
#   0x4774fe    syscall.read+0x5e           /home/twmb/go/go/src/syscall/zsyscall_linux_amd64.go:749
#   0x475e48    syscall.Read+0x48           /home/twmb/go/go/src/syscall/syscall_unix.go:172
#   0x48e0e8    internal/poll.(*FD).Read+0x118      /home/twmb/go/go/src/internal/poll/fd_unix.go:159
#   0x4fed2e    net.(*netFD).Read+0x4e          /home/twmb/go/go/src/net/fd_unix.go:202
#   0x5118c7    net.(*conn).Read+0x67           /home/twmb/go/go/src/net/net.go:176
#   0x67e347    net/http.(*persistConn).Read+0x77   /home/twmb/go/go/src/net/http/transport.go:1453
#   0x56d1a4    bufio.(*Reader).fill+0x104      /home/twmb/go/go/src/bufio/bufio.go:100
#   0x56d2f9    bufio.(*Reader).Peek+0x39       /home/twmb/go/go/src/bufio/bufio.go:132
#   0x67eea7    net/http.(*persistConn).readLoop+0x187  /home/twmb/go/go/src/net/http/transport.go:1601

988 @ 0x42f085 0x42a4ec 0x429b86 0x48d23a 0x48d2bd 0x48e149 0x4fed2f 0x5118c8 0x666d7c 0x56d1a5 0x56de4c 0x56e064 0x5f7f50 0x5f7d4b 0x6616fc 0x668012 0x66bc1a 0x45ad31
#   0x429b85    internal/poll.runtime_pollWait+0x65     /home/twmb/go/go/src/runtime/netpoll.go:173
#   0x48d239    internal/poll.(*pollDesc).wait+0x99     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:85
#   0x48d2bc    internal/poll.(*pollDesc).waitRead+0x3c     /home/twmb/go/go/src/internal/poll/fd_poll_runtime.go:90
#   0x48e148    internal/poll.(*FD).Read+0x178          /home/twmb/go/go/src/internal/poll/fd_unix.go:163
#   0x4fed2e    net.(*netFD).Read+0x4e              /home/twmb/go/go/src/net/fd_unix.go:202
#   0x5118c7    net.(*conn).Read+0x67               /home/twmb/go/go/src/net/net.go:176
#   0x666d7b    net/http.(*connReader).Read+0xfb        /home/twmb/go/go/src/net/http/server.go:765
#   0x56d1a4    bufio.(*Reader).fill+0x104          /home/twmb/go/go/src/bufio/bufio.go:100
#   0x56de4b    bufio.(*Reader).ReadSlice+0x2b          /home/twmb/go/go/src/bufio/bufio.go:341
#   0x56e063    bufio.(*Reader).ReadLine+0x33           /home/twmb/go/go/src/bufio/bufio.go:370
#   0x5f7f4f    net/textproto.(*Reader).readLineSlice+0x6f  /home/twmb/go/go/src/net/textproto/reader.go:55
#   0x5f7d4a    net/textproto.(*Reader).ReadLine+0x2a       /home/twmb/go/go/src/net/textproto/reader.go:36
#   0x6616fb    net/http.readRequest+0x8b           /home/twmb/go/go/src/net/http/request.go:929
#   0x668011    net/http.(*conn).readRequest+0x161      /home/twmb/go/go/src/net/http/server.go:945
#   0x66bc19    net/http.(*conn).serve+0x4b9            /home/twmb/go/go/src/net/http/server.go:1769

I would not be surprised if your hardware is better than mine and the problem simply does not show up there. What more can I do in this situation?

fraenkel commented 6 years ago

I can tell you that I did an strace with and without the File() call to see what the underlying difference was. Fewer than 5 connections were enough to show that on the normal path a read would fail with EAGAIN, which did not and could not happen once the connection was blocking.
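
That EAGAIN difference is easy to see in isolation (a sketch of the mechanism, not the strace session itself):

    package main

    import (
        "fmt"
        "syscall"
    )

    func main() {
        // A connected pair of Unix sockets, no network setup needed.
        fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
        if err != nil {
            panic(err)
        }
        defer syscall.Close(fds[0])
        defer syscall.Close(fds[1])

        if err := syscall.SetNonblock(fds[0], true); err != nil {
            panic(err)
        }

        // Nothing has been written, so a non-blocking read fails immediately
        // with EAGAIN -- the signal the netpoller uses to park goroutines.
        // A blocking socket would simply never return here.
        buf := make([]byte, 16)
        _, err = syscall.Read(fds[0], buf)
        fmt.Println("non-blocking read:", err)
    }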

bcmills commented 6 years ago

@fraenkel Do you have a fix in mind for the dialConn / readLoop hang, or does this need further investigation?

twmb commented 6 years ago

Potentially related to https://github.com/golang/go/issues/24483, but I have not seen the same race trace while repeatedly running this.

twmb commented 6 years ago

Also potentially related (outside of the File call in particular): https://groups.google.com/forum/#!topic/golang-nuts/zqsC6xcnP24 https://github.com/golang/go/issues/16060 https://github.com/go-sql-driver/mysql/issues/285 https://github.com/jlaffaye/ftp/issues/112

fraenkel commented 6 years ago

@bcmills It would need further investigation, but this is a slippery slope. The poller assumes edge-triggered notification is enabled. This is very similar to #24331, which is attempting to add a new method to prevent the flip to blocking mode.
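
For comparison, a sketch using syscall.RawConn (via Conn.SyscallConn, Go 1.9+), which exposes the fd without dup'ing it or touching its blocking mode:

    package main

    import (
        "fmt"
        "net"
        "syscall"
    )

    func main() {
        l, err := net.Listen("tcp", "localhost:0")
        if err != nil {
            panic(err)
        }
        defer l.Close()

        go net.Dial("tcp", l.Addr().String()) // just to have something to accept

        c, err := l.Accept()
        if err != nil {
            panic(err)
        }
        defer c.Close()

        rc, err := c.(*net.TCPConn).SyscallConn()
        if err != nil {
            panic(err)
        }

        // Control locks the fd for the duration of the callback; the socket
        // stays non-blocking and registered with the poller throughout.
        var nodelay int
        var sockErr error
        ctlErr := rc.Control(func(fd uintptr) {
            nodelay, sockErr = syscall.GetsockoptInt(int(fd), syscall.IPPROTO_TCP, syscall.TCP_NODELAY)
        })
        if ctlErr != nil || sockErr != nil {
            panic("getsockopt failed")
        }
        fmt.Println("TCP_NODELAY =", nodelay)
    }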

twmb commented 6 years ago

Fixed by #24942 (though the underlying issue may still exist with nonblocking connections).