Closed StephenButtolph closed 1 year ago
UPDATE:
I did a little more in-depth research:
- On BSD and other Unix-like OS's: calling close(2) on a non-blocking socket does not block, it returns immediately and the kernel tries sending the leftover data to the peer.
- On Linux: calling close(2) on a non-blocking socket may block for the duration (seconds) to linger.
In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().
From my point of view, the comment on SetLinger might be a bit too concise and could be better worded, but I would consider it accurate.
Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?
What is your goal here?
The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).
Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data will be discarded. Is that what you want?
Since Go uses non-blocking I/O, as @panjf2000 says, the behavior of SetLinger shouldn't normally affect whether Close blocks or not.
The most common use of SetLinger is to call it with 0, meaning that any unsent data will be immediately discarded.
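For illustration, the zero-linger case described above could look roughly like the sketch below. The helper names abortConn and demo are hypothetical (they are not part of the net package), and the loopback listener exists only to obtain a connection to close.

```go
package main

import (
	"fmt"
	"net"
)

// abortConn is a hypothetical helper, not part of the net package. It sets
// SO_LINGER to {on, 0} and then closes, so unsent data is discarded.
func abortConn(c *net.TCPConn) error {
	if err := c.SetLinger(0); err != nil {
		return err
	}
	return c.Close()
}

// demo opens a loopback connection and aborts the accepted side.
func demo() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return err
	}
	return abortConn(server.(*net.TCPConn))
}

func main() {
	if err := demo(); err != nil {
		fmt.Println("abort failed:", err)
		return
	}
	fmt.Println("connection aborted")
}
```

With a zero timeout there is nothing to wait for, so this case is not expected to show the blocking discussed in the rest of the thread.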
> Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?
It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116, which wasn't happening before our usage of SetLinger.
> The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).
> Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data will be discarded. Is that what you want?
The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it, rather than relying on an amorphous OS default.
> > Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?
> It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116, which wasn't happening before our usage of SetLinger.
If this is the case, then I believe this has nothing to do with SO_LINGER and it might be a misuse. I think #21856 may help: check whether there is a similar usage in your code. Concretely, are there any other places still holding a reference to that connection, which may stop c.Close() from returning?
> If this is the case, then I believe this has nothing to do with SO_LINGER and it might be a misuse. I think #21856 may help: check whether there is a similar usage in your code. Concretely, are there any other places still holding a reference to that connection, which may stop c.Close() from returning?
Hm, there are other goroutines calling Read and Write, which I expect the call to Close to cause to return an error, based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.
The main difference here is that Close is not blocking on the listener, but on the actual TCPConn.
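The expectation that Close unblocks a pending Read can be checked with a small sketch like the one below. No linger option is set here, readUnblockedByClose is a hypothetical name, and the 100ms sleep is only an assumption to give Read time to block first.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"time"
)

// readUnblockedByClose blocks a goroutine in Read, closes the connection from
// another goroutine, and reports whether Read failed with net.ErrClosed.
func readUnblockedByClose() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	defer ln.Close()

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()

	server, err := ln.Accept()
	if err != nil {
		return false, err
	}

	readErr := make(chan error, 1)
	go func() {
		// The client never writes, so this Read blocks until Close.
		_, err := server.Read(make([]byte, 1))
		readErr <- err
	}()

	time.Sleep(100 * time.Millisecond) // give Read a chance to block
	if err := server.Close(); err != nil {
		return false, err
	}
	return errors.Is(<-readErr, net.ErrClosed), nil
}

func main() {
	ok, err := readUnblockedByClose()
	fmt.Println("read unblocked with net.ErrClosed:", ok, "setup error:", err)
}
```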
> In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().
I think this may actually be the issue. It seems that on Linux systems, calling close() with SO_LINGER set to a positive timeout on a SOCK_NONBLOCK socket does actually block.
Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/
> Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.
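To make the {on, N} notation concrete, here is a minimal sketch of how a SetLinger-style seconds argument corresponds to the socket option value. The helpers lingerFor and setLingerRaw are hypothetical names, the mapping follows the cases documented on SetLinger, and the raw setsockopt call assumes a Unix-like system.

```go
package main

import (
	"fmt"
	"net"
	"syscall"
)

// lingerFor is a hypothetical helper mirroring the documented SetLinger cases:
// sec < 0 leaves lingering off, sec >= 0 turns it on with that timeout.
func lingerFor(sec int) syscall.Linger {
	if sec < 0 {
		return syscall.Linger{Onoff: 0, Linger: 0}
	}
	return syscall.Linger{Onoff: 1, Linger: int32(sec)}
}

// setLingerRaw applies the option directly on the descriptor
// (assumption: Unix-like systems, where the descriptor is an int).
func setLingerRaw(c *net.TCPConn, sec int) error {
	raw, err := c.SyscallConn()
	if err != nil {
		return err
	}
	l := lingerFor(sec)
	var sockErr error
	err = raw.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptLinger(int(fd), syscall.SOL_SOCKET, syscall.SO_LINGER, &l)
	})
	if err != nil {
		return err
	}
	return sockErr
}

func main() {
	fmt.Printf("%+v\n", lingerFor(15))
}
```

In ordinary code, calling SetLinger on the *net.TCPConn is the simpler route; the raw version is shown only to spell out the two fields.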
> Hm, there are other goroutines calling Read and Write which I'm expecting the call to Close to end up causing to return an error based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.
Sorry for the ambiguous statement in my previous comment. Your usage of Close() here is totally justified; I was talking about some more sophisticated cases of misuse.
> > In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().
> I think this may actually be the issue. It seems that on Linux systems, calling close() with SO_LINGER set to a positive timeout on a SOCK_NONBLOCK socket does actually block.
> Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/
> > Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.
Despite what this link says, according to your earlier comment:
> It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger.
the code blocking you there is not the system call close(2). Somehow the goroutine that is closing the connection is put into a waiting state because the runtime believes the underlying file descriptor is not closed yet; the runtime will wake that goroutine, and Close() will return, after the fd is actually closed.
> The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it; rather than relying on an amorphous OS default.
My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.
That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema)? Because I don't understand how that could be affected by SetLinger.
Is there a test case we can run to recreate the problem?
> My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.
Yeah, I wouldn't consider this a pressing issue for us; we're likely to just remove our usage of SetLinger given the apparent platform-dependent behavior. At this point I'm mainly interested in figuring out what is actually happening so that others don't run into the same issues we are.
> That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema)? Because I don't understand how that could be affected by SetLinger.
Yes. Here is a trace showing that (run using our release binaries, which were built with go1.19.6):
Goroutine 1585 - Start: /home/runner/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:72 golang.org/x/sync/errgroup.(*Group).Go.func1 (0x78a5c0) [semacquire]
0 0x00000000004444d6 in runtime.gopark
at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:364
1 0x000000000045543e in runtime.goparkunlock
at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:369
2 0x000000000045543e in runtime.semacquire1
at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:150
3 0x0000000000471e85 in internal/poll.runtime_Semacquire
at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:67
4 0x00000000004e62ad in internal/poll.(*FD).Close
at /opt/hostedtoolcache/go/1.19.6/x64/src/internal/poll/fd_unix.go:116
5 0x0000000000549d18 in net.(*netFD).Close
at /opt/hostedtoolcache/go/1.19.6/x64/src/net/fd_posix.go:37
6 0x000000000055cea5 in net.(*conn).Close
at /opt/hostedtoolcache/go/1.19.6/x64/src/net/net.go:207
7 0x00000000006a730d in crypto/tls.(*Conn).Close
at /opt/hostedtoolcache/go/1.19.6/x64/src/crypto/tls/conn.go:1374
8 0x0000000000c10862 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose.func1
at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:311
9 0x000000000048f082 in sync.(*Once).doSlow
at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:74
10 0x0000000000c107e7 in sync.(*Once).Do
at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:65
11 0x0000000000c107e7 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose
at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:310
(truncated)
> Is there a test case we can run to recreate the problem?
I'll work on getting a reproducible test case in the next couple days.
After writing a very short program I was able to replicate conn.Close hanging, but this time dlv points to syscall.Close (which seems more in line with the reported behavior of SO_LINGER...).
I got this from running this server:
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":7777")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")

	fmt.Println("waiting to accept")
	conn, err := listener.Accept()
	if err != nil {
		panic(err)
	}
	fmt.Println("accepted connection")

	fmt.Println("setting linger to 15s")
	tcpConn := conn.(*net.TCPConn)
	// SetLinger takes seconds: SO_LINGER {on, 15}.
	err = tcpConn.SetLinger(15)
	if err != nil {
		panic(err)
	}
	fmt.Println("set linger to 15s")

	fmt.Println("writing some data")
	msg := make([]byte, 1<<20)
	n, err := conn.Write(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("wrote %d bytes of data\n", n)

	// Measure how long Close takes to return.
	startClose := time.Now()
	fmt.Println("starting to close the connection")
	err = conn.Close()
	if err != nil {
		panic(err)
	}
	fmt.Printf("closing the connection took %s\n", time.Since(startClose))
}
Along with running the client from the previously mentioned SO_LINGER tests: https://github.com/nybek/linger-tools/blob/master/linger-client.c with the arguments -i 127.0.0.1.
The output of the server should look like:
starting to listen
listening
waiting to accept
accepted connection
setting linger to 15s
set linger to 15s
writing some data
wrote 1048576 bytes of data
starting to close the connection
closing the connection took 15.082815133s
This was all run using the go version + go env listed at the beginning of the issue.
I feel like this is already a deviation from the documented behavior of SetLinger... But I'll keep looking into reproducing the hang at https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116.
Here is a pure Go example that includes both the server and the client and has the same results as above:
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")
	defer listener.Close()

	addr := listener.Addr()

	var wg sync.WaitGroup
	wg.Add(1)
	// Server side: accept, set linger, write, then time the Close call.
	go func() {
		fmt.Println("waiting to accept")
		conn, err := listener.Accept()
		if err != nil {
			panic(err)
		}
		fmt.Println("accepted connection")

		fmt.Println("setting linger to 15s")
		tcpConn := conn.(*net.TCPConn)
		err = tcpConn.SetLinger(15)
		if err != nil {
			panic(err)
		}
		fmt.Println("set linger to 15s")

		fmt.Println("writing some data")
		msg := make([]byte, 1<<20)
		n, err := conn.Write(msg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("wrote %d bytes of data\n", n)

		startClose := time.Now()
		fmt.Println("starting to close the connection")
		err = conn.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("closing the connection took %s\n", time.Since(startClose))
		wg.Done()
	}()

	// Client side: connect and never read, so the server's data stays unsent.
	conn, err := net.Dial("tcp", addr.String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	wg.Wait()
}
Ok, I think I'm able to fully close the loop on this now. This program replicates the blocking on runtime_Semacquire.
Here is the output from dlv: goroutines.txt
package main

import (
	"fmt"
	"net"
	"sync"
	"time"
)

func main() {
	fmt.Println("starting to listen")
	listener, err := net.Listen("tcp", ":")
	if err != nil {
		panic(err)
	}
	fmt.Println("listening")
	defer listener.Close()

	addr := listener.Addr()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		fmt.Println("waiting to accept")
		conn, err := listener.Accept()
		if err != nil {
			panic(err)
		}
		fmt.Println("accepted connection")

		fmt.Println("setting linger to 15s")
		tcpConn := conn.(*net.TCPConn)
		// SetLinger takes seconds: SO_LINGER {on, 15}.
		err = tcpConn.SetLinger(15)
		if err != nil {
			panic(err)
		}
		fmt.Println("set linger to 15s")

		// A concurrent Read is still in flight when Close is called below.
		go func() {
			fmt.Println("starting read")
			_, _ = conn.Read([]byte{0})
			fmt.Println("exited read")
		}()

		fmt.Println("writing some data")
		msg := make([]byte, 1<<20)
		n, err := conn.Write(msg)
		if err != nil {
			panic(err)
		}
		fmt.Printf("wrote %d bytes of data\n", n)

		startClose := time.Now()
		fmt.Println("starting to close the connection")
		err = conn.Close()
		if err != nil {
			panic(err)
		}
		fmt.Printf("closing the connection took %s\n", time.Since(startClose))
		wg.Done()
	}()

	conn, err := net.Dial("tcp", addr.String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	wg.Wait()
}
It seems like the call to syscall.Close moved into the Read call, and conn.Close is waiting for Read to return.
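One way to picture that observation is the toy model below. It is a deliberately simplified sketch and not the real internal/poll code: toyFD, destroy, and whoClosed are hypothetical names, and the assumption being illustrated is that whichever call drops the last reference performs the actual close while Close waits for it.

```go
package main

import (
	"fmt"
	"sync"
)

// toyFD is a deliberately simplified, one-shot model, not the real
// internal/poll code.
type toyFD struct {
	mu          sync.Mutex
	refs        int           // in-flight Read calls
	closing     bool          // set once Close has been called
	wake        chan struct{} // closed by Close to wake blocked readers
	done        chan struct{} // closed once the descriptor is destroyed
	destroyedBy string        // which call ran the stand-in for close(2)
}

func newToyFD() *toyFD {
	return &toyFD{wake: make(chan struct{}), done: make(chan struct{})}
}

// destroy stands in for the close(2) system call, which may linger.
func (fd *toyFD) destroy(who string) {
	fd.destroyedBy = who
	close(fd.done)
}

// Read takes a reference, blocks until Close wakes it, then drops the
// reference. If it held the last reference, it destroys the descriptor.
func (fd *toyFD) Read(started chan<- struct{}) {
	fd.mu.Lock()
	if fd.closing {
		fd.mu.Unlock()
		close(started)
		return
	}
	fd.refs++
	fd.mu.Unlock()
	close(started)

	<-fd.wake

	fd.mu.Lock()
	fd.refs--
	last := fd.refs == 0
	fd.mu.Unlock()
	if last {
		fd.destroy("Read")
	}
}

// Close marks the descriptor as closing, wakes readers, and waits until
// whoever holds the last reference has destroyed it.
func (fd *toyFD) Close() {
	fd.mu.Lock()
	fd.closing = true
	idle := fd.refs == 0
	fd.mu.Unlock()
	close(fd.wake)
	if idle {
		fd.destroy("Close")
	}
	<-fd.done // loosely analogous to waiting on fd.csema
}

// whoClosed starts a Read, then calls Close, and reports who destroyed the fd.
func whoClosed() string {
	fd := newToyFD()
	started := make(chan struct{})
	go fd.Read(started)
	<-started
	fd.Close()
	return fd.destroyedBy
}

func main() {
	fmt.Println("descriptor destroyed by:", whoClosed())
}
```

In this model, a slow destroy inside Read would keep Close waiting on the done channel, which matches the shape of the trace where Close sits in runtime_Semacquire.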
Thanks. For me that program blocks for 15 seconds in the call to syscall.Close, which is consistent with what you said earlier. I guess that Linux behaves that way, which I was not previously aware of. That makes it seem that there is no bug from Go's perspective. There isn't much that Go can do if the system call blocks.
But I guess we can mention that in the SetLinger docs.
About the blocking on close(2) with SO_LINGER, see this previous comment:
> UPDATE:
> I did a little more in-depth research:
> - On BSD and other Unix-like OS's: calling close(2) on a non-blocking socket does not block, it returns immediately and the kernel tries sending the leftover data to the peer.
> - On Linux: calling close(2) on a non-blocking socket may block for the duration (seconds) to linger.
> In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().
> From my point of view, the comment on SetLinger might be a bit too concise and could be better worded but I would consider it accurate.
> Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?
Also, I've run the test code above on my macOS machine and it didn't reproduce this issue, which corroborates my updated comment.
Change https://go.dev/cl/473915 mentions this issue: net: indicate the exception on Linux of Close blocking with SO_LINGER
Thank you for all the effort here. @StephenButtolph
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes.
What operating system and processor architecture are you using (go env)?
go env Output
What did you do?
We added an explicit call to SetLinger(15) here: https://github.com/ava-labs/avalanchego/commit/1a2dca18d22a7a78e421e95dbdbe038287ee8361
What did you expect to see?
https://github.com/golang/go/blob/b94dc384cabf75e7e8703265cd80f5324f84b642/src/net/tcpsock.go#L161-L173 claims that providing SetLinger with a positive value will:
We expected for the OS to flush any outstanding data over the TCP stream in the background.
What did you see instead?
It doesn't seem that the data is being sent in the background, but that conn.Close() may block until the specified timeout.
I don't think this is actually unexpected for the behavior of SO_LINGER:
However, I feel like the comment on SetLinger seems to contradict the man pages.