golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123k stars 17.54k forks source link

net: document that on some systems SetLinger causes conn.Close to block #58882

Closed StephenButtolph closed 1 year ago

StephenButtolph commented 1 year ago

What version of Go are you using (go version)?

$ go version
go version go1.20.1 linux/amd64

Does this issue reproduce with the latest release?

Yes.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/stephen/.cache/go-build"
GOENV="/home/stephen/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/stephen/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/stephen/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build223210887=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We added an explicit call to SetLinger(15) here: https://github.com/ava-labs/avalanchego/commit/1a2dca18d22a7a78e421e95dbdbe038287ee8361

What did you expect to see?

https://github.com/golang/go/blob/b94dc384cabf75e7e8703265cd80f5324f84b642/src/net/tcpsock.go#L161-L173 claims that providing SetLinger with a positive value will:

// If sec > 0, the data is sent in the background as with sec < 0. On // some operating systems after sec seconds have elapsed any remaining // unsent data may be discarded.

We expected for the OS to flush any outstanding data over the TCP stream in the background.

What did you see instead?

It doesn't seem that the data is being sent in the backaground, but that conn.Close() may block until the specified timeout.

I don't think this is actually unexpected for the behavior of SO_LINGER:

SO_LINGER Lingers on a close() if data is present. This option controls the action taken when unsent messages queue on a socket and close() is performed. If SO_LINGER is set, the system shall block the process during close() until it can transmit the data or until the time expires. If SO_LINGER is not specified, and close() is issued, the system handles the call in a way that allows the process to continue as quickly as possible. This option takes a linger structure, as defined in the <sys/socket.h> header, to specify the state of the option and linger interval.

However, I feel like the comment on SetLinger seems to contradict the man pages.

panjf2000 commented 1 year ago

UPDATE:

I did a little more in-depth research:


In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

From my point of view, the comment on SetLinger might be a bit too concise and could be better worded but I would consider it accurate.

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

ianlancetaylor commented 1 year ago

What is your goal here?

The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).

Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data to be discarded. Is that what you want?

Since Go uses non-blocking I/O, as @panjf2000 says, the behavior of SetLinger shouldn't normally affect whether Close blocks or not.

The most common use of SetLinger is to call it with 0, meaning that any unsent data will be immediately discarded.

StephenButtolph commented 1 year ago

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger.

StephenButtolph commented 1 year ago

The default behavior is for any remaining unsent socket data to be sent in the background (the usual TCP timeouts continue to apply, so the data won't hang around indefinitely).

Calling SetLinger with a positive N seconds means that after N seconds the remaining unsent socket data to be discarded. Is that what you want?

The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it; rather than relying on an amorphous OS default.

panjf2000 commented 1 year ago

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger.

If this is the case, then I believe there is nothing to do with SO_LINGER and it might be a misuse, I think #21856 may help, check if there is a similar usage in your code, concretely, is there any other places still holding the reference of that connection? which may stop the c.Close() returning.

StephenButtolph commented 1 year ago

If this is the case, then I believe there is nothing to do with SO_LINGER and it might be a misuse, I think #21856 may help, check if there is a similar usage in your code, concretely, is there any other places still holding the reference of that connection? which may stop the c.Close() returning.

Hm, there are other goroutines calling Read and Write which I'm expecting the call to Close to end up cuasing to return an error based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.

The main difference here is that Close is not blocking on the listener, but the actual TCPConn.

StephenButtolph commented 1 year ago

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

I think this may actually be the issue. It seems that on linux systems, calling close() with SO_LINGER set with a positive timeout on a SOCK_NONBLOCKed socket does actually block.

Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/

Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.

nolingerresults

panjf2000 commented 1 year ago

Hm, there are other goroutines calling Read and Write which I'm expecting the call to Close to end up cuasing to return an error based on the Conn interface: https://github.com/golang/go/blob/master/src/net/net.go#L128-L130.

Sorry for the amphibolous statement about my previous comment, your usage of Close() here is totally justified, I was talking about some more sophisticated cases of misuse.

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

I think this may actually be the issue. It seems that on linux systems, calling close() with SO_LINGER set with a positive timeout on a SOCK_NONBLOCKed socket does actually block.

Apologies for the sketchy link... (let me know if I can send this info better somehow...) https://www.nybek.com/blog/2015/04/29/so_linger-on-non-blocking-sockets/

Setting SO_LINGER to {on, N} where N > 0 on a non-blocking socket on Linux is a particularly bad idea. Even though the socket is in non-blocking mode, this call [close] will block.

nolingerresults

Despite what this link says, according to your comment, It seems that way (at least to me). After running dlv on the process, the Close call seems to be blocked here: https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116. Which wasn't happening before our usage of SetLinger. before, the code there blocking you is not the system call close(2), somehow the current goroutine that is closing the connection is put into a waiting state cuz runtime believes the underlying file descriptor is not closed and will wake that goroutine, then return net.Close() after the fd is actually closed.

ianlancetaylor commented 1 year ago

The goal here is mainly around owning the amount of time for the server to attempt to flush the data before dropping it; rather than relying on an amorphous OS default.

My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.

That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema) ? Because I don't understand how that could be affected by SetLinger.

Is there a test case we can run to recreate the problem?

StephenButtolph commented 1 year ago

My understanding is that the OS default is simply the TCP timeout. And my understanding is that SetLinger isn't going to increase the TCP timeout anyhow.

Yeah, I wouldn't consider this a pressing issue for us, we're likely to just remove our usage of SetLinger given the apparent platform dependent behavior. At this point I'm mainly more interested in figuring out what is actually happening so that others don't run into the same issues we are.

That said, @panjf2000 makes a good point: can you confirm that when you see a goroutine hanging in close, it is hanging specifically on the line runtime_Semacquire(&fd.csema) ? Because I don't understand how that could be affected by SetLinger.

Yes. Here is a trace showing that (was run using our release binaries which were built with go1.19.6):

  Goroutine 1585 - Start: /home/runner/go/pkg/mod/golang.org/x/sync@v0.1.0/errgroup/errgroup.go:72 golang.org/x/sync/errgroup.(*Group).Go.func1 (0x78a5c0) [semacquire]
     0  0x00000000004444d6 in runtime.gopark
         at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:364
     1  0x000000000045543e in runtime.goparkunlock
         at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/proc.go:369
     2  0x000000000045543e in runtime.semacquire1
         at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:150
     3  0x0000000000471e85 in internal/poll.runtime_Semacquire
         at /opt/hostedtoolcache/go/1.19.6/x64/src/runtime/sema.go:67
     4  0x00000000004e62ad in internal/poll.(*FD).Close
         at /opt/hostedtoolcache/go/1.19.6/x64/src/internal/poll/fd_unix.go:116
     5  0x0000000000549d18 in net.(*netFD).Close
         at /opt/hostedtoolcache/go/1.19.6/x64/src/net/fd_posix.go:37
     6  0x000000000055cea5 in net.(*conn).Close
         at /opt/hostedtoolcache/go/1.19.6/x64/src/net/net.go:207
     7  0x00000000006a730d in crypto/tls.(*Conn).Close
         at /opt/hostedtoolcache/go/1.19.6/x64/src/crypto/tls/conn.go:1374
     8  0x0000000000c10862 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose.func1
         at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:311
     9  0x000000000048f082 in sync.(*Once).doSlow
         at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:74
    10  0x0000000000c107e7 in sync.(*Once).Do
         at /opt/hostedtoolcache/go/1.19.6/x64/src/sync/once.go:65
    11  0x0000000000c107e7 in github.com/ava-labs/avalanchego/network/peer.(*peer).StartClose
         at /home/runner/work/avalanchego-internal/avalanchego-internal/network/peer/peer.go:310
    (truncated)

Is there a test case we can run to recreate the problem?

I'll work on getting a reproducible test case in the next couple days.

StephenButtolph commented 1 year ago

After writing a very short program I was able to replicate conn.Close hanging - but this time dlv points to syscall.Close (which seems more in-line with the reported behavior of SO_LINGER...)

goroutines.txt

I got this from running this server:

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    fmt.Println("starting to listen")
    listener, err := net.Listen("tcp", ":7777")
    if err != nil {
        panic(err)
    }
    fmt.Println("listening")

    fmt.Println("waiting to accept")
    conn, err := listener.Accept()
    if err != nil {
        panic(err)
    }
    fmt.Println("accepted connection")

    fmt.Println("setting linger to 15s")
    tcpConn := conn.(*net.TCPConn)
    err = tcpConn.SetLinger(1500)
    if err != nil {
        panic(err)
    }
    fmt.Println("set linger to 15s")

    fmt.Println("writing some data")
    msg := make([]byte, 1<<20)
    n, err := conn.Write(msg)
    if err != nil {
        panic(err)
    }
    fmt.Printf("wrote %d bytes of data\n", n)

    startClose := time.Now()
    fmt.Println("starting to close the connection")
    err = conn.Close()
    if err != nil {
        panic(err)
    }
    fmt.Printf("closing the connection took %s\n", time.Since(startClose))
}

Along with running the client from the previously mentioned SO_LINGER tests: https://github.com/nybek/linger-tools/blob/master/linger-client.c with the arguments -i 127.0.0.1.

The output of the server should look like:

starting to listen
listening
waiting to accept
accepted connection
setting linger to 15s
set linger to 15s
writing some data
wrote 1048576 bytes of data
starting to close the connection
closing the connection took 15.082815133s

This was all run using the go version + go env listed at the beginning of the issue.

I feel like this is already a deviation from the documented behavior on SetLinger... But I'll keep looking into reproducing the hang at https://github.com/golang/go/blob/master/src/internal/poll/fd_unix.go#L116.

StephenButtolph commented 1 year ago

Here is a pure golang example that includes both the server and the client and has the same results as above:

package main

import (
    "fmt"
    "net"
    "sync"
    "time"
)

func main() {
    fmt.Println("starting to listen")
    listener, err := net.Listen("tcp", ":")
    if err != nil {
        panic(err)
    }
    fmt.Println("listening")
    defer listener.Close()

    addr := listener.Addr()

    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        fmt.Println("waiting to accept")
        conn, err := listener.Accept()
        if err != nil {
            panic(err)
        }
        fmt.Println("accepted connection")

        fmt.Println("setting linger to 15s")
        tcpConn := conn.(*net.TCPConn)
        err = tcpConn.SetLinger(15)
        if err != nil {
            panic(err)
        }
        fmt.Println("set linger to 15s")

        fmt.Println("writing some data")
        msg := make([]byte, 1<<20)
        n, err := conn.Write(msg)
        if err != nil {
            panic(err)
        }
        fmt.Printf("wrote %d bytes of data\n", n)

        startClose := time.Now()
        fmt.Println("starting to close the connection")
        err = conn.Close()
        if err != nil {
            panic(err)
        }
        fmt.Printf("closing the connection took %s\n", time.Since(startClose))
        wg.Done()
    }()

    conn, err := net.Dial("tcp", addr.String())
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    wg.Wait()
}
StephenButtolph commented 1 year ago

Ok, I think I'm able to fully close the loop on this now. This program replicates the blocking on runtime_Semacquire.

Here is the output from dlv: goroutines.txt

package main

import (
    "fmt"
    "net"
    "sync"
    "time"
)

func main() {
    fmt.Println("starting to listen")
    listener, err := net.Listen("tcp", ":")
    if err != nil {
        panic(err)
    }
    fmt.Println("listening")
    defer listener.Close()

    addr := listener.Addr()

    var wg sync.WaitGroup
    wg.Add(1)
    go func() {
        fmt.Println("waiting to accept")
        conn, err := listener.Accept()
        if err != nil {
            panic(err)
        }
        fmt.Println("accepted connection")

        fmt.Println("setting linger to 15s")
        tcpConn := conn.(*net.TCPConn)
        err = tcpConn.SetLinger(1500)
        if err != nil {
            panic(err)
        }
        fmt.Println("set linger to 15s")

        go func() {
            fmt.Println("starting read")
            _, _ = conn.Read([]byte{0})
            fmt.Println("exited read")
        }()

        fmt.Println("writing some data")
        msg := make([]byte, 1<<20)
        n, err := conn.Write(msg)
        if err != nil {
            panic(err)
        }
        fmt.Printf("wrote %d bytes of data\n", n)

        startClose := time.Now()
        fmt.Println("starting to close the connection")
        err = conn.Close()
        if err != nil {
            panic(err)
        }
        fmt.Printf("closing the connection took %s\n", time.Since(startClose))
        wg.Done()
    }()

    conn, err := net.Dial("tcp", addr.String())
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    wg.Wait()
}

It seems like the call to syscall.Close moved into the Read call, and conn.Close is waiting for Read to return.

ianlancetaylor commented 1 year ago

Thanks. For me that program blocks for 15 seconds in the call to syscall.Close, which is consistent with what you said earlier. I guess that Linux behaves that way, which I was not previously aware of. That makes it seems that there is no bug from Go's perspective. There isn't much that Go can do if the system call blocks.

But I guess we can mention that in the SetLinger docs.

panjf2000 commented 1 year ago

About the blocking on close(2) with SO_LINGER, see this previous comment:

UPDATE:

I did a little more in-depth research:

  • On BSD and other Unix-like OS's: calling close(2) on a non-blocking socket does not block, it returns immediately and the kernel tries sending the leftover data to the peer.
  • On Linux: calling close(2) on a non-blocking socket may block for the duration (seconds) to linger.

In fact, a socket in non-blocking mode will not block calling close() even with SO_LINGER set, and Go manages to set SOCK_NONBLOCK on each socket internally, therefore net.Conn with SO_LINGER should not block when it calls Close().

From my point of view, the comment on SetLinger might be a bit too concise and could be better worded but I would consider it accurate.

Or, are you actually experiencing the blocking from net.Conn.Close() after calling SetLinger with sec > 0?

Also, I've run the test code above on my macOS and it didn't reproduce this issue, which testifies my updated comment.

gopherbot commented 1 year ago

Change https://go.dev/cl/473915 mentions this issue: net: indicate the exeception on Linux of Close blocking with SO_LINGER

panjf2000 commented 1 year ago

Thank you for all the effort here. @StephenButtolph