golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123k stars 17.54k forks source link

net: more detailed TCP keep-alive configuration #62254

Closed neild closed 7 months ago

neild commented 1 year ago

TCP keep-alives are enabled by setting the SO_KEEPALIVE socket option. When keep-alives are enabled, the operating system will probe idle connections and close connections with an unresponsive peer.

Linux, FreeBSD, and Windows permit configuring keep-alive probes with the TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT socket options. (Darwin appears to call TCP_KEEPIDLE TCP_KEEPALIVE.) These options set the idle duration before the first probe is sent, the interval between probes, and the maximum number of probes to send before declaring a connection dead, respectively.

The net package permits configuring TCP keepalive behavior with:

Setting the keep-alive period sets both TCP_KEEPIDLE and TCP_KEEPINTVL. There is no way in the net package to set TCP_KEEPCNT or otherwise configure how long it takes to declare a connection dead. There is no way to set different values for the idle period and the probe interval.

Since Go 1.12 (https://go.dev/cl/107196), the net package has set TCP_KEEPIDLE and TCP_KEEPINTVL on all new sockets to 15 seconds by default.

This 15 second value is not appropriate for all uses. For example, #48622 complains that the default keep-alive settings cause excessive CPU consumption on mobile devices due to frequent probes. Reducing the time until the first keep-alive probe is sent with Dialer.KeepAlive also reduces the time between probes; setting the period to five minutes will cause the kernel to take ~50 minutes to detect a dead connection.

I propose that we change the net package functions which set the keep-alive period to set TCP_KEEPIDLE, but not TCP_KEEPINTVL. It is surprising and not useful for a "keep-alive period" of ten minutes to translate into a connection being declared dead only after several hours of unresponsiveness.

I propose that we also add a finer-grained API for configuring keep-alive behavior. I haven't exhaustively surveyed the operating systems we support, but Linux, Windows, Darwin, and at least one BSD all appear to support the same three settings. It's possible to set these via golang.org/x/sys, but providing a common way to configure them in the net package seems reasonable.

The following API also permits for enabling keep-alives on a connection while using the operating system default keep-alive configuration (set KeepAliveConfig.Enable, set the other fields to -1).

package net

// KeepAliveConfig contains TCP keep-alive options.
//
// If the Idle, Interval, or Count fields are zero, a default value is chosen.
// If a field is negative, the operating system default is used.
type KeepAliveConfig struct {
        // If Enable is true, keep-alive probes are enabled.
        Enable bool

        // Idle is the time that the connection must be idle before
        // the first keep-alive probe is sent.
        Idle time.Duration

        // Interval is the time between keep-alive probes.
        Interval time.Duration

        // Count is the maximum number of keep-alive probes that
        // should be sent before dropping a connection.
        Count int
}

type Dialer { // contains existing unchanged fields
        // KeepAlive specifies the interval between keep-alive
        // probes for an active network connection.
        //
        // KeepAlive is ignored if KeepAliveConfig.Enable is true.
        //
        // If zero, keep-alive probes are sent with a default value
        // (currently 15 seconds), if supported by the protocol and operating
        // system. Network protocols or operating systems that do
        // not support keep-alives ignore this field.
        // If negative and KeepAliveConfig.Enable is false, keep-alive probes are disabled.
        KeepAlive time.Duration

        // KeepAliveConfig specifies the keep-alive probe configuration
        // for an active network connection, when supported by the
        // protocol and operating system.
        //
        // If KeepAliveConfig.Enable is true, keep-alives probes are enabled.
        // If KeepAliveConfig.Enable is false and KeepAlive is negative,
        // keep-alive probes are disabled.
        KeepAliveConfig KeepAliveConfig
}

type ListenConfig {
        // same fields as Dialer above
}

// SetKeepAliveConfig configures keep-alive messages sent by the operating system.
func (c *TCPConn) SetKeepAliveConfig(config KeepAliveConfig) error {}
neild commented 1 year ago

I found the following survey of keep-alive behaviors to be useful: https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/

That post also covers the TCP_USER_TIMEOUT socket option, which configures another aspect of how dead TCP connections are detected. I'm not certain if anything other than Linux supports that option, so I've left it out of this proposal.

bcmills commented 1 year ago

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

neild commented 1 year ago

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

That was my thought. If a caller doesn't mind a platform not supporting an option, they can just not check the returned error.

panjf2000 commented 1 year ago

Presumably SetKeepAliveConfig would return a non-nil error if the configured keepalive behavior can't be provided on the current platform?

Maybe we should return an error that wraps errors. ErrUnsupported from #41198

rsc commented 11 months ago

This proposal has been added to the active column of the proposals project and will now be reviewed at the weekly proposal review meetings. — rsc for the proposal review group

rsc commented 11 months ago

This seems reasonable, well-motivated, and well-justified. Thank you. I think you can drop the "and KeepAliveConfig.Enable is false" in the KeepAlive docs, since it already said the field is ignored when Enable==true.

It would help to say what the current defaults are in the config struct docs.

And in Count maybe "should be sent" should be "can go unanswered".

neild commented 10 months ago

Updated proposal with the suggested changes.

I needed to pick default values for KeepAliveConfig fields. I went with the current behavior for Idle and Interval (15 seconds), and the Linux kernel default for Count (9). We could also say that the default for Count matches the current behavior of picking the OS default, but it seems less surprising for us to set all three.

package net

// KeepAliveConfig contains TCP keep-alive options.
//
// If the Idle, Interval, or Count fields are zero, a default value is chosen.
// If a field is negative, the operating system default is used.
type KeepAliveConfig struct {
        // If Enable is true, keep-alive probes are enabled.
        Enable bool

        // Idle is the time that the connection must be idle before
        // the first keep-alive probe is sent.
        // If zero, a default value of 15 seconds is used.
        Idle time.Duration

        // Interval is the time between keep-alive probes.
        // If zero, a default value of 15 seconds is used.
        Interval time.Duration

        // Count is the maximum number of keep-alive probes that
        // can go unanswered before dropping a connection.
        // If zero, a default value of 9 is used.
        Count int
}

type Dialer { // contains existing unchanged fields
        // KeepAlive specifies the interval between keep-alive
        // probes for an active network connection.
        //
        // KeepAlive is ignored if KeepAliveConfig.Enable is true.
        //
        // If zero, keep-alive probes are sent with a default value
        // (currently 15 seconds), if supported by the protocol and operating
        // system. Network protocols or operating systems that do
        // not support keep-alives ignore this field.
        // If negative, keep-alive probes are disabled.
        KeepAlive time.Duration

        // KeepAliveConfig specifies the keep-alive probe configuration
        // for an active network connection, when supported by the
        // protocol and operating system.
        //
        // If KeepAliveConfig.Enable is true, keep-alives probes are enabled.
        // If KeepAliveConfig.Enable is false and KeepAlive is negative,
        // keep-alive probes are disabled.
        KeepAliveConfig KeepAliveConfig
}

type ListenConfig {
        // same fields as Dialer above
}

// SetKeepAliveConfig configures keep-alive messages sent by the operating system.
func (c *TCPConn) SetKeepAliveConfig(config KeepAliveConfig) error {}
rsc commented 10 months ago

Based on the discussion above, this proposal seems like a likely accept. — rsc for the proposal review group

Proposal details are in https://github.com/golang/go/issues/62254#issuecomment-1791102281.

rsc commented 10 months ago

No change in consensus, so accepted. 🎉 This issue now tracks the work of implementing the proposal. — rsc for the proposal review group

Proposal details are in https://github.com/golang/go/issues/62254#issuecomment-1791102281.

panjf2000 commented 10 months ago

I've done some investigations about TCP keep-alive on various platforms and had a CL for this proposal drafted, I'd volunteer to implement this if you don't mind.

gopherbot commented 10 months ago

Change https://go.dev/cl/542275 mentions this issue: net: add KeepAliveConfig and implement SetKeepAliveConfig

odeke-em commented 9 months ago

Thanks everyone for the discourse! Looks like it is too late in the cycle to get in, forward rolling to Go1.23.

panjf2000 commented 9 months ago

Thanks everyone for the discourse! Looks like it is too late in the cycle to get in, forward rolling to Go1.23.

Actually, CL 542275 had gained all votes and been ready to be submitted the day before Go 1.22 release freeze, it just didn't get a chance to be submitted somehow. Given that the pain point in #48622 has been reported for over two years and the corresponding CL did finish the work on schedule, I'd like to propose a freeze exception of Go1.22 for this, cc @golang/release. Thanks!