database/sql: few usable connections in connection pool after network packet loss event #64614

Open xshipeng opened 8 months ago

xshipeng commented 8 months ago

Go version

go version go1.20.6 linux/amd64

What operating system and processor architecture are you using (go env)?

GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/shipeng/.cache/go-build"
GOENV="/home/shipeng/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/pay/src/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/pay/src/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GOVCS=""
GOVERSION="go1.20.6"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-O2 -g"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-O2 -g"
CGO_FFLAGS="-O2 -g"
CGO_LDFLAGS="-O2 -g"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/pay/tmp/go-build1095428125=/tmp/go-build -gno-record-gcc-switches"

What did you do?

We're using pgx v4 as the driver and database/sql as the connection pool for client requests and database connections. Our server and database are in different regions, which adds 50-60 ms of network latency. We're hitting an issue where the host can't recover and complete client queries within their timeouts, even after the network has recovered from a prior packet loss event. The issue can be reproduced with the script described here: https://github.com/spencer-x/go-database-sql-issue#ways-to-reproduce-the-issue-in-linux-machine.
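Roughly, the setup looks like the sketch below. The DSN, pool limits, and timeout here are illustrative; the actual values and load pattern are in the linked repro script.

package main

import (
    "context"
    "database/sql"
    "log"
    "time"

    _ "github.com/jackc/pgx/v4/stdlib" // registers the "pgx" database/sql driver
)

func main() {
    // Illustrative DSN; the real connection details are in the repro script.
    db, err := sql.Open("pgx", "postgres://user:pass@remote-db.example.com:5432/app")
    if err != nil {
        log.Fatal(err)
    }
    db.SetMaxOpenConns(200) // matches the MaxOpenConnections shown in the stats below
    db.SetMaxIdleConns(200)

    for {
        // Each simulated client request runs under a short context timeout, so
        // queries are canceled once the cross-region round trips become too slow.
        ctx, cancel := context.WithTimeout(context.Background(), 500*time.Millisecond)
        var one int
        if err := db.QueryRowContext(ctx, "SELECT 1").Scan(&one); err != nil {
            log.Printf("query failed: %v", err)
        }
        cancel()
    }
}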

What did you expect to see?

The service host should be able to serve requests within their timeout limits after the network recovers.

What did you see instead?

Even after the network between the host and the database recovered, client requests continued to time out. The number of open connections reported by the connection pool stats (200) was considerably higher than the number of connections actually reported by the PostgreSQL database (20-30).
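For reference, one way to compare the pool's view with the server's view is a check like the sketch below; the exact query used in our monitoring may differ.

package main

import (
    "context"
    "database/sql"
    "log"
)

// compareOpenConnections prints the pool's idea of open connections next to
// the backend count PostgreSQL reports for the current database.
func compareOpenConnections(ctx context.Context, db *sql.DB) {
    poolOpen := db.Stats().OpenConnections

    var serverOpen int
    err := db.QueryRowContext(ctx,
        "SELECT count(*) FROM pg_stat_activity WHERE datname = current_database()",
    ).Scan(&serverOpen)
    if err != nil {
        log.Printf("pg_stat_activity query failed: %v", err)
        return
    }
    log.Printf("pool OpenConnections=%d, server connections=%d", poolOpen, serverOpen)
}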

We looked into the issue and found that many connections were closed during the packet loss event. Even after the network recovered, pgx kept closing connections whenever clients canceled queries due to context timeouts. The connection opener in database/sql rapidly incremented db.numOpen, but new connections were opened sequentially and therefore relatively slowly. Consequently, few usable connections were available from the connection pool.
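For reference, the relevant database/sql internals (paraphrased from the go1.20 sources and trimmed for brevity) look roughly like this: numOpen is bumped optimistically once per pending request, while a single opener goroutine dials the connections one at a time.

// maybeOpenNewConnections is called with db.mu held whenever connections are
// needed; it optimistically increments numOpen and signals the opener.
func (db *DB) maybeOpenNewConnections() {
    numRequests := len(db.connRequests)
    if db.maxOpen > 0 {
        numCanOpen := db.maxOpen - db.numOpen
        if numRequests > numCanOpen {
            numRequests = numCanOpen
        }
    }
    for numRequests > 0 {
        db.numOpen++ // optimistically
        numRequests--
        db.openerCh <- struct{}{}
    }
}

// connectionOpener runs in a single goroutine and opens the requested
// connections sequentially, so a large backlog drains slowly when each dial
// costs a 50-60 ms round trip (or more, during packet loss).
func (db *DB) connectionOpener(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-db.openerCh:
            db.openNewConnection(ctx)
        }
    }
}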

seankhliao commented 8 months ago

Maybe I'm not understanding the issue: is it that the driver isn't returning database/sql/driver.ErrBadConn to inform database/sql.DB to drop the connection, or is it something else?

xshipeng commented 8 months ago

The pgx driver returned database/sql/driver.ErrBadConn from ResetSession and in many other places, so database/sql.DB knew the connection was closed.
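For illustration, this is roughly what the driver side does: a connection that also implements driver.SessionResetter returns driver.ErrBadConn when it knows it is dead, and the pool then discards it. The type below is hypothetical, not pgx's actual implementation.

package hypotheticaldriver

import (
    "context"
    "database/sql/driver"
)

// conn is a hypothetical driver connection; pgx's real type and state tracking differ.
type conn struct {
    dead bool // set once the underlying network connection has failed or been closed
}

// ResetSession implements driver.SessionResetter. database/sql calls it before
// reusing a pooled connection; returning driver.ErrBadConn tells the pool to
// drop this connection instead of handing it back out.
func (c *conn) ResetSession(ctx context.Context) error {
    if c.dead {
        return driver.ErrBadConn
    }
    return nil
}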

To help debug further, I exposed some internal counters in the database/sql connection pool and printed them out.

We can see that NumPendingOpenConn (the number of pending messages in openerCh) stays high even after the packet loss event ends.

time: 103.00013455s, err/req (97/108), NumConnClosed: 288, NumConnRequests: 296, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8702 WaitDuration:5h48m24.850698435s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 104.000693703s, err/req (79/89), NumConnClosed: 289, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8792 WaitDuration:5h52m36.032147827s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 105.000788483s, err/req (104/114), NumConnClosed: 291, NumConnRequests: 296, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:8906 WaitDuration:5h58m0.505383481s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 106.000719511s, err/req (98/111), NumConnClosed: 291, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9017 WaitDuration:6h3m15.139842187s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 107.000629773s, err/req (77/94), NumConnClosed: 291, NumConnRequests: 293, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9111 WaitDuration:6h7m35.933963032s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 108.000874808s, err/req (99/111), NumConnClosed: 294, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9223 WaitDuration:6h12m47.116478313s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 109.00060339s, err/req (97/110), NumConnClosed: 294, NumConnRequests: 293, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9333 WaitDuration:6h17m54.046824665s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 110.000664355s, err/req (84/102), NumConnClosed: 295, NumConnRequests: 291, NumPendingOpenConn: 192, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9433 WaitDuration:6h22m30.295615995s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 111.000503913s, err/req (95/107), NumConnClosed: 300, NumConnRequests: 296, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9544 WaitDuration:6h27m24.105551002s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 112.000255612s, err/req (96/105), NumConnClosed: 301, NumConnRequests: 295, NumPendingOpenConn: 195, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9650 WaitDuration:6h32m27.629826103s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 113.0007683s, err/req (89/97), NumConnClosed: 301, NumConnRequests: 295, NumPendingOpenConn: 194, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9748 WaitDuration:6h37m5.057659169s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 114.000850286s, err/req (98/112), NumConnClosed: 304, NumConnRequests: 296, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9861 WaitDuration:6h42m14.895479289s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
time: 115.000485688s, err/req (95/103), NumConnClosed: 306, NumConnRequests: 297, NumPendingOpenConn: 196, DB stats: {MaxOpenConnections:200 OpenConnections:200 InUse:200 Idle:0 WaitCount:9967 WaitDuration:6h47m12.84743957s MaxIdleClosed:14 MaxIdleTimeClosed:0 MaxLifetimeClosed:0}
xshipeng commented 8 months ago

One potential quick fix is to open connections in parallel like this. With that change, NumPendingOpenConn is cleared quickly and the pool recovers soon after the packet loss event ends. I can submit a PR if needed.
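As a minimal sketch of that idea (paraphrasing the database/sql opener; not necessarily the exact change referenced above), each signal on openerCh would start its own dial instead of waiting behind the previous one:

// Sketch only: open each requested connection in its own goroutine so a
// backlog of ~195 pending opens no longer drains one 50-60 ms dial at a time.
func (db *DB) connectionOpener(ctx context.Context) {
    for {
        select {
        case <-ctx.Done():
            return
        case <-db.openerCh:
            go db.openNewConnection(ctx) // parallel instead of sequential
        }
    }
}

Because db.numOpen is already incremented before openerCh is signaled, the number of concurrent dials stays bounded by MaxOpenConns.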

seankhliao commented 8 months ago

cc @kardianos