golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
123.75k stars 17.63k forks source link

database/sql: panic in `database/sql.(*connRequestSet).deleteIndex` #68949

Closed nklaassen closed 2 months ago

nklaassen commented 2 months ago

Go version

go version go1.23.0 darwin/arm64

Output of go env in your module/workspace:

go version go1.23.0 darwin/arm64
Nics-MacBook-Pro:teleport nic$ go env
GO111MODULE=''
GOARCH='arm64'
GOBIN='/Users/nic/go/bin'
GOCACHE='/Users/nic/Library/Caches/go-build'
GOENV='/Users/nic/Library/Application Support/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='arm64'
GOHOSTOS='darwin'
GOINSECURE=''
GOMODCACHE='/Users/nic/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='darwin'
GOPATH='/Users/nic/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/Users/nic/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.0.darwin-arm64'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/Users/nic/go/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.0.darwin-arm64/pkg/tool/darwin_arm64'
GOVCS=''
GOVERSION='go1.23.0'
GODEBUG=''
GOTELEMETRY='local'
GOTELEMETRYDIR='/Users/nic/Library/Application Support/go/telemetry'
GCCGO='gccgo'
GOARM64='v8.0'
AR='ar'
CC='clang'
CXX='clang++'
CGO_ENABLED='1'
GOMOD='/Users/nic/src/teleport/go.mod'
GOWORK='/Users/nic/src/teleport/go.work'
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -arch arm64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -ffile-prefix-map=/var/folders/x_/33kp49n52rg8hs97k5m96rnw0000gn/T/go-build289495391=/tmp/go-build -gno-record-gcc-switches -fno-common'

What did you do?

One of our unit tests seems to start a transaction concurrently with closing the database, this sometimes causes a panic here: https://github.com/golang/go/blob/a4cb37d4afd4b6b386ed7b51466c8c57c6045f9c/src/database/sql/sql.go#L3644

I haven't been able to reproduce this with db.Conn db.Close calls yet, but I do have a unit test in https://github.com/golang/go/pull/68953 using database/sql.(*connRequestSet).CloseAndRemoveAll and database/sql.(*connRequestSet).Delete that reliably reproduces the panic.

Update: reproduced with exported methods on database/sql.DB here: https://go.dev/play/p/fHZFq6zgAV4

What did you see happen?

This looks like a race condition in database/sql.(*DB).conn:

  1. In (*DB).conn here delHandle is set to the result of db.connRequests.Add(req), and then db.mu is unlocked
  2. A concurrent call to db.Close will call db.connRequests.CloseAndRemoveAll here
  3. (*DB).conn re-locks db.mu here and calls db.connRequests.Delete(delHandle)
  4. delHandle.idx is positive and no bounds checking is done before trying to index into the slice which was set to nil by (*connRequestSet).CloseAndRemoveAll, resulting in the panic.

Here's a full stack trace (from https://github.com/gravitational/teleport/actions/runs/10427126860/job/28881284822):

panic: runtime error: index out of range [0] with length 0

goroutine 5494 [running]:
database/sql.(*connRequestSet).deleteIndex(...)
    /opt/go/src/database/sql/sql.go:3644
database/sql.(*connRequestSet).Delete(...)
    /opt/go/src/database/sql/sql.go:3638
database/sql.(*DB).conn(0xc001569380, {0x15fa19c0, 0xc0031d4000}, 0x1)
    /opt/go/src/database/sql/sql.go:1366 +0x1625
database/sql.(*DB).begin(0xc001569380, {0x15fa19c0, 0xc0031d4000}, 0x0, 0x1)
    /opt/go/src/database/sql/sql.go:1883 +0x57
database/sql.(*DB).BeginTx.func1(0x1)
    /opt/go/src/database/sql/sql.go:1866 +0x7a
database/sql.(*DB).retry(0x7fa4fef225c8?, 0xc002c0b2c8)
    /opt/go/src/database/sql/sql.go:1568 +0x4b
database/sql.(*DB).BeginTx(0xc001569380, {0x15fa19c0, 0xc0031d4000}, 0x0)
    /opt/go/src/database/sql/sql.go:1865 +0x108
github.com/gravitational/teleport/lib/backend/lite.(*Backend).inTransaction(0xc0033cde40, {0x15fa19c0, 0xc0031d4000}, 0xc002c0b498)
    /__w/teleport/teleport/lib/backend/lite/lite.go:949 +0x155
github.com/gravitational/teleport/lib/backend/lite.(*Backend).Get(0xc0033cde40, {0x15fa19c0, 0xc0031d4000}, {0xc003a9d440, 0x13, 0x18})
    /__w/teleport/teleport/lib/backend/lite/lite.go:598 +0x1e9
github.com/gravitational/teleport/lib/auth/storage.(*ProcessStorage).GetState(0xc0000575c0, {0x15fa19c0, 0xc0031d4000}, {0x14ac87f5, 0x5})
    /__w/teleport/teleport/lib/auth/storage/storage.go:91 +0x264
github.com/gravitational/teleport/lib/service.(*TeleportProcess).initAuthService.func5()
    /__w/teleport/teleport/lib/service/service.go:2486 +0x2db
github.com/gravitational/teleport/lib/srv.(*Heartbeat).fetch(0xc002010ea0)
    /__w/teleport/teleport/lib/srv/heartbeat.go:339 +0x56
github.com/gravitational/teleport/lib/srv.(*Heartbeat).fetchAndAnnounce(0xc002010ea0)
    /__w/teleport/teleport/lib/srv/heartbeat.go:624 +0x2f
github.com/gravitational/teleport/lib/srv.(*Heartbeat).Run(0xc002010ea0)
    /__w/teleport/teleport/lib/srv/heartbeat.go:282 +0xe8
github.com/gravitational/teleport/lib/service.(*LocalService).Serve(0xc0022c37c0)
    /__w/teleport/teleport/lib/service/supervisor.go:588 +0x36
github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve.func1()
    /__w/teleport/teleport/lib/service/supervisor.go:313 +0x489
created by github.com/gravitational/teleport/lib/service.(*LocalSupervisor).serve in goroutine 4831
    /__w/teleport/teleport/lib/service/supervisor.go:302 +0x105

What did you expect to see?

I expected no panic. sql.DB is advertised as "safe for concurrent use by multiple goroutines"

gabyhelp commented 2 months ago

Related Issues and Documentation

(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)

nklaassen commented 2 months ago

Looks like this code was introduced in https://github.com/golang/go/commit/190d0d3e69b113bea0b6b604ba2f0beb62c08741 by @bradfitz

cherrymui commented 2 months ago

cc @bradfitz @kardianos @kevinburke

gopherbot commented 2 months ago

Change https://go.dev/cl/607238 mentions this issue: database/sql: fix panic with concurrent Conn and Close

nklaassen commented 2 months ago

Update: I managed to reproduce the panic very reliably in the go playground using only exported methods https://go.dev/play/p/fHZFq6zgAV4

bradfitz commented 2 months ago

@cherrymui, this probably warrants a backport to the 1.23 branch.

cherrymui commented 2 months ago

@gopherbot please backport this to Go 1.23. This is a regression in 1.23. Thanks.

gopherbot commented 2 months ago

Backport issue(s) opened: #69041 (for 1.23).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

gopherbot commented 2 months ago

Change https://go.dev/cl/609255 mentions this issue: [release-branch.go1.23] database/sql: fix panic with concurrent Conn and Close