Open kevinburke1 opened 3 years ago
Update: I can reproduce this reliably now. I think this is actually to the point where it's reasonable for someone else to try to reproduce it. I am running lib/pq at tip and Go at tip; sqlc is not involved anymore.
func TestGetCompany(t *testing.T) {
    t.Parallel()
    conn := testkit.OpenDB(t)
    t.Cleanup(func() { conn.Close() })
    for j := 0; j < 20; j++ {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                err := getCompany(conn.DB)
                if err != nil {
                    panic(err)
                }
            }()
        }
        wg.Wait()
    }
}
Source code:
func getCompany(qs *sql.DB) error {
    group, errctx := errgroup.WithContext(context.Background())
    group.Go(func() error {
        var s string
        // Scan errors are deliberately ignored; errStopRacing is a sentinel
        // error defined elsewhere, and returning it cancels errctx, which
        // races the other goroutine's in-flight query.
        qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Go(func() error {
        var s string
        qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Wait()
    return nil
}
And then in the database (Postgres), you need:
CREATE TYPE controller_status AS ENUM ('online', 'offline');
I am running with the following test invocation:
GORACE="halt_on_error=1" command time go test -race -count 500 -run 'TestGetCompany'
Here's the current data race stack trace.
==================
WARNING: DATA RACE
Write at 0x00c00027f628 by goroutine 129:
runtime.slicecopy()
/Users/kevin/go/src/runtime/slice.go:247 +0x0
github.com/lib/pq.(*writeBuf).string()
/Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/buf.go:69 +0x1ae
github.com/lib/pq.(*conn).simpleQuery()
/Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:665 +0x114
github.com/lib/pq.(*conn).query()
/Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:862 +0x61c
github.com/lib/pq.(*conn).QueryContext()
/Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn_go18.go:21 +0x207
database/sql.ctxDriverQuery()
/Users/kevin/go/src/database/sql/ctxutil.go:48 +0x2f2
database/sql.(*DB).queryDC.func1()
/Users/kevin/go/src/database/sql/sql.go:1651 +0x2d5
database/sql.withLock()
/Users/kevin/go/src/database/sql/sql.go:3295 +0x7e
database/sql.(*DB).queryDC()
/Users/kevin/go/src/database/sql/sql.go:1645 +0x833
database/sql.(*DB).query()
/Users/kevin/go/src/database/sql/sql.go:1628 +0x18a
database/sql.(*DB).QueryContext()
/Users/kevin/go/src/database/sql/sql.go:1605 +0xe9
database/sql.(*DB).QueryRowContext()
/Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
github.com/meterup/api/src/api.getCompany.func2()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
golang.org/x/sync/errgroup.(*Group).Go.func1()
/Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94
Previous read at 0x00c00027f62b by goroutine 55:
runtime.slicebytetostring()
/Users/kevin/go/src/runtime/string.go:80 +0x0
database/sql.convertAssignRows()
/Users/kevin/go/src/database/sql/convert.go:249 +0x2e84
database/sql.(*Rows).Scan()
/Users/kevin/go/src/database/sql/sql.go:3145 +0x37d
database/sql.(*Row).Scan()
/Users/kevin/go/src/database/sql/sql.go:3238 +0x244
github.com/meterup/api/src/api.getCompany.func2()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x18b
database/sql.(*DB).QueryRowContext()
/Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
github.com/meterup/api/src/api.getCompany.func2()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
golang.org/x/sync/errgroup.(*Group).Go.func1()
/Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94
Goroutine 129 (running) created at:
golang.org/x/sync/errgroup.(*Group).Go()
/Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
github.com/meterup/api/src/api.getCompany()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
github.com/meterup/api/src/api.TestGetCompany.func2()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99
Goroutine 55 (finished) created at:
golang.org/x/sync/errgroup.(*Group).Go()
/Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
github.com/meterup/api/src/api.getCompany()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
github.com/meterup/api/src/api.TestGetCompany.func2()
/Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99
Okay, I figured out where the issue is and updated the description in the ticket. One proposed fix would be to add a sync.Mutex to the *rows object or to the *conn object and use it to guard reads and writes of *buf, but the lack of synchronization primitives makes me hesitate about that a little bit.
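To sketch what that could look like (hypothetical names and a stripped-down type, not lib/pq's actual *conn), the mutex would sit next to the buffer and every path that reads or writes it would take the lock:

// Hypothetical sketch of the proposed guard; field and method names are
// made up and the real driver has many more moving parts.
type conn struct {
    bufMu sync.Mutex // guards buf
    buf   []byte
}

// writeString stands in for the racing write (writeBuf.string / simpleQuery
// in the trace above).
func (cn *conn) writeString(s string) {
    cn.bufMu.Lock()
    defer cn.bufMu.Unlock()
    cn.buf = append(cn.buf, s...)
}

// bufBytes hands callers a copy so readers (e.g. the Scan path) never
// alias cn.buf while another goroutine is refilling it.
func (cn *conn) bufBytes() []byte {
    cn.bufMu.Lock()
    defer cn.bufMu.Unlock()
    out := make([]byte, len(cn.buf))
    copy(out, cn.buf)
    return out
}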
Hi, I believe we are experiencing this issue as well. When the context gets cancelled, sometimes the returned error is context canceled, but many times it's driver: bad connection.
In my debugging attempts (and I'm not good at this) I found that one goroutine sets the connection as bad while another goroutine is closing (?) the rows. When the second one checks the connection status it can already have been set as bad, so instead of returning a context-related error it returns a bad connection error.
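As a caller-side workaround (my own sketch, nothing the driver provides; the queryErr helper name is made up), you can prefer the context's error whenever the context has already been cancelled and the pool surfaces driver.ErrBadConn:

import (
    "context"
    "database/sql/driver"
    "errors"
)

// queryErr reports the cancellation instead of "driver: bad connection"
// when the context is already done, assuming the bad-connection error is
// the sentinel driver.ErrBadConn (or wraps it).
func queryErr(ctx context.Context, err error) error {
    if err != nil && ctx.Err() != nil && errors.Is(err, driver.ErrBadConn) {
        return ctx.Err() // context.Canceled or context.DeadlineExceeded
    }
    return err
}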
The Go code shown above can produce a data race in lib/pq (it's nonsense as-is because I simplified it a lot from a more complicated example), with the test rig and the test invocation also shown above.
I think the following sequence of events is what's happening:
1. Thread 1: retrieves a database row and invokes rows.Next() / sql.Row.Scan().
2. Thread 1: rows.Next invokes getBad(); the conn is not marked as bad.
3. Thread 1: since we are attempting to read an enum type, we read directly from cn.buf in convertAssign in sql/convert.go.
4. Thread 2 (database/sql thread): receives the context cancellation in awaitDone() on sql.go line 2829.
5. Thread 2: invokes rows.Close().
6. Thread 2: rows.Close invokes rows.Next.
7. Thread 2: rows.Next calls getBad; the conn is not marked as bad.
8. Thread 2: calls recvMessage, which attempts to write to cn.buf (by invoking io.ReadFull). This access is not synchronized with Thread 1.
9. Thread 3 (lib/pq watchCancel background thread): receives the ctx.Done() message.
10. Thread 3: calls setBad().
The way you expect the code to run, Thread 3 is supposed to mark the conn as bad before Thread 1 or Thread 2 reads a database row. This happens most of the time, but not all the time, and it's easy to reproduce the error under the right circumstances.
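To make the shape of the race concrete, here's a standalone toy program (my own illustration, not lib/pq code) with the same access pattern: one goroutine refills a shared byte slice via io.ReadFull while another converts it to a string, with nothing ordering the two; running it under the race detector should flag the conflicting accesses.

package main

import (
    "bytes"
    "io"
    "sync"
)

func main() {
    buf := make([]byte, 6)
    src := bytes.NewReader(bytes.Repeat([]byte("online"), 1000))

    var wg sync.WaitGroup
    wg.Add(2)
    go func() { // stands in for Thread 2: recvMessage filling cn.buf
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            io.ReadFull(src, buf)
        }
    }()
    go func() { // stands in for Thread 1: convertAssign turning cn.buf into a string
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            _ = string(buf)
        }
    }()
    wg.Wait()
}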
I can't reproduce it when reading a string type from the database instead of an enum. I think the issue is the unsynchronized calls to *rows.Next: one from normal database operation and the other from database/sql attempting to close the *Rows because the query has been canceled. You could fix it by adding a sync.Mutex to the *Rows object and locking it while someone is calling Next(), but I'm not sure what impact that would have on memory use or performance.
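A rough sketch of that shape (a stand-in type with made-up fields, not lib/pq's real *rows; its existing read and close logic is reduced to function fields here):

import (
    "database/sql/driver"
    "sync"
)

// rows stands in for lib/pq's *rows; the mutex serializes the normal Scan
// path and the Close path that database/sql runs on context cancellation,
// so the two can no longer touch the connection's buffer concurrently.
type rows struct {
    mu       sync.Mutex
    readNext func(dest []driver.Value) error // existing, unguarded read logic
    closeFn  func() error                    // existing close logic
}

func (rs *rows) Next(dest []driver.Value) error {
    rs.mu.Lock()
    defer rs.mu.Unlock()
    return rs.readNext(dest)
}

func (rs *rows) Close() error {
    rs.mu.Lock()
    defer rs.mu.Unlock()
    return rs.closeFn()
}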