lib / pq

Pure Go Postgres driver for database/sql
https://pkg.go.dev/github.com/lib/pq
MIT License
8.89k stars 906 forks source link

data race from context cancelation and Next() in two separate goroutines #1021

Open kevinburke1 opened 3 years ago

kevinburke1 commented 3 years ago

The following Go code can produce a data race in lib/pq (it's nonsense as is because I simplified it a lot from a more complicated example.)

func getCompany(qs *sql.DB) error {
    group, errctx := errgroup.WithContext(context.Background())
    group.Go(func() error {
        var s string
        qs.QueryRowContext(errctx, "SELECT 'offline'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Go(func() error {
        var s string
        qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Wait()
    return nil
}

With the following test rig:

func TestGetCompany(t *testing.T) {
    t.Parallel()
    conn := testkit.OpenDB(t)
    t.Cleanup(func() { conn.Close() })
    for j := 0; j < 20; j++ {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                err := getCompany(conn.DB)
                if err != nil {
                    panic(err)
                }
            }()
        }
        wg.Wait()
    }
}

And the following test invocation:

GORACE="halt_on_error=1" command time go test -race -count 500 -run 'TestGetCompany' .

I think the following sequence of events is what's happening:

Thread 1: retrieves database row and invokes rows.Next() / sql.Row.Scan() Thread 1: rows.Next invokes getBad(), conn is not marked as bad Thread 1: since we are attempting to read an enum type we read directly from cn.buf in convertAssign in sql/convert.go. Thread 2 (database/sql thread): receives context cancellation in awaitDone() on sql.go line 2829. Thread 2: invokes rows.Close(). Thread 2: rows.Close invokes rows.Next Thread 2: rows.Next calls getBad, conn not marked as bad Thread 2: calls recvMessage, which attempts to write to cn.buf (by invoking io.ReadFull). This access is not synchronized with thread 1 Thread 3: (lib/pq watchCancel background thread) receives ctx.Done() message Thread 3: calls setBad()

The way you expect the code to run, Thread 3 is supposed to mark the conn as bad before Thread 1 or Thread 2 reads a database row. This happens most of the time, but not all the time, and it's easy to reproduce the error under the right circumstances.

I can't reproduce if trying to read a string type from the database instead of an enum. I think the issue is the unsynchronized calls to *rows.Next - one from normal database operation and the other from database/sql attempting to close the *Rows because the query has been canceled.

You could fix it by adding a sync.Mutex to the *Rows object and locking it while someone is calling Next(), but I'm not sure what impact that would have on memory space or performance.

kevinburke1 commented 3 years ago

Update, I can reproduce this reliably now. I think this is actually to the point where it's reasonable for someone else to try to reproduce. I am running lib/pq at tip and Go at tip. sqlc is not involved anymore.

func TestGetCompany(t *testing.T) {
    t.Parallel()
    conn := testkit.OpenDB(t)
    t.Cleanup(func() { conn.Close() })
    for j := 0; j < 20; j++ {
        var wg sync.WaitGroup
        for i := 0; i < 100; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                err := getCompany(conn.DB)
                if err != nil {
                    panic(err)
                }
            }()
        }
        wg.Wait()
    }
}

Source code:

func getCompany(qs *sql.DB) error {
    group, errctx := errgroup.WithContext(context.Background())
    group.Go(func() error {
        var s string
        qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Go(func() error {
        var s string
        qs.QueryRowContext(errctx, "SELECT 'online'::controller_status").Scan(&s)
        return errStopRacing
    })
    group.Wait()
    return nil
}

And then in the database (Postgres), you need:

CREATE TYPE controller_status AS ENUM ('online', 'offline');

I am running with the following test invocation:

GORACE="halt_on_error=1" command time go test -race -count 500 -run 'TestGetCompany'

Here's the current data race stack trace.

==================
WARNING: DATA RACE
Write at 0x00c00027f628 by goroutine 129:
  runtime.slicecopy()
      /Users/kevin/go/src/runtime/slice.go:247 +0x0
  github.com/lib/pq.(*writeBuf).string()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/buf.go:69 +0x1ae
  github.com/lib/pq.(*conn).simpleQuery()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:665 +0x114
  github.com/lib/pq.(*conn).query()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn.go:862 +0x61c
  github.com/lib/pq.(*conn).QueryContext()
      /Users/kevin/src/github.com/meterup/api/vendor/github.com/lib/pq/conn_go18.go:21 +0x207
  database/sql.ctxDriverQuery()
      /Users/kevin/go/src/database/sql/ctxutil.go:48 +0x2f2
  database/sql.(*DB).queryDC.func1()
      /Users/kevin/go/src/database/sql/sql.go:1651 +0x2d5
  database/sql.withLock()
      /Users/kevin/go/src/database/sql/sql.go:3295 +0x7e
  database/sql.(*DB).queryDC()
      /Users/kevin/go/src/database/sql/sql.go:1645 +0x833
  database/sql.(*DB).query()
      /Users/kevin/go/src/database/sql/sql.go:1628 +0x18a
  database/sql.(*DB).QueryContext()
      /Users/kevin/go/src/database/sql/sql.go:1605 +0xe9
  database/sql.(*DB).QueryRowContext()
      /Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

Previous read at 0x00c00027f62b by goroutine 55:
  runtime.slicebytetostring()
      /Users/kevin/go/src/runtime/string.go:80 +0x0
  database/sql.convertAssignRows()
      /Users/kevin/go/src/database/sql/convert.go:249 +0x2e84
  database/sql.(*Rows).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3145 +0x37d
  database/sql.(*Row).Scan()
      /Users/kevin/go/src/database/sql/sql.go:3238 +0x244
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x18b
  database/sql.(*DB).QueryRowContext()
      /Users/kevin/go/src/database/sql/sql.go:1707 +0xf6
  github.com/meterup/api/src/api.getCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:163 +0x135
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x94

Goroutine 129 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
  github.com/meterup/api/src/api.getCompany()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
  github.com/meterup/api/src/api.TestGetCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99

Goroutine 55 (finished) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /Users/kevin/src/github.com/meterup/api/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
  github.com/meterup/api/src/api.getCompany()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access.go:161 +0x244
  github.com/meterup/api/src/api.TestGetCompany.func2()
      /Users/kevin/src/github.com/meterup/api/src/api/v1_network_access_test.go:190 +0x99
kevinburke1 commented 3 years ago

Okay, I figured out where the issue is and updated the description in the ticket. One proposed fix would be to add a sync.Mutex to a *rows object or to the *conn object and use it to guard reads and writes from *buf but the lack of synchronization primitives makes me hesitate about that a little bit.

AgustinBanchio commented 3 years ago

Hi, I believe we are experiencing this issue as well. When the context gets cancelled sometimes the returned error is context canceled but many times it's driver: bad connection.

In my debugging attempts (and I'm not good at this) I found that one goroutine sets the connection as bad while another goroutine is closing (?) the rows.

When the second one checks for the connection status it can already have been set as bad and instead of returning a context related error it returns a bad connection error.