jackc / pgx

PostgreSQL driver and toolkit for Go

Connection pool deadlocks with v5 (regression from v4) #1354

Closed · benweint closed this issue 1 year ago

benweint commented 1 year ago

Describe the bug

We're seeing an issue wherein we can consistently reproduce application hangs with pgx/v5 that do not occur with pgx/v4. The only changes to the application itself between the pgx/v4-based version that works and the pgx/v5 version that doesn't are those necessitated by the API differences between the two versions (namely, pgxpool.ConnectConfig vs. pgxpool.NewWithConfig).
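For concreteness, that migration is essentially a constructor rename; a minimal sketch (cfg and ctx as in the test case further down):

// pgx/v4
pool, err := pgxpool.ConnectConfig(ctx, cfg)

// pgx/v5
pool, err := pgxpool.NewWithConfig(ctx, cfg)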

To Reproduce

I don't have a compact and easily shareable reproduction case yet, but here's the outline of what we're doing:

  1. Deploy an application that uses pgx/v5 to read from a Postgres 14 database. In our case, the DB happens to be running on GCP's CloudSQL service, and the DB in question is configured using CloudSQL's high-availability feature.
  2. Confirm that the application is working as expected (able to read from the DB). In our case, all of the application's reads happen via the QueryRow method.
  3. Initiate a CloudSQL database failover as described here.
  4. Attempt to exercise the application again, and observe that DB calls all hang under the following stack trace:
goroutine 43048 [semacquire, 2 minutes]:
sync.runtime_SemacquireMutex(0x101f64a68?, 0x20?, 0x14000e38c08?)
    /Users/ben/.asdf/installs/golang/1.19/go/src/runtime/sema.go:77 +0x24
sync.(*Mutex).lockSlow(0x14000464618)
    /Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:171 +0x178
sync.(*Mutex).Lock(0x14000e38d08?)
    /Users/ben/.asdf/installs/golang/1.19/go/src/sync/mutex.go:90 +0x7c
github.com/jackc/puddle/v2.(*Pool[...]).Acquire(0x140001e08c0, {0x10135cf28, 0x14002b31320})
    /Users/ben/go/pkg/mod/github.com/jackc/puddle/v2@v2.0.0/pool.go:413 +0x2a4
github.com/jackc/pgx/v5/pgxpool.(*Pool).Acquire(0x1400018c3f0, {0x10135cf28, 0x14002b31320})
    /Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:496 +0x50
github.com/jackc/pgx/v5/pgxpool.(*Pool).QueryRow(0x1400115c48d?, {0x10135cf28, 0x14002b31320}, {0x100fc7b14, 0x4d}, {0x140019994e0, 0x1, 0x1})
    /Users/ben/go/pkg/mod/github.com/jackc/pgx/v5@v5.0.2/pgxpool/pool.go:623 +0x38
< ... snip - app code below here ... >

From what I can tell, all of the request handling goroutines are blocked under this same stack, and there don't seem to be any that are actually holding DB connections.
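(Aside: a full-goroutine dump like the one above can be captured by sending SIGQUIT to the process, or by exposing net/http/pprof and requesting /debug/pprof/goroutine?debug=2. A minimal sketch of the latter, added near app startup:)

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func init() {
    go func() {
        // then: curl 'localhost:6060/debug/pprof/goroutine?debug=2'
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
}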

I've also been able to reproduce this running entirely locally by interposing Toxiproxy between the application and the DB and adding a timeout toxic to the proxied connection. Even after the toxic is removed, the application will not recover once it's gotten into this state.

Interestingly, the application does recover if the toxic is only in place for a short time (a few seconds). Even with a high rate of incoming requests (500 reqs/s with a concurrency of 500), it seems to take anywhere from 10-60 seconds to get stuck like this, suggesting a timing-dependent issue of some kind.

I'm hoping to be able to winnow this down to an easily-shareable test case soon.

Expected behavior

I would expect the application to be impacted while the DB is unreachable or unresponsive, but the pool to become usable again once the DB recovers.

Actual behavior

If the DB unreachability goes on for long enough, the pool will eventually enter a state wherein any goroutine that attempts to Acquire a connection hangs indefinitely.

Version

Go: 1.19
pgx: v5.0.2 (puddle v2.0.0)
PostgreSQL: 14 (GCP CloudSQL)

benweint commented 1 year ago

Ok, here's a test case:

package main

import (
    "context"
    "fmt"
    "os"
    "sync"
    "time"

    "github.com/jackc/pgx/v5/pgxpool"
)

func main() {
    // urlExample := "postgres://username:password@localhost:5432/database_name"
    ctx := context.Background()

    cfg, err := pgxpool.ParseConfig(os.Getenv("DATABASE_URL"))
    if err != nil {
        panic(err)
    }

    cfg.MinConns = 100 // size the pool to match the worker concurrency below
    cfg.MaxConns = 100

    pool, err := pgxpool.NewWithConfig(ctx, cfg) // reuse the ctx created above
    if err != nil {
        panic(err)
    }
    defer pool.Close()

    concurrency := 100
    var wg sync.WaitGroup

    for i := 0; i < concurrency; i++ {
        wg.Add(1)
        go runQueries(ctx, fmt.Sprintf("%d", i), pool)
    }

    // The workers loop forever and never call wg.Done(), so this blocks
    // indefinitely by design, keeping main alive while the queries run.
    wg.Wait()
}

func runQueries(ctx context.Context, id string, pool *pgxpool.Pool) {
    for {
        runOnce(ctx, id, pool)
    }
}

func runOnce(ctx context.Context, id string, pool *pgxpool.Pool) {
    ctx, cancel := context.WithTimeout(ctx, 1*time.Second)
    defer cancel()

    var one int
    err := pool.QueryRow(ctx, "SELECT 1").Scan(&one)
    if err != nil {
        fmt.Printf("got error in %s: %v\n", id, err)
        return
    }
    fmt.Printf("ok %s!\n", id)
}

Run with:

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run .

If connected directly to a DB, this works fine. To reproduce the issue, the easiest thing I've found so far is to use Toxiproxy:

Start the server with toxiproxy-server. Create a proxy listening on some port, pointing to your local Postgres:

toxiproxy-cli create --listen localhost:5435 --upstream localhost:5432 pg

Run the test case again, pointed at the proxy listening port (5435 in this example) instead of directly at the DB. Verify that it works (you should see lots of ok! messages).
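That is, with the example values above:

DATABASE_URL=postgres://<user>:<pass>@localhost:5435/<db> go run .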

Once it's been running for a bit, add a timeout toxic like this:

toxiproxy-cli toxic add --type timeout --attribute timeout=3 --toxicName to pg

Let the test case continue running, and watch it eventually grind to a halt. You can remove the toxic like this:

toxiproxy-cli toxic remove --toxicName to pg   

... but the app is now dead and won't recover from this state.
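(For anyone who wants to script this repro end to end, the same steps can be driven from Go with Toxiproxy's client library; a rough sketch, assuming the Shopify client package and a toxiproxy-server on its default API port 8474:)

package main

import (
    "log"
    "time"

    toxiproxy "github.com/Shopify/toxiproxy/v2/client"
)

func main() {
    client := toxiproxy.NewClient("localhost:8474")

    // Same as: toxiproxy-cli create --listen localhost:5435 --upstream localhost:5432 pg
    proxy, err := client.CreateProxy("pg", "localhost:5435", "localhost:5432")
    if err != nil {
        log.Fatal(err)
    }

    // Same as: toxiproxy-cli toxic add --type timeout --attribute timeout=3 --toxicName to pg
    if _, err := proxy.AddToxic("to", "timeout", "downstream", 1.0, toxiproxy.Attributes{"timeout": 3}); err != nil {
        log.Fatal(err)
    }

    // Leave the toxic in place long enough for the pool to wedge.
    time.Sleep(30 * time.Second)

    // Same as: toxiproxy-cli toxic remove --toxicName to pg
    if err := proxy.RemoveToxic("to"); err != nil {
        log.Fatal(err)
    }
}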

benweint commented 1 year ago

Same test case with v4 works fine, in the sense that when the toxic is removed, it recovers.

benweint commented 1 year ago

I'm actually able to get this to reproduce even more easily (without the need for Toxiproxy) by just running the test case above with the Go race detector turned on. Even without messing with the DB connection, it reliably hangs after a few seconds in this mode on my machine.
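For reference, that's the same invocation with the -race flag:

DATABASE_URL=postgres://<user>:<pass>@localhost:<port>/<db> go run -race .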

jackc commented 1 year ago

@benweint Can you try this using puddle v2.1.0? I think it resolves the issue.
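(For anyone following along: the fixed puddle can be pulled in ahead of a pgx release with

go get github.com/jackc/puddle/v2@v2.1.0

which pins the newer version in your module's go.mod.)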

benweint commented 1 year ago

Thanks @jackc - it looks like puddle 2.1.0 does indeed make this test case work!

We had actually tried out the fix in https://github.com/jackc/puddle/pull/24 in a production-like environment (mirrored production traffic) and saw some bad connection-related behavior with that change, but it's possible that that was misdiagnosed or caused by something else at the application layer.

We should be able to try again with puddle v2.1.0 in that same environment next week to confirm whether what we saw recurs or if it's resolved.

(cc @jorgerasillo, who will probably be the one to try this next week)

jon-whit commented 1 year ago

We're encountering this issue as well 👍

jorgerasillo commented 1 year ago

Hi @jackc, we've been running w/ puddle v2.1.0 for a little over 24hrs and have not seen this issue occur. We actually triggered a series of failovers and the application has been able to recover successfully.

Thanks for the quick turnaround!

jon-whit commented 1 year ago

@jackc any reason not to issue a new release with puddle v2.1.0?

jackc commented 1 year ago

> any reason not to issue a new release with puddle v2.1.0?

puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

hochhaus commented 1 year ago

> > any reason not to issue a new release with puddle v2.1.0?
>
> puddle v2.1.0 requires Go 1.19 and pgx needs to support Go 1.18 until Go 1.20 is released.

Thanks for your work @jackc.

I understand the rationale, but this is unfortunate. Should a clear note be added to the README telling users of v5 pgxpool that they either need to use puddle v2.1.0 or stick with v4 until Go 1.20 is released?

jackc commented 1 year ago

I've reverted the change in puddle that requires Go 1.19, at the expense of adding a dependency on go.uber.org/atomic. I had hoped to avoid that, but given that v2.1.0+ has an important fix, it's worth allowing a dependency for now.

pgx v5.1.0 now depends on the puddle version with the fix.
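(For context: Go 1.19 added typed atomics such as atomic.Bool, atomic.Int64, and atomic.Pointer[T] to sync/atomic, while go.uber.org/atomic provides equivalents that build on older toolchains. An illustrative sketch of the substitution, not puddle's actual code:)

package main

import (
    "fmt"

    "go.uber.org/atomic" // builds on Go 1.18
)

func main() {
    // On Go 1.19+ this could be the stdlib's `var closed atomic.Bool`
    // from sync/atomic; here we use the uber-go equivalent instead.
    closed := atomic.NewBool(false)
    closed.Store(true)
    fmt.Println(closed.Load()) // true
}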