jackc / pgx

PostgreSQL driver and toolkit for Go
MIT License
10.91k stars 849 forks source link

Spurious conn busy - bug in QueryRow() ? #635

Open kmansoft opened 5 years ago

kmansoft commented 5 years ago

Hello,

I've just started using your library, pretty impressed, esp. with JSON handling.

I'm seeing spurious conn busy errors on my development machine. Once they start, they never stop until I restart the app.

The app uses pgx basically from two modules, one "does something" and stores "something" into pgsql, and another is an http API which fetches this data from pgsql. The failures always occur on the http side (which obviously uses separate coroutines from the first, "does something" part).

How can I track down what's causing these errors? I already checked my code for any leaked row sets or transactions, there don't seem to be none.

Second, related to the above. What is the default connection pool size if I create my connection like this?

pgsqlConfig, err := pgx.ParseConfig("postgres://user:pass@localhost:5432/db")
if err != nil {
    ...
}

pgsqlConnection, err := pgx.ConnectConfig(context.Background(), pgsqlConfig)
if err != nil {
    ...
}

And how can I change it (connection pool size)?

kmansoft commented 5 years ago

More info:

The query that fails for me, and it was the first query executed on a particular code path, is an UPSERT like this:

_, err := model.db.Exec(context.Background(),
    `INSERT INTO device (key, push_tech, push_token)
        VALUES($1, $2, $3) 
    ON CONFLICT (key) DO UPDATE
        SET push_tech = $2, push_token = $3`,
    key, pushTech, pushToken)
if err != nil {
    return err
}

fmt.Printf("Upserted device: %s\n", key)

return nil

I added a query just above that, like this:

rows, err := model.db.Query(context.Background(),
    `SELECT _id, key, push_tech, push_token
    FROM device WHERE key = $1`, key)
if err != nil {
    return err
}
defer rows.Close()

if rows.Next() {
    device := mail.SyncDbDevice{}
    err = rows.Scan(&device.Id, &device.Key, &device.PushTech, &device.PushToken)
    if err != nil {
        return err
    }

    fmt.Printf("Selected device: %d, %s, %s, %s\n", device.Id, device.Key, device.PushTech, device.PushToken)
}

return nil

So it looks like this:

http_handler() { query() // DEBUG upsert() ..... }

Whenever the upsert starts to fail, the query just above it executes just fine, no errors.

I suppose the upsert is different from the select in that it (potentially) modifies the database, while a select does not. Still, not sure how to solve the conn busy error, would appreciate help.

kmansoft commented 5 years ago

I did some debugging, adding trace printouts to PgConn.lock and PgConn.unlock.

I believe there is a bug in Conn.QueryRow - it calls Query behind the scenes which locks the connection to read the data, but I'm unable to find where QueryRow calls rows.Close(). Normally that's the pattern I understand: rows := conn.Query()... rows.Next() ... rows.Close() and rows.Close() unlocks the connection.

But QueryRow doesn't seem to Close the rows. Perhaps I've missed it, or perhaps it really is missing.

Please comment if I'm wrong, meanwhile I'll try switching from QueryRow to Query to see if the issue goes away.

Edit: don't mind this comment. The simple reproducer case below fails with Query() the same way it fails with QueryRow() - i.e. with "conn busy".

kmansoft commented 5 years ago

One more thing re: my earlier comment on Query continuing to work even when Upsert starting throwing "conn busy".

As it turns out, Query is missing error propagation when pgConn.lock() fails (with "conn busy").

The call stack: Query -> c.pgConn.ExecPrepared -> pgConn.execExtendedPrefix -> if err := pgConn.lock(); err != nil fails but the original Query call does not return this error, instead rows.Next silently fails (returns false).

kmansoft commented 5 years ago

A simple reproducer:

wg := sync.WaitGroup{}
wg.Add(10)
for coIter := 0; coIter < 10; coIter += 1 {
    go func() {
        for nowIter := 0; nowIter < 100; nowIter += 1 {
            var nowValue time.Time
            nowError := pgsqlConnection.QueryRow(context.Background(),
                "SELECT NOW()").Scan(&nowValue)
            if nowError != nil {
                fmt.Printf("QueryRow returned error %v\n", nowError)
                break
            }
            fmt.Printf("PG now() = %s\n", nowValue)
        }
        wg.Done()
    }()
}
wg.Wait()

You should see QueryRow return error conn busy - on my machine it happens 2/3 of the runs.

It also can throw a different error:

QueryRow returned error ERROR: bind message has 2 result formats but query has 1 columns (SQLSTATE 08P01)

Maybe something goes wrong in buffer / parameter management when the conn busy error is present.

durango commented 5 years ago

@kmansoft submitted a PR that should fix your issue. In the mean time, my current workaround was to acquire the pool manually for each query and then release it manually.

kmansoft commented 5 years ago

Thank you @durango I'll be looking forward to it getting merged.

In the meantime I'm wondering if I should switch from pgx connection to pgx pool. My simple test passes with pgxpool with no errors.

Edit: no, pgxpool also fails, please see below.

https://godoc.org/github.com/jackc/pgx/pgxpool

But coming from GO's "sql" package, I was expecting pgx to perform pooled connection management behind the scenes. Is this assumption wrong? Should I always be using a pgxpool in a multi-coroutine setup? Or is that not necessary? A pgx API design question really.

kmansoft commented 5 years ago

About pgxpool. It also fails on concurrent workloads.

pgsqlConfig, err := pgxpool.ParseConfig("postgres://user:pass@localhost:5432/database")
if err != nil {
    fmt.Println(err)
    return
}

pgxPool, err := pgxpool.ConnectConfig(context.Background(), pgsqlConfig)
if err != nil {
    fmt.Println(err)
    return
}

defer func() {
    fmt.Println("Disconnecting from pgsql")
    pgxPool.Close()
}()

count := int64(0)
wg := sync.WaitGroup{}
wg.Add(10)
for coIter := 0; coIter < 20; coIter += 1 {
    go func() {
        for nowIter := 0; nowIter < 1000; nowIter += 1 {
            var nowValue time.Time
            nowError := pgxPool.QueryRow(context.Background(),
                "SELECT NOW()").Scan(&nowValue)
            if nowError != nil {
                fmt.Printf("QueryRow returned error %v\n", nowError)
                os.Exit(1)
            }
            fmt.Printf("PG now() = %s\n", nowValue)
            atomic.AddInt64(&count, 1)
        }
        wg.Done()
    }()
}
wg.Wait()
fmt.Printf("Total queries run: %d\n", count)

Result:

QueryRow returned error closed pool

The trick to triggering is that the number of coroutines needs to be 20 (10 works with no error).

durango commented 5 years ago

I was using pgxpool with Query and QueryRow and still hitting the same problem as you (also noted about the 10 vs 20 goroutines on my own so our experience seems to be very much he same). My fix wont work because the pool gets released to early for QueryRow unfortunately.

durango commented 5 years ago

@kmansoft my current working solution is to drop down to pgx's stdlib and use stdlib.AcquireConn(*sql.DB) within goroutines (then defer close that connection). So far this has been working well in production

kmansoft commented 5 years ago

@durango thanks for the tip on AcquireConn... let's see what @jackc says about the issues with higher level APIs...

jackc commented 5 years ago

Sorry, it took me a while to get to this. There are a number of questions that have come up in this issue and I'll try to answer them all in this post.

What is the default connection pool size

4 (https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L15)

And how can I change it (connection pool size)?

Set MaxConns in the pgxpool.Config struct or set it in the connection string with the pool_max_conns argument. (https://github.com/jackc/pgx/blob/master/pgxpool/pool.go#L183)

But QueryRow doesn't seem to Close the rows. Perhaps I've missed it, or perhaps it really is missing.

The underlying query started by QueryRow is closed when Scan is called on the returned Row. (https://github.com/jackc/pgx/blob/master/rows.go#L87)

But coming from GO's "sql" package, I was expecting pgx to perform pooled connection management behind the scenes. Is this assumption wrong? Should I always be using a pgxpool in a multi-coroutine setup? Or is that not necessary? A pgx API design question really.

The pgx package itself is strictly connection based. If you want a pool you need to use pgxpool (or some other syncronizatin / pooling mechanism). From the docs for pgx.Conn (https://godoc.org/github.com/jackc/pgx#Conn) "Conn is a PostgreSQL connection handle. It is not safe for concurrent usage.".

@kmansoft Regarding the failing test case with pgxpool: The reason it is failing is the line wg.Add(10). You tell the wait group to wait for 10 goroutines to complete, but then start 20. So as soon as 10 or the 20 goroutines complete the main thread blocked on wg.Wait() continues and the defer that closes the pool executes. But there are still 10 goroutines trying to use the now closed pool. This explains why you are seeing a closed pool error. If you correct that is works.

As an aside, it would be less error prone to do a wg.Add(1) inside of the loop instead of wg.Add(20) on the outside. Then there is only one place to change the number of workers.

durango commented 5 years ago

For the record, I used max pool settings this entire time (ranged from 20 - 400). Checked long-living queries [> 5 minutes] (there were none).

The underlying query started by QueryRow is closed when Scan is called on the returned Row. (https://github.com/jackc/pgx/blob/master/rows.go#L87)

Saw this yesterday actually, forgot to close my PR / was trying to resolve my pgx issues.

Tbh, I haven't really found a working solution for pgx other than using it as a connection driver. I had to drop down to sql.DB and sql.Conn. Something, somewhere, down the line connections were not being released despite closing/releasing them manually.

Fwiw I was also using v4 of pgx and have noticed less problems on v3 for pooling, but that could just be biases in observations. I've also noticed less problems when I stopped using CopyFrom and manually inserted. No code examples (since I've moved on / don't have a git history of this development) but I promise I wasn't trying anything fancier than what appears in the test files (other than the stdlib.Acquire part but that was after).

@kmansoft by any chance, are you connecting to a pgbouncer server?

kmansoft commented 5 years ago

@jackc Thank you for pointing out that "plain" pgx is not intended to be used concurrently. My bad.

With wg in the pool test fixed the test now runs without issues. Thanks again.

@durango no, not pgbouncer, just "plain" pg.

sebnyberg commented 5 years ago

Ran into the same issue when closing the acquired connection from stdlib but not properly closing pgx.BatchResults after running SendBatch.

Zamiell commented 4 years ago

This was a good read, thanks.

For anyone coming across this thread from Google, the TLDR here is that in PostgreSQL land, you need to use either:

on top of pgx, or else you will get the dreaded "conn busy" error. This is because, as jackc mentions above, the default connection size is 4. And increasing the connection size it is a bad idea because vanilla pgx is only designed to be used by one process/thread at a time.

These two blogs are recommended reading if you want to know more:

warent commented 4 years ago

It's embarrassing to admit this but I want to share why this problem happened to me in case anyone else encounters it. I have an HTTP server and was trying to re-use the same connection for all requests. If you're doing this, be sure to use a single root/main-level pool, and then acquire a new connection from the pool per incoming http request (or, if you're using GraphQL, a connection per resolver / dataloader).

holmbuar commented 4 years ago

@warent do you return a pointer to the connection from the single main-level pool connect function, or use a global var in a separate database package?

warent commented 4 years ago

You'll probably want to use a global pool var in a separate database package, and create connections from that wherever you need them.

I'm using GraphQL so my http handler and my pool are in the same file (main.go). When the graphql endpoint is called, I put a pointer to the pool in ctx which is passed into the resolvers, and then every graphql resolver creates its own connection as needed.

Both approaches seem to work fine but in retrospect a global variable would probably be cleaner.

bfontaine commented 2 years ago

Does it make sense to keep this issue open, or can we close it?

ringerc commented 1 year ago

Per my comment https://github.com/jackc/pgx/pull/750#issuecomment-1674032114 , take care before just blindly adopting pgbouncer as a proxy or reverse proxy between pgx and postgres. In session-pooling mode it won't help you much, and in transaction-pooling mode it changes the semantics of many postgres operations in ways that'll be very important for you to account for in your app's database usage. Even worse in statement-pooling mode, which is best avoided whenever possible.

If pgx works like other drivers in row-at-a-time mode, apps should try to consume the result-set promptly after a query, then close the result-set. The app should defer doing any expensive work (RPCs and other network I/O, nontrivial disk I/O, acquiring contended locks, etc) until after it reads the result and closes the result-set. Doing that should be enough to solve most problems unless the app is highly parallel and database-heavy.

teplinsky-maxim commented 1 month ago

If you came here to find out why your connection does "conn busy" always make sure you do rows.Close() after you read from them