jackc/pgx — PostgreSQL driver and toolkit for Go

Pool triggerHealthCheck leaves stranded goroutines for 500ms #1641

Open · bgentry opened this issue 1 year ago

bgentry commented 1 year ago

Describe the bug

After releasing all conns back to a pgxpool.Pool and closing the pool, there can still be goroutines left running. That's because triggerHealthCheck gets called upon Release(), and it spawns a goroutine that begins by sleeping 500ms no matter what.

This means there is no way to guarantee that a pgxpool is fully shut down with no remaining remnants, at least not without sleeping a non-trivial amount of time.

To Reproduce

I can try to put together a full reproduction if it helps; a rough sketch follows the list below. I believe this is triggered by:

  1. Running goleak in your test program
  2. Starting up a pgxpool
  3. Acquiring a conn
  4. Releasing the conn
  5. Closing the pool
  6. Observing that goleak sometimes detects open/leaked goroutines when the program exits.
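
A minimal sketch of that reproduction, assuming a reachable database at DATABASE_URL (the package name, test name, and env var below are placeholders, not from the original report):

package pool_test

import (
    "context"
    "os"
    "testing"

    "github.com/jackc/pgx/v5/pgxpool"
    "go.uber.org/goleak"
)

func TestPoolCloseLeavesNoGoroutines(t *testing.T) {
    // goleak fails the test if any unexpected goroutine is still running
    // when the test ends.
    defer goleak.VerifyNone(t)

    pool, err := pgxpool.New(context.Background(), os.Getenv("DATABASE_URL"))
    if err != nil {
        t.Fatal(err)
    }

    conn, err := pool.Acquire(context.Background())
    if err != nil {
        t.Fatal(err)
    }
    conn.Release() // triggers triggerHealthCheck's goroutine and its 500ms sleep
    pool.Close()   // returns while that goroutine may still be sleeping
}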

Expected behavior

Closing a pgxpool.Pool should leave behind no remnants and no open goroutines.

Actual behavior

Because there is a hardcoded 500ms sleep that isn't interrupted when, for example, the pool closes sooner, goroutines are sometimes left running after the pool is closed.

Version

Additional context

I did not see any past issues discussing this, so I wanted to at least confirm that it was an intentional design choice and not merely an oversight or bug.

jackc commented 1 year ago

Because there is a hardcoded 500ms sleep that isn't interrupted when, for example, the pool closes sooner, goroutines are sometimes left running after the pool is closed.

But the goroutine terminates after that 500ms, right? Does this behavior cause a problem?

bgentry commented 1 year ago

@jackc I think the primary issues are:

  1. This goes against a Go programmer's expectation that when they close a resource, it is fully closed. This expectation is IMO strengthened by the docs for Pool.Close(), which imply that all resources have been released by the time the method returns.
  2. When using goleak in test suites to verify that no goroutines are being leaked by the code being tested, pgx generates false positives with one "leaked" goroutine per conn pool that was recently closed:

    goleak: Errors on successful test run: found unexpected goroutines:
    [Goroutine 21898 in state sleep, with time.Sleep on top of the stack:
    goroutine 21898 [sleep]:
    time.Sleep(0x1dcd6500)
    /opt/hostedtoolcache/go/1.20.4/x64/src/runtime/time.go:195 +0x135
    github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck.func1()
    /home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.3.1/pgxpool/pool.go:379 +0x31
    created by github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck
    /home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.3.1/pgxpool/pool.go:376 +0x8f
    
    Goroutine 21883 in state sleep, with time.Sleep on top of the stack:
    goroutine 21883 [sleep]:
    time.Sleep(0x1dcd6500)
    /opt/hostedtoolcache/go/1.20.4/x64/src/runtime/time.go:195 +0x135
    github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck.func1()
    /home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.3.1/pgxpool/pool.go:379 +0x31
    created by github.com/jackc/pgx/v5/pgxpool.(*Pool).triggerHealthCheck
    /home/runner/go/pkg/mod/github.com/jackc/pgx/v5@v5.3.1/pgxpool/pool.go:376 +0x8f
    ]

    In order to avoid these false positives, a user of both pgx and goleak needs to add (a) a 500ms+ sleep at the end of their tests, or (b) this exception when invoking goleak:

    var knownGoroutineLeaks = []goleak.Option{ //nolint:gochecknoglobals
        // This goroutine contains a 500ms uninterruptible sleep that may still
        // be running by the time the test suite finishes and cause a failure.
        goleak.IgnoreTopFunction("github.com/jackc/pgx/v5/pgxpool.(*Pool).backgroundHealthCheck"),
    }

    This exception is also prone to breaking whenever pgx's internals are refactored, so it's a hacky solution.
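
    For completeness, the exception would then be applied through goleak's documented entry points, e.g. suite-wide via TestMain (a sketch; the package name is a placeholder):

    package pool_test

    import (
        "testing"

        "go.uber.org/goleak"
    )

    // TestMain applies the exception to every test in the package.
    func TestMain(m *testing.M) {
        goleak.VerifyTestMain(m, knownGoroutineLeaks...)
    }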

From a brief look at the code in pool.go, I'm wondering if, instead of an uninterruptible time.Sleep(500*time.Millisecond), the goroutine in triggerHealthCheck() could be refactored to use a timer that is properly stopped and drained when that goroutine exits, selecting on both the timer and p.closeChan. Something like:

t := time.NewTimer(500 * time.Millisecond)

select {
case <-p.closeChan:
    // The pool closed before the timer fired: stop the timer, draining its
    // channel if it already fired, then exit without signaling a health check.
    if !t.Stop() {
        <-t.C
    }
    return
case <-t.C:
    // The timer fired first: request a health check without blocking.
    select {
    case p.healthCheckChan <- struct{}{}:
    default:
    }
}

This of course doesn't fully solve the problem that the goroutine could still be running for a brief moment after Pool.Close() returns; that would probably require a deeper refactor, particularly a way to confirm that this and any other background goroutines have exited.
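
One hedged sketch of that deeper refactor, assuming a hypothetical wg sync.WaitGroup field on Pool (not something pgx has today), so that Close blocks until the goroutine exits:

// Hypothetical: Pool gains a wg sync.WaitGroup field tracking background work.
func (p *Pool) triggerHealthCheck() {
    p.wg.Add(1)
    go func() {
        defer p.wg.Done()
        t := time.NewTimer(500 * time.Millisecond)
        defer t.Stop()
        select {
        case <-p.closeChan:
            return
        case <-t.C:
            select {
            case p.healthCheckChan <- struct{}{}:
            default:
            }
        }
    }()
}

func (p *Pool) Close() {
    close(p.closeChan)
    // ... close all connections, as Close does today ...
    p.wg.Wait() // guarantees no background goroutines survive Close
}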

I also see that elsewhere in the code you're doing something similar to the above pattern, except using <-time.After(). This is another place where a goroutine can be left in a leaked state for up to 500ms. It could also benefit from the above pattern, or you could extract it into a slightly cleaner form with something like this: https://go.dev/play/p/pH0jHo2RZaL
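
Roughly, that extracted form could be a small helper like the following (a sketch; sleepOrClosed is a placeholder name, not necessarily what the playground link contains):

// sleepOrClosed sleeps for d, but returns early if done is closed first.
// It reports whether the full duration elapsed.
func sleepOrClosed(d time.Duration, done <-chan struct{}) bool {
    t := time.NewTimer(d)
    defer t.Stop()
    select {
    case <-done:
        return false
    case <-t.C:
        return true
    }
}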

jackc commented 1 year ago

🤷‍♂️

This goes against a Go programmer's expectation that when they close a resource, it is fully closed. This expectation is IMO strengthened by the docs for Pool.Close(), which imply that all resources have been released by the time the method returns.

The docs say all connections are closed, and they are. The fact that health checks are implemented via a background goroutine is an implementation detail that isn't exposed through the public interface.

When using goleak in test suites to verify that no goroutines are being leaked by the code being tested, pgx generates false positives with one "leaked" goroutine per conn pool that was recently closed:

That is unfortunate, but I view it as akin to checking memory usage as soon as a value has gone out of scope. The GC may not have run yet. Once a variable is out of scope or a Pool is closed, it's not visible to the user application unless runtime introspection is used.

I don't mind if someone wants to refactor / improve this, but it's not something I personally plan on working on.

bgentry commented 1 year ago

@jackc I opened #1642 to at least try to improve these situations, although it doesn't fully resolve the potential goleak issue. At least the workaround is documented here in case anybody goes looking for it.

If, in spite of the PR, you don't feel this is worth fixing, no worries; feel free to close both this issue and the PR. Thanks for your great work on pgx regardless :v: