Destructive Postgres error codes should empty the pool

avanov commented 1 year ago

Hi, Nikita

There are situations when all existing connections in a pool have to be dropped and re-acquired again, because Postgres has a class of unrecoverable errors that make all open connections invalid. They originate on a DB server side and all clients have to actively manage these cases, as otherwise the connections will fail in a loop, and depending on the pool size it may lead to connection throttling due to error rate metrics in SaaS DB deployments. AWS RDS in particular is famous for two things: throwing "PAM Authentication failed" connection errors once your service starts hitting the DB with queries that cannot succeeed, and enforcing maintenance mode DB restarts/upgrades that client applications using connection pools should be able to recover from without restart/redeployment.

Python's SQLalchemy mentions it vaguely in one of its API sections, but if you look at the sources of its Postgres adapter you'll find that there's a hilariously stringly decision-making going on under the hood. This is precisely for the reason described above.

We can do it better, thanks to PostgreSQL.ErrorCodes. In one of my production services I have a collection of the following destructive error codes with a few exceptional cases not covered by PostgreSQL.ErrorCodes:


-- https://www.postgresql.org/docs/9.4/errcodes-appendix.html
defaultDestructiveCodes :: [PGErr.ErrorCode]
defaultDestructiveCodes = [PGErr.admin_shutdown, PGErr.crash_shutdown, PGErr.cannot_connect_now]

-- | these are the templates of strings that should be checked for in provided textual reasons of the returned errors
--   before we decide that it's safe to keep the existing connections in the pool
doNotEmptyPoolCriteria :: [ByteString]
doNotEmptyPoolCriteria = ["PAM authentication", "timeout expired"]

It's supposed to work like this, I hope the comments clarify it further:


result <- query
case result of
    Left err -> case err of
        ConnectionError (Just reason) -> do
            -- this section isn't encoded by Postgres Error codes and oftentimes is vendor-specific,
            -- so we use empirically collected set of substrings of error reasons that may indicate
            -- non-destructive behaviour. For instance AWS's "PAM authentication" errors shouldn't
            -- invalidate all open connections, they just need to be back-pressured by the client app
            -- using the pool, so that AWS is happy with its internal rate-limiters.
            -- Dropping all existing connections in this case would just make the situation worse.
            unless (any (`isInfixOf` reason) doNotEmptyPoolCriteria)
                <RELEASE_THE_POOL>

        -- No reason was provided by the DB server. It'd be safer to release the entire connection pool
        ConnectionError Nothing -> do
            <RELEASE_THE_POOL>

        -- https://hackage.haskell.org/package/hasql-1.4.4.2/docs/Hasql-Session.html#t:CommandError
        -- The same as above but this time libpq may complain due to a socket drop or a similar inconsistency,
        -- so we clear the pool in this situation as well
        (SessionError (QueryError tpl params (ClientError cerr))) -> do
            <RELEASE_THE_POOL>

        -- https://hackage.haskell.org/package/hasql-1.4.4.2/docs/Hasql-Session.html#t:ResultError
        (SessionError (QueryError tpl params (ResultError (ServerError code msg details hint)))) -> do
                -- one of the error codes may indicate a destructive change on the DB server that
                -- should invalidate all open connections
                when (code `elem` destructiveCodes)
                    <RELEASE_THE_POOL>

        -- result errors other than 'ServerError' indicate that the DB communicated back correctly but there's an error
        -- either with the query or with the return value.
        (SessionError (QueryError tpl params (ResultError rerr))) -> do
            <PROCEED_NORMALLY>

I propose the pool constructor to accept one more callable predicate that would allow users to provide a similar set of rules for matching against failed queries, that would allow clearing the pool completely in case there's a destructive error code from the DB server. I don't think it has to be hardcoded as SQLAlchemy did, as different SaaS DB vendors have different rate-limiting criteria and error responses and it would be nice to give end users a freedom to pick between their own custom strategy and something pre-defined by the library. I think the error codes above make for a nice default implementation of the predicate that I've successfully used in production on AWS RDS for a while.

nikita-volkov commented 1 year ago

Thanks for detailed info on this!

It doesn't seem like there is a need to make this a configuration parameter. Instead have you considered implementing this as a custom wrapper of Pool's use function on your end? Something like the following:

useAutoResetting :: Pool -> Session a -> IO (Either UsageError a)
useAutoResetting pool session =
  Pool.use pool session >>= \case
    Right result -> return $ Right result
    Left err -> if errIsDestructive err
      then do
        Pool.release pool
        return $ Left err
      else return $ Left err
  where
    errIsDestructive =
      error "TODO"

avanov commented 1 year ago

Yes, that's what I'm doing at the moment. I don't mind continuing with it, but because I am looking for a way to migrate from my current reliance on resource-pool, I thought this would be nice to share along the way. It could be a separate package on top, but the pervasiveness of the issue suggests that all users may actually want to have it at all times without knowing it yet, and so it may be of a direct benefit for hasql-pool to have it. Java's most popular HikariCP has it too.

Not insisting, just a food for thought and a reference point for search engines, in case someone else is looking for solutions for their problems.

nikita-volkov commented 1 year ago

Given that there's no universal way of classifying such errors, this will result in complication of the configuration. Most users won't be happy about that. Given also the fact that this is resolvable with a wrapper, it doesn't seem necessary to expand the area of responsibility of "hasql-pool" by bundling such solution. Given also how trivial it seems to implement such a wrapper, to me it looks best to just keep it at the application-level or pack it into some extension or wrapper lib. Bottom line, I'd rather not take on that as a maintenance responsibility as part of this library.

nikita-volkov commented 1 year ago

Closing since the discussion stopped. Feel free to reopen.

avanov commented 1 year ago

Thanks. I may publish a separate package for this later.

nikita-volkov / hasql-pool

Destructive Postgres error codes should empty the pool #27