Failing to reconnect to database on instance size change

nikita-volkov / hasql-pool

A pool of connections for Hasql

http://hackage.haskell.org/package/hasql-pool

MIT License


Failing to reconnect to database on instance size change #15

Closed istathar closed 2 years ago

istathar commented 2 years ago

We don't have a lot of evidence for this, but we encountered a very strange outage today when we resized an Amazon RDS instance, and wanted to let you know.

The Haskell service talking to that database suddenly started chucking 100% errors due to failed queries and transactions.

The problem went away when we restarted the Haskell program.

Our legacy system (which uses a different Postgres library) survived changes to the RDS instances with no problem. We speculate that it either noticed the failed connections and reconnected or, worse, was reconnecting all the time (stupid, but it would potentially work around this issue).

In the new system we're using hasql via hasql-pool.

Is this of interest? If so we can certainly try to provide more details.

istathar commented 2 years ago

The telemetry shown here has us making the database change at about 10:05; shortly after it came back we started seeing the purple "no connection to server" errors.

The problem went away when we restarted the Haskell program. Weird, right?

So we're wondering whether there is some condition we need to detect, so that we can manually remove connections from the pool when we see it?

nikita-volkov commented 2 years ago

Which version of pool are you using? Have you tried the latest version?

istathar commented 2 years ago

Looks like we're on 0.5.2.2; we'll try upgrading to 0.7.2.1.

nikita-volkov commented 2 years ago

Has this been resolved?

periodic commented 2 years ago

I was just digging into this. We were also on 0.5.2.2.

I found that Hasql gives out "no connection to the server" errors as ClientError which is a type of CommandError and gets emitted as a QueryError. hasql-pool does not destroy the resource when it sees QueryErrors (because you don't want to drop the connection on every failed query) and so it never drops connections when they can't connect.

I'll also see if we can update and whether the current logic is better about this.
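
The classification described in this comment can be sketched roughly as follows. This is a minimal illustration, not hasql-pool's actual code: the types below are simplified local stand-ins that only mirror the shape described above (a QueryError wrapping a CommandError, which is either a ClientError or some server-reported error), and `Verdict`, `classify` and `judge` are hypothetical names. It also assumes that a client-side error means the connection is no longer usable.

```haskell
-- Simplified stand-ins for the error shapes described in the comment.
-- These are NOT imported from hasql; the real types carry different fields.
data CommandError
  = ClientError (Maybe String) -- client-side failure, e.g. "no connection to the server"
  | ResultError String         -- the server answered, but reported an error
  deriving (Show, Eq)

-- Statement text, rendered parameters, and the underlying command error.
data QueryError = QueryError String [String] CommandError
  deriving (Show, Eq)

-- Hypothetical policy type: what a pool should do with a connection
-- after a session has finished on it.
data Verdict = KeepConnection | DiscardConnection
  deriving (Show, Eq)

-- Assumption: a client-side error means the connection itself is broken,
-- while a server-reported error leaves the connection reusable.
classify :: QueryError -> Verdict
classify (QueryError _ _ (ClientError _)) = DiscardConnection
classify (QueryError _ _ (ResultError _)) = KeepConnection

-- Decide the fate of a connection from the outcome of a session.
judge :: Either QueryError a -> Verdict
judge (Left err) = classify err
judge (Right _)  = KeepConnection
```

Under this policy an ordinary failed query keeps its connection in the pool, while a "no connection to the server" failure marks the connection for replacement instead of handing it out again.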

istathar commented 2 years ago

> Has this been resolved?

@nikita-volkov Not sure. We've upgraded to 0.7.2.1 and our system continues to perform great, so thank you! But we haven't had a database infrastructure change event yet, so we can't confirm that the bug is cleared.

@periodic, nice analysis! We didn't get that deep when looking at this.

I'd say go ahead and close this if you're comfortable you understand why hasql-pool was encountering it. Thanks so much Nikita.

nikita-volkov commented 2 years ago

> I was just digging into this. We were also on 0.5.2.2.
>
> I found that Hasql gives out "no connection to the server" errors as ClientError which is a type of CommandError and gets emitted as a QueryError. hasql-pool does not destroy the resource when it sees QueryErrors (because you don't want to drop the connection on every failed query) and so it never drops connections when they can't connect.
>
> I'll also see if we can update and whether the current logic is better about this.

This is exactly the issue that the latest releases resolve. See #6.

Thanks guys. Closing this as resolved. Feel free to reopen in case your issues remain.