Reconnect to SQL databases when connections fail

What changed?

Both our PostgreSQL and MySQL database backends now automatically reconnect to the database when certain errors occur. Every error listed below was observed while testing this behavior through an AWS Aurora RDS failover of either MySQL or PostgreSQL.

For both backends we will reconnect when we see:

ECONNRESET
ECONNABORTED
ECONNREFUSED
io.EOF
io.ErrUnexpectedEOF
database/sql/driver.ErrBadConn

For PostgreSQL, we will also reconnect on the following SQLSTATEs:

25006 read-only transaction
57P03 cannot connect now
0A000 feature not supported, but ONLY when the message is "cannot set transaction read-write mode during recovery"

For MySQL, we will also reconnect when we see the following error codes:

1040 too many connections
1792 read-only transaction (SQLSTATE 25006)
1836 running in read-only mode

This logic is easy to extend should we discover more failure modes over time.

Why?

We've had multiple community reports of Temporal problems during RDS failover. One part of this is the fact that we wouldn't necessarily reconnect; we were at the whims of our chosen SQL abstraction's connection pooling logic.

How did you test it?

I manually tested this functionality in the presence of repeated RDS failovers, covering the following combinations:

[x] postgres12 plugin with pq driver
[x] postgres12 plugin with pgx driver
[x] mysql plugin

Automated testing will be added to our regular testing pipelines once our infrastructure friends have added the support I need (this is in progress).

Potential risks

We're concerned there's a correctness issue in our PostgreSQL backend that's related to our behavior during an RDS failover. If we merge this before I figure out what's going on, we could hide the issue and make it harder to reproduce.

Documentation

N/A

Is hotfix candidate?

Yes?
