What changed?
Both our PostgreSQL and MySQL database backends now automatically reconnect to the database when certain errors occur. Every error listed below was observed while testing this behavior through an AWS Aurora RDS failover of either MySQL or PostgreSQL.
For both backends we will reconnect when we see:
ECONNRESET
ECONNABORTED
ECONNREFUSED
io.EOF
io.ErrUnexpectedEOF
database/sql/driver.ErrBadConn
For PostgreSQL we will also reconnect on the following SQLSTATEs:
25006: read-only transaction
57P03: cannot connect now
0A000: feature not supported, but ONLY when the message is "cannot set transaction read-write mode during recovery"
For MySQL we will also reconnect when we see the following error codes:
1040: too many connections
1792: read-only transaction (SQLSTATE 25006)
1836: running in read-only mode
This logic is easily extensible should we discover more failure modes over time.
Why?
We've had multiple community reports of Temporal problems during RDS failover. One part of this is the fact that we wouldn't necessarily reconnect; we were at the whims of our chosen SQL abstraction's connection pooling logic.
How did you test it?
I manually tested this functionality in the presence of repeated RDS failovers, covering the following combinations:
[x] postgres12 plugin with pq driver
[x] postgres12 plugin with pgx driver
[x] mysql plugin
Automated testing will be added to our regular testing pipelines once our infrastructure friends have added the support I need (it's in progress).
Potential risks
We're concerned there's a correctness issue in our PostgreSQL backend that's related to our behavior during an RDS failover. If we merge this before I figure out what's going on, we could hide the issue and make it harder to reproduce.
Documentation
N/A
Is hotfix candidate?
Yes?