temporalio / temporal

Temporal service
https://docs.temporal.io
MIT License

Reconnect to SQL databases when connections fail #5926

Closed tdeebswihart closed 2 weeks ago

tdeebswihart commented 2 weeks ago

What changed?

Both our PostgreSQL and MySQL database backends will now automatically reconnect to the database when certain errors occur. Every error on the list was observed while testing this behavior through an AWS Aurora RDS failover of either MySQL or PostgreSQL.

For both backends we will reconnect when we see:

For PostgreSQL we will also reconnect on the following SQLStates:

For MySQL we will also reconnect when we see the following error codes:

This logic is easy to extend if we discover more failure modes over time.
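As a rough illustration of the approach described above, here is a minimal Go sketch of an error classifier plus a wrapper that reconnects when the classifier fires. It is not the code in this PR. The names `needsReconnect`, `runAndMaybeReconnect`, `sqlStateError`, and `fakePGError` are hypothetical, and the specific conditions checked (`driver.ErrBadConn`, `io.EOF`, and PostgreSQL SQLSTATE class 08, "connection exception") are assumptions chosen for the example, since the actual lists are not shown here. The sketch also assumes the failed operation is not retried; it only reconnects and returns the original error to the caller.

```go
package main

import (
	"database/sql/driver"
	"errors"
	"fmt"
	"io"
	"strings"
)

// sqlStateError is a hypothetical interface for driver errors that expose a
// PostgreSQL SQLSTATE. It is an assumption made for this sketch.
type sqlStateError interface {
	error
	SQLState() string
}

// fakePGError is a stand-in driver error used only for this demonstration.
type fakePGError struct{ state string }

func (e fakePGError) Error() string    { return "pg error " + e.state }
func (e fakePGError) SQLState() string { return e.state }

// needsReconnect reports whether err suggests the connection is unusable.
// The conditions below are illustrative, not the list used by the PR.
func needsReconnect(err error) bool {
	if err == nil {
		return false
	}
	// Conditions that are not specific to one backend.
	if errors.Is(err, driver.ErrBadConn) || errors.Is(err, io.EOF) {
		return true
	}
	// Backend-specific check: SQLSTATE class 08 is "connection exception".
	var se sqlStateError
	if errors.As(err, &se) {
		return strings.HasPrefix(se.SQLState(), "08")
	}
	return false
}

// runAndMaybeReconnect runs op. If op fails with a reconnect-worthy error,
// it calls reconnect once. The original error is still returned so the
// caller decides whether retrying the operation is safe.
func runAndMaybeReconnect(op func() error, reconnect func() error) error {
	err := op()
	if needsReconnect(err) {
		if rerr := reconnect(); rerr != nil {
			return fmt.Errorf("reconnect failed: %v (original error: %w)", rerr, err)
		}
	}
	return err
}

func main() {
	reconnects := 0
	err := runAndMaybeReconnect(
		func() error { return fakePGError{state: "08006"} }, // simulated connection failure
		func() error { reconnects++; return nil },            // simulated reconnect
	)
	fmt.Println("error:", err, "reconnects:", reconnects)
}
```

Keeping the classification in one function is what makes this kind of logic easy to extend: supporting a newly discovered failure mode means adding one more condition to `needsReconnect`.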

Why?

We've had multiple community reports of Temporal problems during RDS failover. One cause is that we wouldn't necessarily reconnect: we were at the mercy of our chosen SQL abstraction's connection pooling logic.

How did you test it?

I manually tested this functionality against repeated RDS failovers. I'm working through the following combinations by hand:

Automated testing will be added to our regular testing pipelines once our infrastructure friends have added the support I need (it's in progress).

Potential risks

We're concerned there's a correctness issue in our PostgreSQL backend that's related to our behavior during an RDS failover. If we merge this before I figure out what's going on, we could hide the issue and make it harder to reproduce.

Documentation

N/A

Is hotfix candidate?

Yes?

tdeebswihart commented 2 weeks ago

My first test (PostgreSQL 12 with the pq driver) successfully handled an RDS failover. All processes reconnected to the new writer during the failover.