uber / cadence

Cadence is a distributed, scalable, durable, and highly available orchestration engine to execute asynchronous long-running business logic in a scalable and resilient way.
https://cadenceworkflow.io
MIT License

[sql connector] Please recreate sql connection on connection failure #3071

Closed hankduan closed 5 days ago

hankduan commented 4 years ago

Summary: We had an incident in which we failed over our MySQL instance, so the CNAME started resolving to a different IP, but Cadence did not recover until we restarted it manually. We later found that this is because Cadence uses go-sql-driver, which does not re-establish the SQL connection on failure. Given that this is a common failure mode, can Cadence recreate the SQL connection on connection failure?

Here's the Slack chat thread:

Regarding the MySQL connector for Cadence: we had an incident where we failed over the MySQL instance, so the CNAME resolved to another IP at that time, but Cadence did not recover. Does Cadence's MySQL connector re-resolve the CNAME?

Maxim 23 hours ago Filed https://github.com/go-sql-driver/mysql/issues/1064

Hank Duan 23 hours ago Thanks!

Maxim 21 hours ago Did Cadence recover after restart?

Hank Duan 17 hours ago yea

Hank Duan 17 hours ago but it required a manual restart

Maxim 7 hours ago Then I believe it is a DB driver issue

Hank Duan 2 hours ago Looks like they will not address it on the driver side. Are there any plans for Cadence to re-establish the connection on failures?

Hank Duan 2 hours ago this is a very common failure mode

Hank Duan 2 hours ago i.e. a DB failover causes the CNAME to switch. The original IP might no longer be reachable, and connections will fail indefinitely until Cadence re-resolves the CNAME

Maxim 1 hour ago Please file an issue

jontro commented 4 years ago

Just to chime in here: the issue could be caused by an idle timeout that is too long.

See http://go-database-sql.org/connection-pool.html and this linked issue https://github.com/go-sql-driver/mysql/issues/257

Changing maxConnLifetime should be the first thing to try. AFAIK the connection manager should take care of retries. DNS records must of course be configured with a low TTL.

sheerun commented 4 years ago

Could this be handled at the Cadence level by specifying multiple DNS addresses to connect to, e.g. the addresses of the primary master and the secondary master?