rolandshoemaker opened 7 years ago
Have you seen this happening? I'd have thought the database connectivity check at startup would catch the specific case you mention; are there other unhealthy scenarios?
We've seen this in our environment, where we configure a signer to talk directly to a single DB node that then drops out of the Galera cluster for whatever reason. That said, perhaps the correct solution is to implement #546, or to encourage the use of some kind of external health-aware query load balancer.
Ping on this?
Note: due to codership/galera#491 this is no longer super important for us: since we use single-node writes, if the master database node goes down, all of the signer nodes will be equally broken.
That said, if Galera changes this behavior (it looks like they may actually be working on it :tada:), it would still be nice to have something like this.
@rolandshoemaker, can I check: would `ReadOnlyLogStorage.CheckDatabaseAccessible()` return an error when your signer node is unhealthy? (If so, I guess we can use that in the election loop in `server/log_operation_manager.go`.)
I'll try this out, but taking a quick look at how it works, I'm pretty sure using `ReadOnlyLogStorage.CheckDatabaseAccessible()` would do what we want here.
Somewhat related: the above status is published via HTTP on /healthz, so this could also be detected through external monitoring.
It still makes sense that an unhealthy node should withdraw from the election.
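Withdrawing could look roughly like this: a node that is already master resigns as soon as a health probe fails, freeing the mastership for a healthy peer. The `election` type and its `Resign` method are hypothetical stand-ins loosely modeled on etcd's `concurrency.Election`, not the actual API used here.

```go
package main

import "fmt"

// election is a minimal stand-in for an etcd-backed mastership handle;
// Resign is hypothetical, loosely mirroring concurrency.Election.Resign.
type election struct{ master bool }

func (e *election) Resign() { e.master = false }

// maintainMastership resigns mastership as soon as the health probe
// fails, so another (healthy) signer can win the next election round.
func maintainMastership(e *election, healthy func() bool) {
	if e.master && !healthy() {
		e.Resign()
	}
}

func main() {
	e := &election{master: true}
	maintainMastership(e, func() bool { return false }) // DB check fails
	fmt.Println(e.master) // false: withdrew from mastership
}
```

In the real loop this would run on each tick alongside the campaign gating, so one health signal drives both "don't run" and "stand down".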
Blocked on #1640.
If `trillian_log_signer` is unhealthy (i.e. it is unable to connect to the configured MySQL node), it will still attempt to become the sequencer master via etcd. In this case it probably shouldn't try to become the master, since another node may actually be able to do the work, which it cannot do if elected.