google / trillian

A transparent, highly scalable and cryptographically verifiable data store.
Apache License 2.0

Signer should take health into account during election process #687

Open rolandshoemaker opened 7 years ago

rolandshoemaker commented 7 years ago

If trillian_log_signer is unhealthy (i.e. it is unable to connect to the configured MySQL node), it will still attempt to become the sequencer master via etcd. In this case it probably shouldn't try to become the master, since another node may be able to do the work that the unhealthy node could not do if elected.

daviddrysdale commented 7 years ago

Have you seen this happening? I'd have thought the database connectivity check at startup would catch the specific case you mention; are there other unhealthy scenarios?

rolandshoemaker commented 7 years ago

We've seen this in our environment, where we configure a signer to talk directly to a single DB node that then drops out of the Galera cluster for whatever reason. That said, perhaps the correct solution is implementing #546 or encouraging the use of some kind of external health-aware query LB.

jsha commented 6 years ago

Ping on this?

rolandshoemaker commented 6 years ago

Note: due to codership/galera#491 this is no longer super important for us. Since we use single-node writes, if the master database node goes down, all of the signer nodes will be equally broken.

That said, if Galera changes this behavior (looks like they may actually be working on it :tada:), it'd still be nice to have something like this.

daviddrysdale commented 6 years ago

@rolandshoemaker, can I check: would ReadOnlyLogStorage.CheckDatabaseAccessible() return an error when your signer node is unhealthy? (If so, I guess we can use that in the election loop in server/log_operation_manager.go.)

rolandshoemaker commented 6 years ago

I'll try this out, but from a quick look at how it works, I'm pretty sure using ReadOnlyLogStorage.CheckDatabaseAccessible() would do what we want here.

Martin2112 commented 6 years ago

Somewhat related: the above status is published via HTTP on /healthz, so this could be detected through monitoring.

It still makes sense that an unhealthy node should withdraw from the election.

pav-kv commented 3 years ago

Blocked on #1640.