Closed: jamessewell closed this issue 1 year ago.
Ok, I'm pinging our primary stakeholders on this. Ping! @adamhjk @reset @cm @baumanj.
I think we all more or less agree that having this feature makes sense, but as it's a significant amount of work to implement, we'd prefer to get some more comments and thoughts shared here in a durable medium. As you get the time, please leave some comments with your thoughts!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Still needed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.
This is a placeholder issue based on discussions with @adamhjk and @reset.
At the moment, as far as I can tell, elections will (at least once 3246 is fixed) happen under the following circumstances:
All these amount to the `sup` health becoming poor on the leader (I think `alive=false` gets set?).

Given that Habitat is aiming to provide services, and given that it's being used for high availability, I think it needs to be able to trigger an election based on leader service health as well (if configured).
At the moment there is a `health_check` hook, which seems to be intended for use by monitoring and alerting solutions. This can return several values, which are mapped to health states.
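A minimal sketch of such a hook, assuming the conventional exit-code mapping (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, anything else a failed check); the endpoint URL is only a placeholder:

```sh
#!/bin/sh
# hooks/health_check -- minimal sketch, assuming the conventional exit-code
# mapping: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# The endpoint below is a placeholder for whatever the service actually exposes.

if curl --fail --silent --max-time 2 http://localhost:8080/health > /dev/null; then
  exit 0   # service answered: report OK
else
  exit 2   # no answer: report CRITICAL
fi
```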
It would be great if one of the following could happen:

- the `health_check` hook was changed to allow a return value that forcefully departs a node (danger: what if the hook inadvertently passes through this retval from another binary)

As @reset pointed out when reasoning about this, it needs to be remembered that health is transient and electing a new leader is a major event:
I think to manage this there needs to be at least some sort of holdoff period (or number of failed checks?) before the forced departure, to allow the leader time to return to the cluster.
It would be ideal (but would add complexity I suppose) if this could be configured. This would allow users to align the holdoff time with their desired mean time to recovery / recovery point objective.
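For illustration only, here is a hypothetical sketch of the "number of failed checks" idea approximated inside the hook itself; the counter file, threshold, and endpoint are made up for the example and are not part of any existing Habitat interface:

```sh
#!/bin/sh
# Hypothetical holdoff implemented inside the hook: only report CRITICAL after
# THRESHOLD consecutive failed checks. Paths, values, and the endpoint are
# illustrative only.

THRESHOLD=3
COUNTER=/tmp/health_check_failures

if curl --fail --silent --max-time 2 http://localhost:8080/health > /dev/null; then
  rm -f "$COUNTER"
  exit 0                                  # healthy again: reset the counter, report OK
fi

FAILS=$(( $(cat "$COUNTER" 2>/dev/null || echo 0) + 1 ))
echo "$FAILS" > "$COUNTER"

if [ "$FAILS" -ge "$THRESHOLD" ]; then
  exit 2                                  # sustained failure: report CRITICAL
else
  exit 1                                  # within the holdoff window: report WARNING
fi
```

Whether reporting CRITICAL then actually triggers an election or a forced depart is exactly the behaviour this issue is asking for; the sketch only shows where a holdoff or failure threshold could live.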