habitat-sh / habitat

Modern applications with built-in automation
https://www.habitat.sh
Apache License 2.0
2.61k stars, 315 forks

[RFC] Election is never based on service health #3249

Closed jamessewell closed 1 year ago

jamessewell commented 7 years ago

This is a placeholder issue based on discussions with @adamhjk and @reset

At the moment, as far as I can tell, elections will (at least once 3246 is fixed) happen under the following circumstances:

All of these amount to the Supervisor's health becoming poor on the leader (I think alive=false gets set?).

Given that Habitat is aiming to provide services, and given that it's being used for high availability, I think it needs to be able to trigger an election based on the leader's service health as well (if configured).

At the moment there is a health_check hook, which seems to be intended for use by monitoring and alerting solutions. It can return several values, which are mapped as follows:

It would be great if one of the following could happen:

As @reset pointed out, when reasoning about this we need to remember that health is transient and electing a new leader is a major event:

To manage this, I think there needs to be at least some sort of holdoff period (or number of failed checks?) before the force depart, to allow the leader time to return to the cluster.

It would be ideal (but would add complexity I suppose) if this could be configured. This would allow users to align the holdoff time with their desired mean time to recovery / recovery point objective.
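To make the holdoff idea concrete, here is a minimal sketch in Python (not Habitat's implementation language, and not code from the Supervisor). It assumes the election is triggered when either a configurable number of consecutive failed checks is reached or the leader has been unhealthy for a configurable period, whichever comes first. The names `HoldoffPolicy`, `LeaderHealthTracker`, `max_failed_checks`, and `holdoff_seconds` are hypothetical and do not correspond to any existing Habitat setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HoldoffPolicy:
    """Hypothetical user-configurable thresholds (not real Habitat settings)."""
    max_failed_checks: int = 3     # consecutive failed health checks tolerated
    holdoff_seconds: float = 30.0  # how long the leader may stay unhealthy


class LeaderHealthTracker:
    """Tracks the leader's health checks and decides when to call an election."""

    def __init__(self, policy: HoldoffPolicy) -> None:
        self.policy = policy
        self.failed_checks = 0
        self.first_failure_at: Optional[float] = None

    def record(self, healthy: bool, now: float) -> bool:
        """Record one check result; return True if an election should start."""
        if healthy:
            # Health is transient: a single good check resets the holdoff.
            self.failed_checks = 0
            self.first_failure_at = None
            return False

        if self.first_failure_at is None:
            self.first_failure_at = now
        self.failed_checks += 1

        too_many = self.failed_checks >= self.policy.max_failed_checks
        too_long = (now - self.first_failure_at) >= self.policy.holdoff_seconds
        return too_many or too_long
```

With a policy of 3 failed checks and a 30 second holdoff, two failures followed by a healthy check would reset the counter, so a brief blip on the leader would not cause an election. Whether a recovery should reset the counter completely, and whether both thresholds should be required rather than either one, are design choices this sketch does not settle.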

eeyun commented 6 years ago

OK, I'm pinging our primary stakeholders on this. Ping! @adamhjk @reset @cm @baumanj.

I think we all agree that this feature makes sense, but as it is a significant amount of work to implement, we'd prefer to get more comments and thoughts shared here in a durable medium. As you get the time, please leave some comments with your thoughts!

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

christophermaier commented 4 years ago

Still needed.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.

stale[bot] commented 1 year ago

This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.