Closed: jamessewell closed this issue 1 year ago.
Ok, I'm pinging our primary stakeholders on this. Ping! @adamhjk @reset @cm @baumanj.
I think we all more or less agree that having this feature makes sense, but as it's a significant amount of work to implement, we'd prefer to get some more comments and thoughts shared here in a durable medium. As you get the time, please leave some comments with your thoughts!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
Still needed.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. We value your input and contribution. Please leave a comment if this issue still affects you.
This issue has been automatically closed after being stale for 400 days. We still value your input and contribution. Please re-open the issue if desired and leave a comment with details.
This is a placeholder issue based on discussions with @adamhjk and @reset.
At the moment, as far as I can tell, elections will (at least once 3246 is fixed) happen under the following circumstances:
All these amount to the `sup` health becoming poor on the leader (I think `alive=false` gets set?).

Given that Habitat is aiming to provide services, and given that it's being used for high availability, I think it needs to be able to trigger an election based on leader service health as well (if configured).
At the moment there is a `health_check` hook, which seems to be intended for use by monitoring and alerting solutions. This can return several values, which are mapped to health states.
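A minimal sketch of such a hook, assuming the conventional exit-code mapping (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, anything else a failed check); the endpoint URL is only a placeholder:

```sh
#!/bin/sh
# hooks/health_check -- minimal sketch, assuming the conventional exit-code
# mapping: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
# The endpoint below is a placeholder for whatever the service actually exposes.

if curl --fail --silent --max-time 2 http://localhost:8080/health > /dev/null; then
  exit 0   # service answered: report OK
else
  exit 2   # no answer: report CRITICAL
fi
```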
It would be great if one of the following could happen:

- the `health_check` hook was changed to allow a return value that forcefully departs a node (danger: what if the hook inadvertently passes through this retval from another binary)

As @reset pointed out when reasoning about this, it needs to be remembered that health is transient and electing a new leader is a major event:
I think to manage this there needs to be at least some sort of holdoff period (or number of failed checks?) before the forced departure, to allow the leader time to return to the cluster.
It would be ideal (but would add complexity I suppose) if this could be configured. This would allow users to align the holdoff time with their desired mean time to recovery / recovery point objective.
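For illustration only, here is a hypothetical sketch of the "number of failed checks" idea approximated inside the hook itself; the counter file, threshold, and endpoint are made up for the example and are not part of any existing Habitat interface:

```sh
#!/bin/sh
# Hypothetical holdoff implemented inside the hook: only report CRITICAL after
# THRESHOLD consecutive failed checks. Paths, values, and the endpoint are
# illustrative only.

THRESHOLD=3
COUNTER=/tmp/health_check_failures

if curl --fail --silent --max-time 2 http://localhost:8080/health > /dev/null; then
  rm -f "$COUNTER"
  exit 0                                  # healthy again: reset the counter, report OK
fi

FAILS=$(( $(cat "$COUNTER" 2>/dev/null || echo 0) + 1 ))
echo "$FAILS" > "$COUNTER"

if [ "$FAILS" -ge "$THRESHOLD" ]; then
  exit 2                                  # sustained failure: report CRITICAL
else
  exit 1                                  # within the holdoff window: report WARNING
fi
```

Whether reporting CRITICAL then actually triggers an election or a forced depart is exactly the behaviour this issue is asking for; the sketch only shows where a holdoff or failure threshold could live.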