I'd be interested to hear people's thoughts on this addition. It's intended to give Ops users an immediate indication that all is well in their cluster(s) when they receive a Slack notification about terminated spot instances, without them needing to look elsewhere or rely on additional monitoring.
I get the idea, but is the log output of this DaemonSet really the right place to "give Ops users an immediate indication"?
Wouldn't it make more sense to use a CloudWatch metric? E.g. if GroupDesiredCapacity != GroupInServiceInstances for longer than 5 minutes, then something has gone wrong with scaling.
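For concreteness, here's a minimal sketch of such an alarm using boto3 and CloudWatch metric math. This isn't part of the PR; the ASG name is a placeholder, and the ASG must have group metrics collection enabled for these metrics to exist. It alarms on a shortfall (desired > in-service), which is the failure mode that matters here:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "my-worker-asg"  # placeholder, not from this repo

cloudwatch.put_metric_alarm(
    AlarmName=f"{ASG_NAME}-under-capacity",
    # Five consecutive 60-second periods: only alarm if under capacity for 5 minutes.
    EvaluationPeriods=5,
    DatapointsToAlarm=5,
    ComparisonOperator="GreaterThanThreshold",
    Threshold=0,
    TreatMissingData="notBreaching",
    Metrics=[
        {
            # Metric math: how many instances the ASG is short of its target.
            "Id": "shortfall",
            "Expression": "desired - in_service",
            "Label": "Instances missing from ASG",
            "ReturnData": True,
        },
        {
            "Id": "desired",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupDesiredCapacity",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "in_service",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupInServiceInstances",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
    ],
    AlarmActions=[],  # add an SNS topic ARN here to get notified
)
```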
Hi @max-rocket-internet. Possibly not the right place, just another place. It's something I found myself wanting when I saw spot termination notices, so I wanted to write the code and put it out there in case others saw value in it. I agree with you that robust external monitoring of actual versus desired capacity should also be in place.
It makes perfect sense to me to limit this handler's tasks to the minimum it needs to do:
"I am dying, but there are X other workers still running."
Passing a tag name and value(s) will enable an additional AWS API query that counts similarly-tagged instances in the worker pool. The count will be included in the notification message.
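As a rough illustration of the kind of query described (assuming boto3; the helper name and example tag are made up, not the PR's actual code):

```python
import boto3

def count_tagged_workers(tag_key: str, tag_values: list[str]) -> int:
    """Count running EC2 instances carrying the given tag (hypothetical helper)."""
    ec2 = boto3.client("ec2")
    count = 0
    # Paginate in case the worker pool spans more than one API page.
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": tag_values},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            count += len(reservation["Instances"])
    return count

# e.g. count_tagged_workers("role", ["worker"]) -> number of surviving workers
# to interpolate into the "X other workers still running" message
```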