I'd be interested to hear people's thoughts on this addition. It's intended to give Ops users an immediate indication that all is well in their cluster(s) when they receive a Slack notification about terminated spot instances, without them needing to look elsewhere or rely on additional monitoring.
I get the idea, but is the log output of this DaemonSet really the right place to "give Ops users an immediate indication"?
Wouldn't it make more sense to use a CloudWatch metric? E.g. if GroupDesiredCapacity != GroupInServiceInstances for longer than 5 minutes, then something has gone wrong with scaling.
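For concreteness, here's a minimal sketch of such an alarm using boto3 and CloudWatch metric math. This isn't part of the PR; the ASG name is a placeholder, and the ASG must have group metrics collection enabled for these metrics to exist. It alarms on a shortfall (desired > in-service), which is the failure mode that matters here:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

ASG_NAME = "my-worker-asg"  # placeholder, not from this repo

cloudwatch.put_metric_alarm(
    AlarmName=f"{ASG_NAME}-under-capacity",
    # Five consecutive 60-second periods: only alarm if under capacity for 5 minutes.
    EvaluationPeriods=5,
    DatapointsToAlarm=5,
    ComparisonOperator="GreaterThanThreshold",
    Threshold=0,
    TreatMissingData="notBreaching",
    Metrics=[
        {
            # Metric math: how many instances the ASG is short of its target.
            "Id": "shortfall",
            "Expression": "desired - in_service",
            "Label": "Instances missing from ASG",
            "ReturnData": True,
        },
        {
            "Id": "desired",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupDesiredCapacity",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
        {
            "Id": "in_service",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/AutoScaling",
                    "MetricName": "GroupInServiceInstances",
                    "Dimensions": [{"Name": "AutoScalingGroupName", "Value": ASG_NAME}],
                },
                "Period": 60,
                "Stat": "Average",
            },
            "ReturnData": False,
        },
    ],
    AlarmActions=[],  # add an SNS topic ARN here to get notified
)
```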
Hi @max-rocket-internet. Possibly not the right place, just another place. It's something I found myself wanting when I saw spot termination notices, so I wanted to write the code and put it out there in case others saw value in it. I agree with you that robust external monitoring of actual versus desired capacity should also be in place.
It makes perfect sense to me to limit this handler's tasks to the minimum it needs to do:
"I am dying, but there are X other workers still running."
Passing a tag name and value(s) will enable an additional AWS API query that counts similarly-tagged instances in the worker pool. The count will be included in the notification message.
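As a rough illustration of the kind of query described (assuming boto3; the helper name and example tag are made up, not the PR's actual code):

```python
import boto3

def count_tagged_workers(tag_key: str, tag_values: list[str]) -> int:
    """Count running EC2 instances carrying the given tag (hypothetical helper)."""
    ec2 = boto3.client("ec2")
    count = 0
    # Paginate in case the worker pool spans more than one API page.
    for page in ec2.get_paginator("describe_instances").paginate(
        Filters=[
            {"Name": f"tag:{tag_key}", "Values": tag_values},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    ):
        for reservation in page["Reservations"]:
            count += len(reservation["Instances"])
    return count

# e.g. count_tagged_workers("role", ["worker"]) -> number of surviving workers
# to interpolate into the "X other workers still running" message
```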