keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
93 stars 28 forks

Hooks to run scripts at different stages #66

Open challapradyumna opened 3 years ago

challapradyumna commented 3 years ago

FEATURE REQUEST:

What happened: Draining and removing nodes from the ASG also requires us to update our monitoring systems to stop monitoring the node, plus a few other things, before the node can be safely removed from circulation.

What you expected to happen: If there were flags that let shell scripts run alongside lifecycle-manager at different stages, that would make lifecycle-manager extensible for various use cases.

eytan-avisror commented 3 years ago

Hi @challapradyumna, great idea and an interesting use case. Do you need to run a script on the controller, or on the terminating node?

A simple script implementation might be a bit problematic, since this is a service and not a controller and there is no custom resource, so the only interface is the flags passed in. Allowing arbitrary script execution might also have security implications.

I think one possible approach to have something secure and configurable that answers this use case is to use SSM send-command.

The user can then specify a pre-created script to execute via a flag, e.g. --ssm-finalize-script, which would invoke the script and wait for completion.

This would require users to integrate SSM on their AMIs, but would be easy to implement and relatively more secure, since the script is pre-created.
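A minimal sketch of what the send-and-wait step could look like, assuming a boto3-style SSM client is passed in. The function name `run_finalize_script`, its parameters, and the script path are hypothetical illustrations and not part of lifecycle-manager; `send_command`, `get_command_invocation`, and the `AWS-RunShellScript` document are existing SSM features.

```python
import time

# Terminal states reported by SSM GetCommandInvocation.
TERMINAL_STATES = {"Success", "Cancelled", "TimedOut", "Failed"}


def run_finalize_script(ssm, instance_id, script_path,
                        timeout=300, poll=5, sleep=time.sleep):
    """Run a pre-created script on the instance and wait for it to finish.

    `ssm` is assumed to behave like boto3.client("ssm"). Returns True only
    if the command reaches the "Success" state before `timeout` seconds.
    """
    resp = ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        # Only the path of a script already present on the AMI is sent,
        # not arbitrary script contents.
        Parameters={"commands": [script_path]},
    )
    command_id = resp["Command"]["CommandId"]

    waited = 0
    while waited <= timeout:
        # A real client may briefly raise InvocationDoesNotExist right
        # after send_command; that retry is omitted from this sketch.
        inv = ssm.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        if inv["Status"] in TERMINAL_STATES:
            return inv["Status"] == "Success"
        sleep(poll)
        waited += poll
    return False
```

With real credentials this might be called as `run_finalize_script(boto3.client("ssm"), "i-0abc", "/opt/finalize.sh")`, where both the instance ID and the path are placeholders.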

WDYT?

challapradyumna commented 3 years ago

At least for our use case, it's about muting the instance on Datadog and other third-party services; nothing needs to run on the instance itself.

Yes and no. The SSM integration makes sense, but it becomes a prerequisite for anyone who wants to use this feature.

I'm thinking more along the lines of calling a webhook or kicking off a job inside the cluster itself, sending the instance details as a parameter.
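To make the webhook idea concrete, here is a rough sketch using only the Python standard library. Everything in it is an assumption for illustration: the function names, the payload field names (`instanceId`, `nodeName`, `stage`), and the `blocking` switch, which simply chooses between failing on a bad response and treating the call as best-effort.

```python
import json
import urllib.request


def build_request(url, instance_id, node_name, stage):
    """Build a JSON POST carrying the instance details (field names are illustrative)."""
    payload = json.dumps(
        {"instanceId": instance_id, "nodeName": node_name, "stage": stage}
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify_webhook(url, instance_id, node_name, stage, timeout=10,
                   blocking=True, opener=urllib.request.urlopen):
    """Call the webhook and report whether termination may proceed.

    blocking=True:  proceed only on a 2xx response.
    blocking=False: best-effort, always proceed even if the call fails.
    """
    req = build_request(url, instance_id, node_name, stage)
    try:
        with opener(req, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        # urllib's URLError and HTTPError, as well as socket timeouts,
        # are all subclasses of OSError.
        ok = False
    return ok or not blocking
```

The `opener` parameter exists only so the network call can be swapped out; by default the standard `urllib.request.urlopen` is used.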

eytan-avisror commented 3 years ago

Are you referring to running this webhook inside the lifecycle-manager pod, or from the terminating instance? Is this supposed to block instance termination? Is it a 'best-effort' attempt, or do you need to validate the call response?

shrinandj commented 3 years ago

One other option would be to label the node (or add some annotation) when it is about to be drained and terminated. Most Kubernetes-aware tools allow for configuration based on that.

Would that help?
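For reference, labeling a node comes down to a small patch on its metadata. The sketch below assumes a client object shaped like `kubernetes.client.CoreV1Api()` from the official Python client, whose `patch_node(name, body)` method exists; the function name and the label key and value (`example.com/lifecycle-state=terminating`) are made up for illustration.

```python
def mark_node_terminating(core_v1, node_name,
                          key="example.com/lifecycle-state",
                          value="terminating"):
    """Label a node so label-aware tools can react before it is drained.

    `core_v1` is assumed to expose patch_node(name, body) like
    kubernetes.client.CoreV1Api. The label key and value are hypothetical.
    """
    # A merge-style patch: only the given label is added or updated,
    # existing labels on the node are left untouched.
    body = {"metadata": {"labels": {key: value}}}
    core_v1.patch_node(node_name, body)
    return body
```

Monitoring agents or other tools could then select on that label to mute or exclude the node.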

challapradyumna commented 3 years ago

I'm looking for something more like a flag, e.g. --post-drain project/mute-instance. This would run the specified container as a job. I get the issue with shell scripts; they become too much of a hassle to maintain in the long run.
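A rough sketch of how such a flag could map to a Kubernetes Job, written in Python purely for illustration. The `--post-drain` flag is the one proposed above and does not exist in lifecycle-manager; the function names, the `post-drain-` name prefix, and the `INSTANCE_ID` / `NODE_NAME` environment variables are assumptions. The manifest fields themselves follow the standard `batch/v1` Job schema.

```python
import argparse


def parse_args(argv):
    """Parse the proposed (hypothetical) --post-drain flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--post-drain",
        dest="post_drain",
        default=None,
        help="container image to run as a Job after a node is drained",
    )
    return parser.parse_args(argv)


def post_drain_job(image, instance_id, node_name):
    """Build a batch/v1 Job manifest that runs `image` once.

    Instance details are handed to the container as environment
    variables; the variable names are illustrative.
    """
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"generateName": "post-drain-"},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed hook
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "hook",
                            "image": image,
                            "env": [
                                {"name": "INSTANCE_ID", "value": instance_id},
                                {"name": "NODE_NAME", "value": node_name},
                            ],
                        }
                    ],
                }
            },
        },
    }
```

For example, `post_drain_job(parse_args(["--post-drain", "project/mute-instance"]).post_drain, "i-0abc", "node-1")` yields a manifest that could be submitted to the cluster; whether termination should wait for the Job to complete is a separate design decision.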