keikoproj / lifecycle-manager

Graceful AWS scaling event on Kubernetes using lifecycle hooks
Apache License 2.0
93 stars 28 forks

Using lifecycle-manager with spot instances #65

Open janavenkat opened 4 years ago

janavenkat commented 4 years ago

Is this a BUG REPORT or FEATURE REQUEST?:

Maybe a bug, or a workaround is needed.

What happened:

lifecycle-manager does not work during spot instance termination. I also checked https://github.com/keikoproj/lifecycle-manager/issues/18#issuecomment-547652760.

Is it possible to do this with an Input Transformer?

image

What you expected to happen:

lifecycle-manager should handle the spot instance termination event, or there should be a workaround for it.

How to reproduce it (as minimally and precisely as possible):

Create a spot instance ASG and wait for AWS to terminate an instance.

Environment:

The following log line appears when the spot interruption event occurs: level=warning msg="got unsupported event type: ''"

eytan-avisror commented 4 years ago

lifecycle-manager currently does not support spot termination events. Supporting spot termination events in the SQS queue would require significant refactoring.

We are open to PRs that can achieve spot termination handling.

In the meantime, you can run https://github.com/aws/aws-node-termination-handler as a DaemonSet to achieve this.

janavenkat commented 4 years ago

Thank you for the response. By the way, we can customize the target using an input transformer, as shown in the attached screenshot.

yuri-1987 commented 3 years ago

https://github.com/aws/aws-node-termination-handler does not solve this issue. It can potentially cause other problems: it cordons the node, and the k8s service controller then removes the node from the ELB immediately without draining it first, which can drop in-flight requests. I understand that lifecycle-manager was not built to handle spot interruptions, and that the content of the AWS EventBridge event is relatively minimal; it provides mostly the instance ID. I assume that lifecycle-manager wants to know whether the event is related to the cluster it is running on. I imagine this can be solved by checking the tags on that EC2 instance before handling the event, or with the current check that verifies whether the node in question is seen in the cluster nodes.

I did a little POC and used the input transformer in AWS EventBridge to translate the EC2 spot interruption event into the event sent by an ASG lifecycle hook.
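
The tag check suggested above could look roughly like the sketch below. It is not how lifecycle-manager works today. It assumes the nodes carry the common `kubernetes.io/cluster/<cluster-name>` tag with a value of `owned` or `shared`, and that the tags have already been fetched in the `Key`/`Value` list shape the EC2 API returns. The function name, the cluster names and the sample tags are hypothetical.

```python
from typing import Dict, List


def belongs_to_cluster(tags: List[Dict[str, str]], cluster_name: str) -> bool:
    """Return True if the EC2 tags mark the instance as part of cluster_name.

    Assumption: nodes are tagged kubernetes.io/cluster/<name> = owned|shared.
    """
    wanted_key = f"kubernetes.io/cluster/{cluster_name}"
    for tag in tags:
        if tag.get("Key") == wanted_key and tag.get("Value") in ("owned", "shared"):
            return True
    return False


# Hypothetical tags for one spot instance, in the EC2 API's Key/Value shape.
example_tags = [
    {"Key": "Name", "Value": "spot-worker"},
    {"Key": "kubernetes.io/cluster/prod-a", "Value": "owned"},
]

print(belongs_to_cluster(example_tags, "prod-a"))
print(belongs_to_cluster(example_tags, "prod-b"))
```

A check like this would let a handler skip events for instances that belong to another cluster before it attempts any drain.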

input path:

{"id":"$.id","instance":"$.detail.instance-id","time":"$.time"}

input template:

{
    "LifecycleHookName": "lifecycle-manager",
    "AccountId": "YOUR_ACCOUNT_ID",
    "RequestId": <id>,
    "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
    "AutoScalingGroupName": "ASG_NAME",
    "Service": "AWS Auto Scaling",
    "Time": <time>,
    "EC2InstanceId": <instance>,
    "LifecycleActionToken": "CHECK_TOKEN_FROM_THE_ORIGINAL_EVENT"
}
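
To make the mapping easier to follow, here is a small local sketch of what the input path and input template above do to an event. It does not call AWS; it only mirrors the field mapping (`$.id`, `$.detail.instance-id`, `$.time`). The function name and all sample values (account ID, ASG name, token, event IDs) are hypothetical placeholders.

```python
import json
from typing import Any, Dict


def spot_event_to_lifecycle_message(
    event: Dict[str, Any],
    account_id: str,
    asg_name: str,
    hook_name: str,
    token: str,
) -> Dict[str, str]:
    """Build a lifecycle-hook-shaped message from a spot interruption event."""
    return {
        "LifecycleHookName": hook_name,
        "AccountId": account_id,
        "RequestId": event["id"],                         # $.id
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        "AutoScalingGroupName": asg_name,
        "Service": "AWS Auto Scaling",
        "Time": event["time"],                            # $.time
        "EC2InstanceId": event["detail"]["instance-id"],  # $.detail.instance-id
        "LifecycleActionToken": token,
    }


# Hypothetical spot interruption warning, reduced to the fields used above.
sample_event = {
    "id": "00000000-0000-0000-0000-000000000000",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "time": "2020-12-21T22:01:35Z",
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

message = spot_event_to_lifecycle_message(
    sample_event,
    account_id="YOUR_ACCOUNT_ID",
    asg_name="ASG_NAME",
    hook_name="lifecycle-manager",
    token="PLACEHOLDER_TOKEN",
)
print(json.dumps(message, indent=2))
```

Note that the ASG name and the token are static in the template, so every transformed event carries the same values regardless of which instance is interrupted.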

lifecycle-manager was actually able to parse this, but rejected it:

level=debug msg="event 11e3aa3d-29f0-955f-24a0-xxxxxxxxxx has been rejected for processing: instance i-0cf2b023801xxxx is not seen in cluster nodes"
level=debug msg="deleting message with receipt ID AQEBh9ZhhiI.....

Obviously, this deletion can cause an issue if you are running multiple clusters with multiple lifecycle-managers. As a dumb solution, I would just try to stream this spot event to multiple queues and let each lifecycle-manager handle its own queue.
I will update later if this works.

eytan-avisror commented 3 years ago

@yuri-1987 Interesting solution, transforming the spot event into a lifecycle hook event. The error message indicates that the instance ID was not found on any of the cluster nodes. Was the node already removed by the time this event was received?

I think if you are able to send the correct termination event early enough it would be processed and drained/excluded from ELB.

If the instance is already terminated by the time lifecycle-manager gets the event it would reject it as above.

yuri-1987 commented 3 years ago

Hi @eytan-avisror, sorry for not getting back to you earlier. Regarding your question: EventBridge can't filter EC2 spot interruption events, so I'm sending all spot events to SQS. The log snippet attached to my previous comment is for a node that is indeed not in this cluster; it belongs to another cluster.
Our account hosts several clusters, so my idea was to create one SQS queue per cluster and have EventBridge send the same spot event to all of the queues. lifecycle-manager runs on every cluster and watches its own queue, so the lifecycle-manager in the right cluster eventually gets the event and tries to handle it.
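
The fan-out idea can be illustrated with a small in-memory simulation. This is only a sketch of the message flow, not lifecycle-manager code: plain `deque` objects stand in for the per-cluster SQS queues, and the cluster names, instance IDs and function names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set

Message = Dict[str, str]


def fan_out(message: Message, queues: Dict[str, Deque[Message]]) -> None:
    """Deliver a copy of the same event to every cluster's queue."""
    for queue in queues.values():
        queue.append(dict(message))


def consume(queue: Deque[Message], cluster_instances: Set[str]) -> List[str]:
    """Drain one queue, handling only events for instances in this cluster.

    Every message is removed from the queue, handled or rejected, which is
    harmless here because each cluster has its own copy.
    """
    handled = []
    while queue:
        message = queue.popleft()
        if message["EC2InstanceId"] in cluster_instances:
            handled.append(message["EC2InstanceId"])
    return handled


queues: Dict[str, Deque[Message]] = {"cluster-a": deque(), "cluster-b": deque()}
fan_out({"EC2InstanceId": "i-aaa"}, queues)

handled_a = consume(queues["cluster-a"], {"i-aaa"})  # instance lives in cluster-a
handled_b = consume(queues["cluster-b"], {"i-bbb"})  # cluster-b rejects the event
print(handled_a, handled_b)
```

With a single shared queue, the rejecting consumer could delete the message before the owning cluster sees it. Separate queues avoid that at the cost of every cluster receiving every event.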

Here is a log from that spot event. I assume that 120 seconds (the spot interruption notice) is not enough for lifecycle-manager to handle it:

time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> received termination event"
time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> sending heartbeat (1/24)"
time="2020-12-21T22:01:35Z" level=error msg="i-0225da028f00xxxxx> failed to send heartbeat for event: ValidationError: No active Lifecycle Action found with token 9c2c3045-e401-4c50-a439-7a133073xxxx\n\tstatus code: 400, request id: 6d5dc896-b2aa-430b-8950-b9dcdb2dxxxx"
time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> draining node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> completed drain for node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> starting load balancer drain worker"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> scanner starting"
time="2020-12-21T22:02:18Z" level=info msg="i-0225da028f00xxxxx> checking targetgroup/elb membership"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> received termination event"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> sending heartbeat (1/24)"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> draining node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> completed drain for node/ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:15Z" level=info msg="i-0225da028f00xxxxx> starting load balancer drain worker"
time="2020-12-21T22:04:16Z" level=info msg="i-0225da028f00xxxxx> scanner starting"
time="2020-12-21T22:04:16Z" level=info msg="i-0225da028f00xxxxx> checking targetgroup/elb membership"
time="2020-12-21T22:04:24Z" level=info msg="i-0225da028f00xxxxx> found 0 target groups & 142 classic-elb"
time="2020-12-21T22:04:49Z" level=info msg="i-0225da028f00xxxxx> queuing deregistrator"
time="2020-12-21T22:04:49Z" level=info msg="i-0225da028f00xxxxx> queuing waiters"
time="2020-12-21T22:04:49Z" level=info msg="deregistrator> no active targets for deregistration"
time="2020-12-21T22:04:50Z" level=error msg="call failed with output: Error from server (NotFound): nodes \"ip-172-24-77-206.ec2.internal\" not found\n,  error: exit status 1"
time="2020-12-21T22:04:50Z" level=error msg="failed to annotate node ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:04:50Z" level=info msg="event d116b980-356d-adac-dcd4-01c8e852cxxx completed processing"
time="2020-12-21T22:04:50Z" level=info msg="i-0225da028f00xxxxx> setting lifecycle event as completed with result: CONTINUE"
time="2020-12-21T22:04:50Z" level=info msg="event d116b980-356d-adac-dcd4-01c8e852cxxx for instance i-0225da028f00xxxxx completed after 194.795793921s"
time="2020-12-21T22:05:50Z" level=info msg="i-0225da028f00xxxxx> found 0 target groups & 142 classic-elb"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> queuing deregistrator"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> queuing waiters"
time="2020-12-21T22:06:04Z" level=info msg="deregistrator> no active targets for deregistration"
time="2020-12-21T22:06:04Z" level=error msg="call failed with output: Error from server (NotFound): nodes \"ip-172-24-77-206.ec2.internal\" not found\n,  error: exit status 1"
time="2020-12-21T22:06:04Z" level=error msg="failed to annotate node ip-172-24-77-206.ec2.internal"
time="2020-12-21T22:06:04Z" level=info msg="event a505da1d-536b-e777-ed3b-3abd96f0ebae completed processing"
time="2020-12-21T22:06:04Z" level=info msg="i-0225da028f00xxxxx> setting lifecycle event as completed with result: CONTINUE"
time="2020-12-21T22:06:04Z" level=error msg="failed to complete lifecycle action: ValidationError: No active Lifecycle Action found with instance ID i-0225da028f00xxxxx\n\tstatus code: 400, request id: e0f8ea0c-3088-4c2f-9537-efbd399c4130"
time="2020-12-21T22:06:04Z" level=info msg="event a505da1d-536b-e777-ed3b-3abd96f0ebae for instance i-0225da028f00xxxxx completed after 108.679362044s"
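
As a rough check of the 120-second assumption, the elapsed time can be computed from the timestamps in the log above. The sketch below uses two lines copied from that log, the first "received termination event" and the first "completed processing". The helper name and the `NOTICE_SECONDS` constant are illustrative.

```python
import re
from datetime import datetime

TIME_RE = re.compile(r'time="([^"]+)"')
NOTICE_SECONDS = 120  # the window assumed in the comment above


def parse_time(line: str) -> datetime:
    """Extract the time="..." field from a log line in the format shown above."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError("no time field in log line")
    return datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%SZ")


received = 'time="2020-12-21T22:01:35Z" level=info msg="i-0225da028f00xxxxx> received termination event"'
completed = 'time="2020-12-21T22:04:50Z" level=info msg="event d116b980-356d-adac-dcd4-01c8e852cxxx completed processing"'

elapsed = (parse_time(completed) - parse_time(received)).total_seconds()
print(elapsed, elapsed > NOTICE_SECONDS)
```

The result is about 195 seconds, which agrees with the "completed after 194.795793921s" line and is well past 120 seconds.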