spacelift-io / ec2-workerpool-autoscaler

"Manual" autoscaler for the AWS EC2-based Spacelift worker pool
MIT License

Lambda never recovers when encountering an error #16

Closed lorengordon closed 1 week ago

lorengordon commented 2 weeks ago

Been running this Lambda function for a while now, and it's been pretty great. Noticed this morning that for some reason the ASG wasn't updating capacity based on the Spacelift queue and went to investigate. Found an error:

2024/10/25 17:20:34 
{
    "errorMessage": "could not kill instance: could not detach instance from autoscaling group: operation error Auto Scaling: DetachInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded",
    "errorType": "wrapError"
}

Seemingly the Lambda never recovered after that? Found an instance detached from the ASG, terminated it manually. Drained all the workers in Spacelift. Set the ASG desired capacity to 1. Killed the last worker and let it respawn. That seems to have kicked it back into gear.

(Note that the error log is not structured properly like all the other INFO logs, but that's a separate issue. The Lambda error handling needs some work, I think.)

Apollorion commented 2 weeks ago

Hey @lorengordon, did the lambda function continuously spin up and error with that response, or did it error once and then stop?

lorengordon commented 2 weeks ago

It errored just the one time. It continued to execute on the schedule, but every other time it ran it simply exited with the normal INFO message, "no scaling decision to be made".

It happened again yesterday, so I was able to narrow down the workflow to get it working. The instance it failed to kill was left detached from the auto-scaling group but still running. In Spacelift, it was marked as drained. Once I terminated the instance, the lambda recovered on the next execution and began updating the auto-scaling group based on the Spacelift worker pool queue.

lorengordon commented 2 weeks ago

I suppose I'm seeing three areas where things are not behaving quite right:

  1. When the lambda detaches the instance from the ASG, error handling could be improved so it recovers when the action fails.
  2. When the lambda queries Spacelift for the worker queue, it should handle drained workers rather than just skipping work. In theory, a drained worker could be terminated if it has already been removed from the ASG. That would make 1 and 2 the same from a handling perspective: let 1 fail, but handle the resolution on the lambda's next execution.
  3. The error logging does not follow the same structured-logging format as the rest of the output, which made the error message harder to find. I first saw structured logs at the INFO level, so I assumed I could query for ERROR. But the actual error logs use a completely different output format, so that query found nothing.

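To make points 1 and 2 concrete, the reconciliation could be a pure decision step the lambda runs each cycle: a worker that is drained in Spacelift and already detached from the ASG is just leftover from a failed kill, so terminating it lets a previous failure resolve itself on the next run. A rough sketch (all type and function names are hypothetical, not the autoscaler's actual code):

```go
package main

import "fmt"

// Worker is a hypothetical merged view of a Spacelift worker and its
// EC2 instance, as the autoscaler might see it on each scheduled run.
type Worker struct {
	Drained bool // marked drained in Spacelift
	InASG   bool // still attached to the auto-scaling group
	Running bool // EC2 instance is still running
}

// nextAction decides what the lambda should do with a worker this run.
// A drained worker whose instance is detached but still running is the
// leftover of an interrupted kill, so it gets terminated rather than
// skipped.
func nextAction(w Worker) string {
	switch {
	case w.Drained && !w.InASG && w.Running:
		return "terminate" // finish the interrupted kill
	case w.Drained && w.InASG:
		return "detach-and-terminate" // normal scale-down path
	default:
		return "none"
	}
}

func main() {
	stuck := Worker{Drained: true, InASG: false, Running: true}
	fmt.Println(nextAction(stuck)) // terminate
}
```

With something like this, the failed `DetachInstances` call in the original error would leave the worker in the "drained, detached, running" state, and the next execution would clean it up instead of reporting "no scaling decision to be made".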
ilya-hontarau commented 1 week ago

Hey, @lorengordon. Thank you for your contribution. I merged the fix today and made a new release.

Let us know if you have any problems.

lorengordon commented 1 week ago

@ilya-hontarau Thanks so much! We just updated this morning. I'll let you know if the issue recurs.

Should I open a separate issue about the way error messages are not using structured logs?

ilya-hontarau commented 6 days ago

@lorengordon sorry for the late response. Yes, please, a new issue for the logging makes sense.