Closed lorengordon closed 1 week ago
Hey @lorengordon did the lambda function continuously spin up and error with that response or did it error once and then stop?
It errored just the one time. It continued to execute on the schedule, but every other time it ran it simply exited with the normal INFO message, "no scaling decision to be made".
It happened again yesterday, so I was able to narrow down the workflow to get it working. The instance it failed to kill was left detached from the auto-scaling group but still running. In Spacelift, it was marked as drained. Once I terminated the instance, the lambda recovered on the next execution and began updating the autoscaling group based on the Spacelift worker pool queue.
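For illustration, the stuck state described here (a worker marked as drained in Spacelift whose instance is detached from the ASG but still running) could in principle be detected and cleaned up on the next run. The following is only a rough sketch of that decision in Python, not the Lambda's actual code: the `Worker` type, its fields, and `plan_cleanup` are all hypothetical names, and the three inputs are assumed to have been fetched from Spacelift and AWS beforehand.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Worker:
    """Hypothetical view of a Spacelift worker; field names are assumptions."""
    id: str
    instance_id: str
    drained: bool


def plan_cleanup(
    workers: Iterable[Worker],
    asg_instance_ids: Iterable[str],
    running_instance_ids: Iterable[str],
) -> List[str]:
    """Return instance IDs that look orphaned and could be terminated.

    An instance is considered orphaned when its worker is drained, the
    instance is no longer a member of the ASG, and it is still running.
    """
    in_asg = set(asg_instance_ids)
    running = set(running_instance_ids)
    return [
        w.instance_id
        for w in workers
        if w.drained
        and w.instance_id not in in_asg
        and w.instance_id in running
    ]
```

Keeping the decision as a pure function separate from the AWS and Spacelift calls means a failed termination on one run leaves nothing to track: the next scheduled run would compute the same list and retry.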
I suppose I'm seeing three areas where things are not behaving quite right:
drained
and don't just skip work. In theory, a drained worker could be terminated if already removed from the ASG. That would make 1&2 the same from a handling perspective. Let 1 fail, but handle the resolution when the lambda executes next.
Hey, @lorengordon. Thank you for your contribution. I merged the fix today and made a new release.
Let us know if you have any problems.
@ilya-hontarau Thanks so much! We just updated this morning. I'll let you know if the issue recurs.
Should I open a separate issue about the way error messages are not using structured logs?
@lorengordon sorry for the late response, yes, please, a new issue for the logging makes sense
Been running this Lambda function for a while now, and it's been pretty great. Noticed this morning that for some reason the ASG wasn't updating capacity based on the Spacelift queue and went to investigate. Found an error:
Seemingly the Lambda never recovered after that? Found an instance detached from the ASG, terminated it manually. Drained all the workers in Spacelift. Set the ASG desired capacity to 1. Killed the last worker and let it respawn. That seems to have kicked it back into gear.
(Note that the error log is not structured properly like all the other INFO logs, but that's a separate issue. The Lambda error handling needs some work, I think.)
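To make the structured-logging point concrete, here is a minimal sketch of routing errors through the same JSON formatter as INFO messages, so that a failure produces one structured line instead of a raw error dump. It is written in Python with only the standard library, purely as an illustration; it is not how the Lambda is implemented, and the logger name, the field names (`level`, `msg`, `error`), and the "scaling step failed" message are all assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every record, including errors, as a single JSON line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "msg": record.getMessage()}
        if record.exc_info:
            # Fold the traceback into a field instead of printing it raw.
            payload["error"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Build a logger writing JSON lines to `stream` (hypothetical helper)."""
    logger = logging.getLogger("autoscaler-sketch")  # hypothetical name
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def run(logger: logging.Logger, step) -> None:
    """Run one scaling step; failures are logged in the structured format."""
    try:
        step()
    except Exception:
        # logger.exception logs at ERROR level and attaches the traceback.
        logger.exception("scaling step failed")  # hypothetical message
```

With this shape, an error in a step ends up as a line with `"level": "ERROR"` and an `error` field, which log queries can filter the same way as the INFO entries.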