hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

"Service... did not stabilize" error when updating the CloudFormation stack #175

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Attempted to update the CF stack last night with the changes from PR 38:

Observed conditions

Theory on the failure

Evidence for this theory

screen shot 2018-06-17 at 10 52 28
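
For anyone digging into this later: the actual "did not stabilize" reason can be pulled out of the stack's event log rather than hunted down in the console. A minimal boto3 sketch, assuming placeholder stack and region names (not our real values):

```python
# Minimal sketch: print the failure reasons recorded in the stack's event log.
# "ecs-cluster-stack" and the region are placeholders, not the real names.
import boto3

cf = boto3.client("cloudformation", region_name="us-west-2")

events = cf.describe_stack_events(StackName="ecs-cluster-stack")["StackEvents"]
for event in events:
    if event["ResourceStatus"].endswith("FAILED"):
        print(event["Timestamp"], event["LogicalResourceId"],
              event.get("ResourceStatusReason", ""))
```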
MikeTheCanuck commented 6 years ago

This changeset execution is currently consuming all but 747 MB of memory on the EC2 instances:

screen shot 2018-06-17 at 11 04 00
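
For reference, the per-instance remaining memory can be confirmed outside the console with a boto3 sketch like this (cluster name and region are placeholders):

```python
# Sketch: report the remaining MEMORY resource on each container instance.
# "hacko-cluster" and the region are placeholders for the real values.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

arns = ecs.list_container_instances(cluster="hacko-cluster")["containerInstanceArns"]
described = ecs.describe_container_instances(cluster="hacko-cluster",
                                             containerInstances=arns)

for ci in described["containerInstances"]:
    remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
    print(ci["ec2InstanceId"], "remaining memory (MiB):", remaining.get("MEMORY"))
```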

And it's taking an awfully long time to resolve either way...

And the transportService (the 2017 monster) is the only one still in the "UPDATE_IN_PROGRESS" state; all the other affected services have already moved on to "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS".

screen shot 2018-06-17 at 11 05 31
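
For the record, the per-resource update states can also be polled via the API instead of eyeballing the console; a rough sketch with a placeholder stack name:

```python
# Sketch: list which stack resources are still mid-update.
# "ecs-cluster-stack" is a placeholder for the real stack name.
import boto3

cf = boto3.client("cloudformation", region_name="us-west-2")

for res in cf.describe_stack_resources(StackName="ecs-cluster-stack")["StackResources"]:
    if res["ResourceStatus"] == "UPDATE_IN_PROGRESS":
        print(res["LogicalResourceId"], res["ResourceType"], res["ResourceStatus"])
```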
MikeTheCanuck commented 6 years ago

Seeing as the new attempt to bring up transportService has been in this state for ~25 minutes now, with no further message:

screen shot 2018-06-17 at 11 09 03

I'm going to assume that this isn't likely to ultimately succeed.

So I went to the older/active instance of transportService, which was configured down to 1 task, and tuned it down to 0 tasks, hoping this would catch things before they miserably fail and roll back this latest attempt.
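
Roughly the API equivalent of that console change, with placeholder cluster/service names:

```python
# Sketch: scale the old transportService down to zero desired tasks.
# Cluster and service names are placeholders, not the real identifiers.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

ecs.update_service(cluster="hacko-cluster",
                   service="transportService-old",
                   desiredCount=0)
```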

MikeTheCanuck commented 6 years ago

And then, because the 1 remaining Task was still running, I went in and manually stopped it, freeing up the memory on the one EC2 instance:

screen shot 2018-06-17 at 11 14 06
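
Again, roughly the API equivalent of that console action (placeholder names):

```python
# Sketch: find any tasks still running for the old service and stop them.
# Cluster and service names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

task_arns = ecs.list_tasks(cluster="hacko-cluster",
                           serviceName="transportService-old")["taskArns"]
for arn in task_arns:
    ecs.stop_task(cluster="hacko-cluster", task=arn,
                  reason="Freeing memory so the stack update can place the new task")
```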
MikeTheCanuck commented 6 years ago

It finally succeeded, nearly a half-hour into the attempt. Did freeing up the memory from that one lingering 2GB task make the difference? Who's to say? But it's definitely making progress back to a stable cluster:

screen shot 2018-06-17 at 11 19 38
screen shot 2018-06-17 at 11 22 56
screen shot 2018-06-17 at 11 28 18
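
If we ever want to script the "wait for it to settle" part instead of watching the console, something like this sketch would do it (placeholder names, default waiter timeouts):

```python
# Sketch: block until the updated service settles and the stack update completes.
# Cluster, service, and stack names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")
cf = boto3.client("cloudformation", region_name="us-west-2")

ecs.get_waiter("services_stable").wait(cluster="hacko-cluster",
                                       services=["transportService"])
cf.get_waiter("stack_update_complete").wait(StackName="ecs-cluster-stack")
print("Cluster and stack are stable again")
```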
MikeTheCanuck commented 6 years ago

Conclusion

What caused this to fail? Who can know for sure? But the fact that one of our Tasks, when spun up as two instances without the "spread" settings from PR 38 already in place, could still swamp the remaining memory on an EC2 host is a very likely candidate.

We *shouldn't* see this any longer, so long as the cluster (a) keeps the "spread" settings, and (b) never tries to launch a Service that could consume the remaining memory on the EC2 host.
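
For context, the "spread" settings amount to a task placement strategy along these lines; this is a boto3 illustration with placeholder names, not the actual CloudFormation change from PR 38:

```python
# Sketch: create an ECS service that spreads its tasks across AZs and then
# across individual container instances. All names here are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

ecs.create_service(
    cluster="hacko-cluster",
    serviceName="transportService",
    taskDefinition="transportService-task",
    desiredCount=2,
    placementStrategy=[
        # Spread across Availability Zones first...
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        # ...then across instances, so two copies of a memory-heavy task
        # don't land on the same EC2 host.
        {"type": "spread", "field": "instanceId"},
    ],
)
```

Spreading first by AZ and then by instance is what keeps two copies of a memory-heavy task from piling onto one host and starving everything else scheduled there.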