hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

"Service... did not stabilize" error when updating the CloudFormation stack #175

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Attempted to update the CF stack last night with the changes from PR 38:

Observed conditions

Theory on the failure

Evidence for this theory

screen shot 2018-06-17 at 10 52 28
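
For anyone digging into this later: the actual "did not stabilize" reason can be pulled out of the stack's event log rather than hunted down in the console. A minimal boto3 sketch, assuming placeholder stack and region names (not our real values):

```python
# Minimal sketch: print the failure reasons recorded in the stack's event log.
# "ecs-cluster-stack" and the region are placeholders, not the real names.
import boto3

cf = boto3.client("cloudformation", region_name="us-west-2")

events = cf.describe_stack_events(StackName="ecs-cluster-stack")["StackEvents"]
for event in events:
    if event["ResourceStatus"].endswith("FAILED"):
        print(event["Timestamp"], event["LogicalResourceId"],
              event.get("ResourceStatusReason", ""))
```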
MikeTheCanuck commented 6 years ago

This changeset execution is currently consuming all but 747 MB of memory on the EC2 instances:

screen shot 2018-06-17 at 11 04 00
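
For reference, the per-instance remaining memory can be confirmed outside the console with a boto3 sketch like this (cluster name and region are placeholders):

```python
# Sketch: report the remaining MEMORY resource on each container instance.
# "hacko-cluster" and the region are placeholders for the real values.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

arns = ecs.list_container_instances(cluster="hacko-cluster")["containerInstanceArns"]
described = ecs.describe_container_instances(cluster="hacko-cluster",
                                             containerInstances=arns)

for ci in described["containerInstances"]:
    remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
    print(ci["ec2InstanceId"], "remaining memory (MiB):", remaining.get("MEMORY"))
```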

And it's taking an awfully long time to resolve either way...

And the transportService (the 2017 monster) is the only one still in the "UPDATE_IN_PROGRESS" state; all the other affected services have already moved on to "UPDATE_COMPLETE_CLEANUP_IN_PROGRESS".

screen shot 2018-06-17 at 11 05 31
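
For the record, the per-resource update states can also be polled via the API instead of eyeballing the console; a rough sketch with a placeholder stack name:

```python
# Sketch: list which stack resources are still mid-update.
# "ecs-cluster-stack" is a placeholder for the real stack name.
import boto3

cf = boto3.client("cloudformation", region_name="us-west-2")

for res in cf.describe_stack_resources(StackName="ecs-cluster-stack")["StackResources"]:
    if res["ResourceStatus"] == "UPDATE_IN_PROGRESS":
        print(res["LogicalResourceId"], res["ResourceType"], res["ResourceStatus"])
```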
MikeTheCanuck commented 6 years ago

Seeing as the new attempt to bring up transportService has been in this state for ~25 minutes now, with no further message:

screen shot 2018-06-17 at 11 09 03

I'm going to assume that this isn't likely to ultimately succeed.

So I went to the older/active instance of transportService, which was configured down to 1 task, and tuned it down to 0 tasks, hoping this would catch things before they miserably fail and roll back this latest attempt.
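
Roughly the API equivalent of that console change, with placeholder cluster/service names:

```python
# Sketch: scale the old transportService down to zero desired tasks.
# Cluster and service names are placeholders, not the real identifiers.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

ecs.update_service(cluster="hacko-cluster",
                   service="transportService-old",
                   desiredCount=0)
```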

MikeTheCanuck commented 6 years ago

And then, because the 1 remaining Task was still running, I went in and manually stopped it, freeing up the memory on the one EC2 instance:

screen shot 2018-06-17 at 11 14 06
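
Again, roughly the API equivalent of that console action (placeholder names):

```python
# Sketch: find any tasks still running for the old service and stop them.
# Cluster and service names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

task_arns = ecs.list_tasks(cluster="hacko-cluster",
                           serviceName="transportService-old")["taskArns"]
for arn in task_arns:
    ecs.stop_task(cluster="hacko-cluster", task=arn,
                  reason="Freeing memory so the stack update can place the new task")
```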
MikeTheCanuck commented 6 years ago

It finally succeeded, nearly a half-hour into the attempt. Did freeing up the memory from that one lingering 2GB task make the difference? Who's to say? But it's definitely making progress back to a stable cluster:

screen shot 2018-06-17 at 11 19 38
screen shot 2018-06-17 at 11 22 56
screen shot 2018-06-17 at 11 28 18
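
If we ever want to script the "wait for it to settle" part instead of watching the console, something like this sketch would do it (placeholder names, default waiter timeouts):

```python
# Sketch: block until the updated service settles and the stack update completes.
# Cluster, service, and stack names are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")
cf = boto3.client("cloudformation", region_name="us-west-2")

ecs.get_waiter("services_stable").wait(cluster="hacko-cluster",
                                       services=["transportService"])
cf.get_waiter("stack_update_complete").wait(StackName="ecs-cluster-stack")
print("Cluster and stack are stable again")
```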
MikeTheCanuck commented 6 years ago

Conclusion

What caused this to fail? Who can know for sure? But the fact that one of our Tasks, when spun up as two instances without the "spread" settings from PR 38 already in place, could still swamp the remaining memory on an EC2 host is a very likely candidate.

We *shouldn't* see this any longer, so long as the cluster (a) keeps the "spread" settings, and (b) never tries to launch a Service that could consume the remaining memory on the EC2 host.
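
For context, the "spread" settings amount to a task placement strategy along these lines; this is a boto3 illustration with placeholder names, not the actual CloudFormation change from PR 38:

```python
# Sketch: create an ECS service that spreads its tasks across AZs and then
# across individual container instances. All names here are placeholders.
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

ecs.create_service(
    cluster="hacko-cluster",
    serviceName="transportService",
    taskDefinition="transportService-task",
    desiredCount=2,
    placementStrategy=[
        # Spread across Availability Zones first...
        {"type": "spread", "field": "attribute:ecs.availability-zone"},
        # ...then across instances, so two copies of a memory-heavy task
        # don't land on the same EC2 host.
        {"type": "spread", "field": "instanceId"},
    ],
)
```

Spreading first by AZ and then by instance is what keeps two copies of a memory-heavy task from piling onto one host and starving everything else scheduled there.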