Cluster stack update failing due to ECS autoscaling group unable to deploy new EC2 hosts

hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

Three times in a row tonight, CloudFormation "update stack" operations have failed with the following messages in the "ECS" nested stack:

Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT15M when new instances are added to the autoscaling group.

Terminating instance(s) [i-081d4928c8e82e463]; replacing with 1 new instance(s).

Successfully terminated instance(s) [i-081d4928c8e82e463] (Progress 50%).

New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.

Failed to receive 1 resource signal(s) for the current batch. Each resource signal timeout is counted as a FAILURE.

Received 0 SUCCESS signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

The following resource(s) failed to update: [ECSAutoScalingGroup].

At which point the update rolls back.
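
To pull only the failure messages out of the nested stack's event history, something like the following might help. This is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper name `failed_events` and the stack name in the usage comment are hypothetical and not taken from our tooling.

```python
def failed_events(cfn_client, stack_name):
    """Return (logical_id, status, reason) tuples for *_FAILED stack events.

    cfn_client is assumed to be a boto3 CloudFormation client. Only the
    first page of describe_stack_events results is inspected here.
    """
    response = cfn_client.describe_stack_events(StackName=stack_name)
    return [
        (
            event["LogicalResourceId"],
            event["ResourceStatus"],
            event.get("ResourceStatusReason", ""),
        )
        for event in response["StackEvents"]
        if event["ResourceStatus"].endswith("_FAILED")
    ]


# Hypothetical usage (the stack name is a placeholder):
#   import boto3
#   for row in failed_events(boto3.client("cloudformation"), "my-ecs-nested-stack"):
#       print(row)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at the real account.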

I haven't seen this behaviour in recent memory. My working hypothesis is that AWS recently released a new AMI, and that something in the ECS configuration we apply to the AMI after launch is no longer compatible with it:

- There are two EC2 boxes that host the non-Fargate containers. Six containers remain to be migrated to Fargate, and that will take more effort than I have time for right now. (HELP NEEDED)
- Last year I updated our ECS configuration so that, rather than hard-coding an AMI that would soon become obsolete and unsupported, it always references the latest AMI published by AWS (see https://github.com/hackoregon/hackoregon-aws-infrastructure/blob/master/infrastructure/ecs-cluster.yaml#L40).
- Since then I've never seen a problem, and I had forgotten about it until tonight.
- The two EC2 instances currently running are based on the AMI amzn-ami-2018.03.y-amazon-ecs-optimized (ami-0f7bc74af1927e7c8).
- The EC2 instances that fail to launch each time I try to update the stack tonight use the AMI amzn-ami-2018.03.20191014-amazon-ecs-optimized (ami-0b148ba19fce895b3).
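
One way to check the AMI hypothesis is to compare what the running instances use against what AWS currently recommends. The sketch below assumes the template resolves the AMI through the public SSM parameter for the Amazon Linux ECS-optimized image, which I have not confirmed here; the function names are hypothetical, and the clients are assumed to be boto3 `ssm` and `ec2` clients.

```python
# Public SSM parameter for the recommended Amazon Linux ECS-optimized AMI.
# Assumption: this is the parameter our template effectively tracks.
ECS_AMI_PARAM = "/aws/service/ecs/optimized-ami/amazon-linux/recommended/image_id"


def latest_ecs_ami(ssm_client, name=ECS_AMI_PARAM):
    """Return the AMI ID currently published under the given SSM parameter."""
    return ssm_client.get_parameter(Name=name)["Parameter"]["Value"]


def instances_not_on(ec2_client, instance_ids, ami_id):
    """Map instance ID -> AMI ID for instances that are not running ami_id."""
    response = ec2_client.describe_instances(InstanceIds=instance_ids)
    return {
        instance["InstanceId"]: instance["ImageId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
        if instance["ImageId"] != ami_id
    }


# Hypothetical usage (instance IDs are placeholders):
#   import boto3
#   latest = latest_ecs_ami(boto3.client("ssm"))
#   print(instances_not_on(boto3.client("ec2"), ["i-..."], latest))
```

If the healthy instances show up in the result while the failed launches do not, that would be consistent with the new AMI being the variable that changed, though it would not by itself prove the post-launch configuration is at fault.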

Cluster stack update failing due to ECS autoscaling group unable to deploy new EC2 hosts #282