hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Cluster stack update failing due to ECS autoscaling group unable to deploy new EC2 hosts #282

Closed by MikeTheCanuck 4 years ago

MikeTheCanuck commented 4 years ago

CloudFormation "update stack" operations have failed three times in a row tonight, each time with the following messages in the "ECS" nested stack:

Rolling update initiated. Terminating 2 obsolete instance(s) in batches of 1, while keeping at least 1 instance(s) in service. Waiting on resource signals with a timeout of PT15M when new instances are added to the autoscaling group.

Terminating instance(s) [i-081d4928c8e82e463]; replacing with 1 new instance(s).

Successfully terminated instance(s) [i-081d4928c8e82e463] (Progress 50%).

New instance(s) added to autoscaling group - Waiting on 1 resource signal(s) with a timeout of PT15M.

Failed to receive 1 resource signal(s) for the current batch. Each resource signal timeout is counted as a FAILURE.

Received 0 SUCCESS signal(s) out of 2. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

The following resource(s) failed to update: [ECSAutoScalingGroup].

At which point the update rolls back.
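
To make the failure condition in those messages concrete, here is a minimal Python sketch of the arithmetic they describe. It is a simplified model and not CloudFormation's actual implementation: the function names `pause_time_seconds` and `update_meets_threshold` are hypothetical, and it assumes that a signal which times out simply counts as "not a success".

```python
import re


def pause_time_seconds(duration: str) -> int:
    """Convert a simple ISO 8601 duration such as 'PT15M' to seconds.

    Only the hours/minutes/seconds form (PT#H#M#S) is handled here.
    """
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", duration)
    if match is None or not any(match.groups()):
        raise ValueError(f"unsupported duration: {duration!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def update_meets_threshold(success_signals: int,
                           expected_signals: int,
                           min_successful_percent: int = 100) -> bool:
    """Return True if enough instances signalled SUCCESS.

    Simplified model (assumption): the update passes only when
    success_signals / expected_signals is at least min_successful_percent.
    """
    if expected_signals <= 0:
        raise ValueError("expected_signals must be positive")
    # Integer comparison avoids floating-point rounding at the boundary.
    return success_signals * 100 >= min_successful_percent * expected_signals


# Values taken from the messages above: timeout PT15M, 0 of 2 signals, 100% required.
print(pause_time_seconds("PT15M"))
print(update_meets_threshold(0, 2, 100))
```

Under this model, each new instance has 15 minutes to report success. With zero of two signals received against a 100% requirement, the check fails, which matches the rollback seen here. The useful question is therefore why the new instance never sent its signal.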

MikeTheCanuck commented 4 years ago

I haven't seen this behaviour in recent memory. My hypothesis so far is that AWS recently released a new AMI, and that something in the ECS configuration we apply to the AMI after launch is no longer compatible with it: