CloudFormation changeset update to hackoregon stack rolls back

Over the past couple of days we attempted a number of deploys of master.yaml using a change set in CloudFormation. Every attempt failed, and in a way we haven't observed before: the UPDATE_ROLLBACK takes far longer to complete than our typical failures, which roll back in under ten minutes.

The following type of error is observed for three of the services in our stack: 2018HA (housing-affordability), 2018LE (local-elections) and 2018ND (neighborhood-development):

Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018ND-IYGSWS3N2LJ9/69660fe0-6104-11e8-a46c-500c32c86c35 was not successfully updated. Currently in UPDATE_ROLLBACK_IN_PROGRESS with reason: The following resource(s) failed to update: [Service].
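
The message above only names the failed logical resource ([Service]); the underlying reason usually sits in the nested stack's own event log. A minimal sketch for pulling out the failed events with boto3, assuming credentials for the account are configured. The function names are hypothetical, and the region default is taken from the ARN above.

```python
def failed_events(stack_events):
    """Return (logical id, status, reason) for every event whose status ends in _FAILED."""
    return [
        (e["LogicalResourceId"], e["ResourceStatus"], e.get("ResourceStatusReason", ""))
        for e in stack_events
        if e["ResourceStatus"].endswith("_FAILED")
    ]


def fetch_failed_events(stack_name, region="us-west-2"):
    """Page through a stack's events and keep only the failures.

    stack_name may be a stack name or a full stack ARN, such as the
    embedded stack ARN quoted in the error message.
    """
    import boto3  # imported lazily so failed_events() works without AWS access

    cfn = boto3.client("cloudformation", region_name=region)
    events = []
    paginator = cfn.get_paginator("describe_stack_events")
    for page in paginator.paginate(StackName=stack_name):
        events.extend(page["StackEvents"])
    return failed_events(events)
```

Passing the embedded stack ARN to `fetch_failed_events` should surface the `ResourceStatusReason` recorded for the Service resource, if CloudFormation recorded one.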

The prevailing theory right now is that one of the two EC2 hosts for our containers is unhealthy and cannot launch containers. As a result, the deploy of the current task definitions never succeeds, and the stack update fails with it.
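
One way to test this theory is to ask ECS how it sees each container instance. The sketch below flags instances that are not ACTIVE or whose ECS agent is disconnected. This is only a proxy for "unhealthy" (an instance can pass both checks and still fail to start containers), and the cluster name is a placeholder that would have to be supplied by the caller.

```python
def unhealthy_instances(container_instances):
    """Return EC2 instance IDs that ECS reports as not ACTIVE or agent-disconnected."""
    return [
        ci["ec2InstanceId"]
        for ci in container_instances
        if ci.get("status") != "ACTIVE" or not ci.get("agentConnected", False)
    ]


def fetch_unhealthy_instances(cluster, region="us-west-2"):
    """Look up the cluster's container instances and filter them.

    cluster is the ECS cluster name or ARN (not given in this issue).
    """
    import boto3  # imported lazily so unhealthy_instances() works without AWS access

    ecs = boto3.client("ecs", region_name=region)
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []  # describe_container_instances rejects an empty list
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    return unhealthy_instances(described["containerInstances"])
```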

Consequences

We are unable to update the Path and HealthCheckPath settings for the services, making it impossible for the containers to receive requests via the load balancer (and ultimately causing ECS to recycle the containers and try again).
We cannot update the Desired Tasks from 0 to 2 for the services whose containers are deployable (ND, LE and DR).
We will be unable to update the memory configuration for newer services (other than DR and TS, whose settings provide ample headroom), thus risking that some containers will be unable to load because they consume more RAM than allocated by default (100 MB).
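
To see which services are exposed to the memory risk in the last point, the container definitions of each task definition can be checked against the 100 MB default. A rough sketch, assuming the input has the shape of the `containerDefinitions` list returned by the ECS `describe_task_definition` call; the function name and the sample container names are made up for illustration, and ECS expresses these limits in MiB.

```python
DEFAULT_MEMORY_MB = 100  # default allocation quoted in this issue


def containers_at_default(container_definitions, default_mb=DEFAULT_MEMORY_MB):
    """Return names of containers whose memory limit is at or below the default.

    Uses the hard limit ("memory") when present, otherwise the soft limit
    ("memoryReservation"); a container with neither is treated as at risk.
    """
    flagged = []
    for c in container_definitions:
        limit = c.get("memory") or c.get("memoryReservation") or 0
        if limit <= default_mb:
            flagged.append(c["name"])
    return flagged


# Example with made-up container definitions:
sample_definitions = [
    {"name": "example-small", "memory": 100},
    {"name": "example-large", "memory": 2048},
    {"name": "example-unset"},
]
print(containers_at_default(sample_definitions))
```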

