hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

CloudFormation changeset update to hackoregon stack rolls back #142

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Over the past couple of days I attempted a number of deploys of master.yaml using a change set in CloudFormation. They have all failed, but in a weird way that we haven't observed before: the UPDATE_ROLLBACK takes forever to complete (as opposed to typical failures, which roll back in under ten minutes).

The following type of error is observed for three of the services in our stack - 2018HA, 2018LE, 2018ND (housing-affordability, local-elections and neighborhood-development):

Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018ND-IYGSWS3N2LJ9/69660fe0-6104-11e8-a46c-500c32c86c35 was not successfully updated. Currently in UPDATE_ROLLBACK_IN_PROGRESS with reason: The following resource(s) failed to update: [Service].
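The parent stack only reports that [Service] failed; the nested stack's own events usually carry a more specific reason. The following is a minimal sketch of pulling those out, assuming boto3 is installed and AWS credentials are configured. The helper names `failed_events` and `fetch_events` are hypothetical (not part of this repo), and the stack name in the usage comment is simply taken from the ARN above.

```python
def failed_events(events):
    """Return (logical id, status, reason) for every event whose status ends in _FAILED."""
    return [
        (e["LogicalResourceId"], e["ResourceStatus"], e.get("ResourceStatusReason", ""))
        for e in events
        if e["ResourceStatus"].endswith("_FAILED")
    ]


def fetch_events(stack_name, region="us-west-2"):
    """Fetch the most recent page of events for a stack (needs boto3 and credentials)."""
    import boto3  # imported lazily so failed_events() works without boto3 installed

    cfn = boto3.client("cloudformation", region_name=region)
    return cfn.describe_stack_events(StackName=stack_name)["StackEvents"]


# Usage (hypothetical, run against the nested stack named in the error):
#   for row in failed_events(fetch_events("hacko-integration-2018ND-IYGSWS3N2LJ9")):
#       print(row)
```

Only the first page of events is fetched here; a long rollback may need pagination to reach the original failure.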

The prevailing theory right now is that the update is failing because the deploy of current task definitions isn't succeeding, because one of the two EC2 hosts for our containers is not healthy, and is not able to launch containers.
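One way to check the theory above is to ask ECS which container instances it considers unable to take tasks. This is a rough sketch under stated assumptions: boto3 is installed, credentials are configured, and the cluster name is known (it is not given in this issue, so the one in the usage comment is a placeholder). The helper names are hypothetical. Note that a host can be ACTIVE with a connected agent and still fail to launch containers (for example, a full disk), so a clean result here does not rule the theory out.

```python
def unhealthy_instances(container_instances):
    """Return EC2 instance IDs of ECS container instances that look unable to take tasks.

    An instance is flagged when its ECS agent is disconnected or its status
    is anything other than ACTIVE.
    """
    return [
        ci["ec2InstanceId"]
        for ci in container_instances
        if not ci.get("agentConnected", False) or ci.get("status") != "ACTIVE"
    ]


def fetch_container_instances(cluster, region="us-west-2"):
    """Describe all container instances in a cluster (needs boto3 and credentials)."""
    import boto3  # imported lazily so unhealthy_instances() works without boto3 installed

    ecs = boto3.client("ecs", region_name=region)
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    response = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    return response["containerInstances"]


# Usage (cluster name is a placeholder, not taken from this issue):
#   print(unhealthy_instances(fetch_container_instances("my-cluster")))
```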

Consequences

  1. We are unable to update the Path and HealthCheckPath settings for the services, making it impossible for the containers to receive requests via the load balancer (and ultimately causing ECS to recycle the containers and try again)
  2. We cannot update the Desired Tasks from 0 to 2 for the services whose containers are deployable (ND, LE and DR).
  3. We will be unable to update the memory configuration for newer services (other than DR and TS, whose settings provide ample headroom), which risks some containers failing to start because they consume more RAM than is allocated by default (100 MB).

MikeTheCanuck commented 6 years ago

It is possible that Ian has somehow massaged the unhealthy EC2 host enough that it can accept new containers again - there are currently a handful of containers assigned to that host, some of which have survived the initial HealthCheck timeout period.