Convert 2018 Neighborhood Development API service to Fargate

MikeTheCanuck commented 4 years ago

We have a solid CD pattern for 2019 APIs, and we've successfully converted two 2017 APIs. Others can and should feel comfortable migrating the rest of those containers.

Now let's see what it takes to convert a 2018 container service to Fargate. I'm picking on Neighborhood Development semi-randomly because: (a) that project is completely inactive (b) none of its developers are currently part of Hack Oregon AFAIK (c) thus an outage of the API is unlikely to get in anyone's way while we complete the migration.

Addresses #244 for the 2018 Neighborhood Development API.

MikeTheCanuck commented 4 years ago

This commit has had the intended effect: it loads the current 2018ND container image as a Fargate-hosted service, and that service is answering requests from the gunicorn server in front of Django.

https://github.com/hackoregon/hackoregon-aws-infrastructure/commit/8c99de08a11bcc02be6f18393f0d83016c2c9568

However, we have a small problem:

while http://service.civicpdx.org/neighborhood-development/ renders the expected Swaggerified schema for the ND app
none of the API endpoints can show anything useful - /api, /census and /sandbox all throw a 404 (confirmed from the browser and from CloudWatch logs)

So while the Fargate conversion works great, we end up with an API that isn't actually functional. That means there are some problems with the API (Django app) itself, not with the Fargate configuration, but it is still a problem to be solved.

MikeTheCanuck commented 4 years ago

Unfortunately, converting back to the EC2-based service deployment does nothing to improve the API's health - the swagger schema renders fine, but the API endpoints all 404 - so at this point (even if the 2018ND API was working recently) the 2018ND API is down and needs some in-house repair to get back to functional. The CloudFormation configuration is just as "working" in Fargate as it is in EC2-land, so this change stays.

Scratch that - the base API routes such as /api throw a 404, but the actual configured endpoints such as http://service.civicpdx.org/neighborhood-development/api/affordable_housing are working just fine. Stand down, alarms off, back to our usual programming.

MikeTheCanuck commented 4 years ago

Orchestrating and Troubleshooting the migration

The trickiest part of performing the switch from EC2 to Fargate is a problem of resource collisions such as Task Roles and Listener Priorities.

Problem 1: Using the same name for the Resource

In the master.yaml each Resource is given a unique name, and that name is used as a unique variable input for a variety of AWS objects, including the Task Role that we generate to ensure the Service's Task(s) have sufficient access to the AWS resources they need (e.g. SSM parameters).

When migrating an existing service from EC2 to Fargate, the natural temptation is to just copy/paste the existing resource block, comment out the old one, and update or add the Parameters needed for the new Fargate template. Unfortunately, if you re-use the same name (e.g. 2018DR), CloudFormation will fail the stack update and roll back, reporting e.g. Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018DR-QYKK83G1ODA0/696db100-6104-11e8-ac84-50a68d01a68d was not successfully updated. Currently in UPDATE_ROLLBACK_IN_PROGRESS with reason: The following resource(s) failed to create: [TaskRole].

And in the embedded stack for the failed service, you'll see e.g. ecs-service-hacko-integration-2018DR-QYKK83G1ODA0 already exists in stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018DR-QYKK83G1ODA0/696db100-6104-11e8-ac84-50a68d01a68d

Resolution 1: Temporarily use a different name

So when creating the new Fargate-based Resource for the existing service, I temporarily gave it a different name e.g. 2018DiRe. Then after all the rest of the work was done (see Problem 2 and Resolution 2 below), I added a final commit to the PR to rename the resource back to its original name.

Problem 2: Using the same Listener Priorities

The second problem is that when deleting the EC2 service in the same stack update that adds the related Fargate service, ECS often tries to add the ALB listeners for the new service before the old service's listeners have been removed. This creates a collision between two services trying to use the same Priority values (which must be unique within any ALB-based cluster - see current assignments here), such that the stack update fails and rolls back.

When digging into the details of the failed stack update, you'll see:

an event with status of CREATE FAILED (or UPDATE FAILED) whose details will report e.g. Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018ND-1EOS87JFEM0C/ca130150-bba2-11e9-aa5a-0650fec6e554 was not successfully created: The following resource(s) failed to create: [ListenerRule, TaskRole, ListenerRuleTls].

Then in the failed embedded stack's events (which you can find by changing the Stacks filter from "Active" to "Deleted" and showing the nested stacks) you'll see a CREATE FAILED for a Listener-type resource with the detail e.g. Priority '84' is currently in use (Service: AmazonElasticLoadBalancingV2; Status Code: 400; Error Code: PriorityInUse; Request ID: ceffd7a6-bba2-11e9-ba8f-17534ef226a3)

Resolution 2: Temporarily use different Priority values

The solution I found is to perform at least two updates to the stack:

first update adds the Fargate service and deletes the EC2 service, but also allocates unused Priority values to the Fargate service (so that the new service's listeners will not collide with soon-to-be-departed listeners from the old service)
second update allocates the original Priority values to the Fargate service (now that the old service and its listeners have been deleted, and thus the Priority values have been freed up).
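The two-update sequence above can be sketched in Python. This is only an illustration under stated assumptions, not the tooling used for this migration: the function names `pick_unused_priorities` and `plan_updates` are hypothetical, the in-use set in the usage example is made up (only Priority 84 comes from the error message above), and the set of in-use priorities is assumed to have been gathered separately, for example from the current assignments list. The 1 to 50000 bounds are the range ALB accepts for listener rule priorities.

```python
from typing import Iterable, List

# ALB listener rule priorities must fall within this range.
MIN_PRIORITY = 1
MAX_PRIORITY = 50000


def pick_unused_priorities(in_use: Iterable[int], count: int) -> List[int]:
    """Return the `count` lowest priority values not present in `in_use`."""
    taken = set(in_use)
    free: List[int] = []
    candidate = MIN_PRIORITY
    while len(free) < count and candidate <= MAX_PRIORITY:
        if candidate not in taken:
            free.append(candidate)
        candidate += 1
    if len(free) < count:
        raise ValueError("not enough unused priorities available")
    return free


def plan_updates(original: List[int], in_use: Iterable[int]) -> List[List[int]]:
    """Plan the two stack updates for one service.

    Update 1 uses temporary priorities that avoid everything in `in_use`
    (which still includes the old EC2 service's own priorities).
    Update 2 restores the original priorities once the old listeners are gone.
    """
    temporary = pick_unused_priorities(in_use, len(original))
    return [temporary, list(original)]


# Hypothetical example: the old service holds 84 and 85 (one per listener rule).
first_update, second_update = plan_updates([84, 85], {1, 2, 3, 84, 85})
print(first_update)   # priorities for the first stack update
print(second_update)  # priorities for the second stack update
```

Picking the lowest free values is an arbitrary choice for the sketch; any values that no other service on the ALB is using would serve for the first update.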

MikeTheCanuck commented 4 years ago

Note: something we discovered and documented via #268 is that each of the 2018 API containers being migrated to Fargate also needs to have its ecs-deploy.sh script updated to a more recent version.

This is now PR'd to the 2018 Neighborhood Development repo as https://github.com/hackoregon/neighborhoods-2018/pull/111
