hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Convert 2018 Neighborhood Development API service to Fargate #259

Closed MikeTheCanuck closed 4 years ago

MikeTheCanuck commented 4 years ago

We have a solid CD pattern for 2019 APIs, and we've successfully converted two 2017 APIs. Others can and should feel comfortable migrating the rest of those containers.

Now let's see what it takes to convert a 2018 container service to Fargate. I'm picking on Neighborhood Development semi-randomly because (a) that project is completely inactive, (b) none of its developers are currently part of Hack Oregon AFAIK, and (c) an outage of the API is thus unlikely to get in anyone's way while we complete the migration.

Addresses #244 for the 2018 Neighborhood Development API.

MikeTheCanuck commented 4 years ago

This commit has had the intended effect: it loads the current 2018ND container image as a Fargate-hosted service, and that service is answering requests from the gunicorn server in front of Django.

https://github.com/hackoregon/hackoregon-aws-infrastructure/commit/8c99de08a11bcc02be6f18393f0d83016c2c9568

However, we have a small problem:

So while the Fargate conversion works great, we end up with an API that isn't actually functional. That means there are some problems with the API (Django app) itself, not with the Fargate configuration, but it is still a problem to be solved.

MikeTheCanuck commented 4 years ago

Unfortunately, converting back to the EC2-based service deployment does nothing to further the API's health: the swagger schema renders fine, but the API endpoints all 404. So at this point (even if the 2018ND API was working recently) the 2018ND API is down and needs some in-house repair to get back to functional. The CloudFormation configuration is just as "working" in Fargate as it is in EC2-land, so this change stays.

Scratch that - the base API routes such as /api throw a 404, but the actual configured endpoints such as http://service.civicpdx.org/neighborhood-development/api/affordable_housing are working just fine. Stand down, alarms off, back to our usual programming.

MikeTheCanuck commented 4 years ago

Orchestrating and Troubleshooting the migration

The trickiest part of performing the switch from EC2 to Fargate is a problem of resource collisions such as Task Roles and Listener Priorities.

Problem 1: Using the same name for the Resource

In the master.yaml each Resource is given a unique name, and that name is used as a unique variable input for a variety of AWS objects, including the Task Role that we generate to ensure the Service's Task(s) have sufficient access to the AWS resources they need (e.g. SSM parameters).

When migrating an existing service from EC2 to Fargate, the natural temptation is to just copy/paste the existing resource block, comment out the old one, and update or add the Parameters needed for the new Fargate template. Unfortunately, if you re-use the same name (e.g. 2018DR), CloudFormation will fail the stack update and roll back, reporting e.g. Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018DR-QYKK83G1ODA0/696db100-6104-11e8-ac84-50a68d01a68d was not successfully updated. Currently in UPDATE_ROLLBACK_IN_PROGRESS with reason: The following resource(s) failed to create: [TaskRole].

[Screenshot: Screen Shot 2019-08-10 at 14 18 05]

And in the embedded stack for the failed service, you'll see e.g. ecs-service-hacko-integration-2018DR-QYKK83G1ODA0 already exists in stack arn:aws:cloudformation:us-west-2:845828040396:stack/hacko-integration-2018DR-QYKK83G1ODA0/696db100-6104-11e8-ac84-50a68d01a68d

[Screenshot: Screen Shot 2019-08-10 at 14 17 06]

Resolution 1: Temporarily use a different name

So when creating the new Fargate-based Resource for the existing service, I temporarily gave it a different name e.g. 2018DiRe. Then after all the rest of the work was done (see Problem 2 and Resolution 2 below), I added a final commit to the PR to rename the resource back to its original name.
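Before settling on a temporary name, it can help to confirm that the name is not already a logical ID in the deployed stack. The following is a minimal sketch, assuming the input has the shape of the `StackResourceSummaries` list that boto3's CloudFormation `list_stack_resources` call returns. The function name and the sample data are hypothetical and only for illustration.

```python
def logical_id_in_use(summaries, logical_id):
    """Return True if logical_id names a resource that still exists in the stack.

    `summaries` is assumed to look like the StackResourceSummaries list from
    boto3's cloudformation.list_stack_resources() response.
    """
    return any(
        s["LogicalResourceId"] == logical_id
        and s.get("ResourceStatus") != "DELETE_COMPLETE"
        for s in summaries
    )


# Hypothetical sample data, not taken from the real hacko-integration stack.
deployed = [
    {"LogicalResourceId": "2018DR", "ResourceStatus": "UPDATE_COMPLETE"},
    {"LogicalResourceId": "VPC", "ResourceStatus": "CREATE_COMPLETE"},
]

if __name__ == "__main__":
    for candidate in ("2018DR", "2018DiRe"):
        state = "in use" if logical_id_in_use(deployed, candidate) else "free"
        print(f"{candidate}: {state}")
```

A name reported as free here is only a candidate for the temporary rename; the check says nothing about other derived names, such as the Task Role.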

Problem 2: Using the same Listener Priorities

When deleting the EC2 service at the same time as adding the related Fargate service, ECS often tries to add the ALB listeners for the new service before the old service's listeners have been removed. This creates a collision between two services trying to use the same Priority values (which must be unique within any ALB-based cluster - see current assignments here), such that the stack update fails and rolls back.
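The collision described above can be spotted before a stack update by comparing the priorities you plan to use with those already on the listener. This is a minimal sketch, assuming the rules are supplied in the shape of the `Rules` list returned by boto3's `elbv2.describe_rules`, where `Priority` is a string and the default rule has the value `"default"`. The function names and sample values are hypothetical.

```python
def used_priorities(rules):
    """Collect the numeric priorities on a listener, skipping the default rule."""
    return {int(r["Priority"]) for r in rules if r["Priority"] != "default"}


def find_collisions(rules, planned):
    """Return the planned priorities that are already taken, in sorted order."""
    return sorted(used_priorities(rules) & set(planned))


def next_free_priority(rules, start=1):
    """Return the lowest priority >= start that no existing rule uses."""
    taken = used_priorities(rules)
    candidate = start
    while candidate in taken:
        candidate += 1
    return candidate


# Hypothetical listener rules, not the real civicpdx assignments.
existing_rules = [
    {"Priority": "default"},
    {"Priority": "10"},
    {"Priority": "11"},
]

if __name__ == "__main__":
    print(find_collisions(existing_rules, [11, 12]))
    print(next_free_priority(existing_rules, start=10))
```

Any free value found this way is only a temporary stand-in; the priority still has to be unique across every service sharing the listener.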

When digging into the details of the failed stack update, you'll see:

[Screenshot: Screen Shot 2019-08-10 at 13 13 49]

[Screenshot: Screen Shot 2019-08-10 at 13 12 02]

Resolution 2: Temporarily use different Priority values

The solution I found is to perform at least two updates to the stack:

MikeTheCanuck commented 4 years ago

Note: something we discovered and documented via #268 is that each of the 2018 API containers being migrated to Fargate also needs to have its ecs-deploy.sh script updated to a more recent version.

This is now PR'd to the 2018 Neighborhood Development repo as https://github.com/hackoregon/neighborhoods-2018/pull/111