hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Convert 2018 Transportation Systems API service to Fargate #263

Closed MikeTheCanuck closed 5 years ago

MikeTheCanuck commented 5 years ago

Addresses #244 for the 2018 Transportation Systems API. Mirrors the work in #259 and #260, and implements changes similar to hackoregon/hackoregon-aws-infrastructure#84. Uses the migration procedure documented here.

Acceptance Criteria

Tests that will confirm the container has successfully migrated:

  1. CloudFormation will report UPDATE_COMPLETE for the enhanced stack
  2. ECS will report (a) "Launch Type FARGATE" for the service, (b) 1 Task Running at the Task level, and (c) at the Task detail level a "Started at" timestamp that is at least 5 minutes in the past (which indicates that the container stayed RUNNING long enough to pass the ALB health check cycle)
  3. latest log in the CloudWatch group for the associated service will show entries similar to the current log output:
       10.180.9.210 [09/Aug/2019:20:52:04 +0000] GET /transportation-systems/ HTTP/1.1 200 23929 - ELB-HealthChecker/2.0 0.103165
  4. browser requests to http://service.civicpdx.org/transportation-systems/odot-crash-data/participants/ will display a Swagger-schema-prettified response with count: 124691 and detailed JSON objects in the results section
  5. latest CloudWatch logs will display a recent web request to /transportation-systems/odot-crash-data/participants/ with a 200 response code
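
Criteria 2(c), 3 and 5 could be checked mechanically along the lines below. This is only a sketch: the field layout in `LOG_PATTERN` is inferred from the single sample line in criterion 3, and the helper names (`parse_log_line`, `is_healthy_elb_check`, `task_looks_stable`) are hypothetical, not part of any existing tooling in this repo.

```python
import re
from datetime import datetime, timedelta

# Layout assumed from the one sample line above:
# client IP, [timestamp], method, path, protocol, status, size, "-", user agent, duration
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) \[(?P<ts>[^\]]+)\] (?P<method>\S+) (?P<path>\S+) \S+ "
    r"(?P<status>\d{3}) (?P<size>\d+) \S+ (?P<agent>.+) (?P<seconds>[\d.]+)\s*$"
)

def parse_log_line(line):
    """Return the fields of one access-log line as a dict, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    return entry

def is_healthy_elb_check(entry):
    """True when the entry is an ELB health check that received a 200."""
    return entry["agent"].startswith("ELB-HealthChecker") and entry["status"] == 200

def task_looks_stable(started_at, now, minimum_age=timedelta(minutes=5)):
    """Criterion 2(c): the task's "Started at" time is at least 5 minutes old."""
    return now - started_at >= minimum_age

sample = (
    "10.180.9.210 [09/Aug/2019:20:52:04 +0000] GET /transportation-systems/ "
    "HTTP/1.1 200 23929 - ELB-HealthChecker/2.0 0.103165"
)
entry = parse_log_line(sample)
print(entry["path"], entry["status"], is_healthy_elb_check(entry))
```

The same `parse_log_line` would apply to criterion 5 by checking `entry["path"]` against `/transportation-systems/odot-crash-data/participants/` and `entry["status"] == 200`.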

MikeTheCanuck commented 5 years ago

First attempt to migrate to Fargate failed, for reasons that aren't obvious from any of the evidence in AWS:

Despite what is otherwise a happy-path deployment, ECS declared the containers unhealthy at least 7 times in a row and would not schedule them into service, leaving the /transportation-systems/ route and its APIs returning 503.

MikeTheCanuck commented 5 years ago

Theory: the previous failure was specifically because:

Thus, what I will try next time is to re-specify the Path for the 2017 service as /transport/*, and otherwise repeat what I tried last time. If my hypothesis holds (that the 2018 health checks were being sent to the 2017 service), then this should get us through the migration.

MikeTheCanuck commented 5 years ago

The migration continued to fail, despite the fact that request routing had been solved with a combination of distinct/non-overlapping Path variables and proper use of Listener Priority values:

This time I had enough caffeine in my bloodstream that I thought to look very closely at the timestamps:

For example:

By comparison, for the 2018 HA (Housing Affordability) service:

So what I believe I have to do - at least for now - is find a way to increase the timeout and/or reduce the minimum number of health checks, so that ECS gives the container enough time to respond to the minimum number of ALB health check requests.
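
As a rough illustration of that trade-off, here is a simplified model of how long a slow-starting container has before the load balancer gives up on it. The parameter names mirror the `HealthCheckIntervalSeconds` and `UnhealthyThresholdCount` properties of a CloudFormation ALB target group; the numbers are hypothetical, since the actual values used by this stack are not recorded in this thread, and real ALB timing is less tidy than this.

```python
def seconds_until_marked_unhealthy(interval_s, unhealthy_threshold):
    """Approximate time budget: consecutive failed checks times the check interval.

    Simplified model only; it ignores the per-check timeout and when the
    first check happens to fire relative to container start.
    """
    return interval_s * unhealthy_threshold

def survives_startup(startup_s, interval_s, unhealthy_threshold):
    """True if the container starts answering before the budget runs out."""
    return startup_s < seconds_until_marked_unhealthy(interval_s, unhealthy_threshold)

# Hypothetical numbers: a container that needs 70 s to start answering.
print(survives_startup(70, interval_s=30, unhealthy_threshold=2))  # budget 60 s
print(survives_startup(70, interval_s=30, unhealthy_threshold=3))  # budget 90 s
```

Under this model, either a longer interval or a higher unhealthy threshold widens the window; which of the two is safer for the other services on the same ALB is a separate question.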

MikeTheCanuck commented 5 years ago

Totally anecdotal, back-of-the-envelope math (comparing one "failed" sequence with two "good" sequences, one from the TS container and another from the HA container) seems to indicate that the TS container is too slow by all of ~10 seconds in responding to the first ELB health check request.

All we'd need to do is give the container another 10-15 seconds (a little safety margin in case some startups are even slower than I've noticed) before ECS deems the container unhealthy, and we could safely support the 2018TS container in its current form. Unfortunately, I don't see anything we're doing in the master.yaml or 2019-fargate-api.yaml that sounds like it would have any control over this.
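
For the record, the arithmetic behind that "10-15 seconds" looks roughly like this. The timestamps are made up for illustration (the real ones are in the CloudWatch logs), and treating the "good" sequence's delay as the allowed budget is an assumption. One setting that might be worth investigating is the ECS service's health check grace period (`HealthCheckGracePeriodSeconds` on `AWS::ECS::Service` in CloudFormation); whether these templates can expose it has not been verified here.

```python
from datetime import datetime

LOG_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"  # same format as the access-log timestamps

def seconds_between(start_ts, end_ts):
    """Elapsed seconds between two log-style timestamps."""
    start = datetime.strptime(start_ts, LOG_TS_FORMAT)
    end = datetime.strptime(end_ts, LOG_TS_FORMAT)
    return (end - start).total_seconds()

def extra_seconds_needed(observed_delay_s, allowed_delay_s, margin_s=5):
    """Shortfall plus a safety margin, or 0 if the container is already fast enough."""
    shortfall = observed_delay_s - allowed_delay_s
    return 0 if shortfall <= 0 else shortfall + margin_s

# Hypothetical: task start time vs. first 200 returned to the ELB health checker.
slow = seconds_between("09/Aug/2019:20:50:00 +0000", "09/Aug/2019:20:51:00 +0000")
good = seconds_between("09/Aug/2019:20:40:00 +0000", "09/Aug/2019:20:40:50 +0000")
print(extra_seconds_needed(slow, good))  # 10 s shortfall + 5 s margin
```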

Aside: I had a deeper look at the Dockerfile and pip install sequences for the 2018TS and 2018HA containers, and there's very little extra in the TS container - in fact the HA container has a couple of extra steps of its own. So unless there's something terrible about one of the unique packages/steps in the TS container, I can't figure out why it takes just enough extra time to get going that it falls just past the ECS healthy timeout.

znmeb commented 5 years ago

What's HA?

MikeTheCanuck commented 5 years ago

HA = Housing Affordability project. Sorry, I added an expansion above.

MikeTheCanuck commented 5 years ago

Well, I was able to get it to stabilize once more - it must be just on the threshold and a few seconds' variance seems to make the difference - so I'm going to close out this ticket, knowing that I've left copious notes for our future selves in case we have to "coddle" this container again in the future.