Convert 2018 Transportation Systems API service to Fargate

MikeTheCanuck commented 5 years ago

Addresses #244 for the 2018 Transportation Systems API. Mirrors the work in #259 and #260, and implements changes similar to hackoregon/hackoregon-aws-infrastructure#84. Uses the migration procedure documented here.

Acceptance Criteria

Tests that will confirm the container has successfully migrated:

CloudFormation will report UPDATE_COMPLETE for the enhanced stack
ECS will report (a) "Launch Type FARGATE" for the service, (b) 1 Task Running at the Task level, and (c) at the Task detail level a "Started at" timestamp that is at least 5 minutes into the past (which indicates that the container stayed RUNNING long enough to pass the ALB health check cycle)
latest log in the CloudWatch group for the associated service will show entries similar to the current log output, e.g. 10.180.9.210 [09/Aug/2019:20:52:04 +0000] GET /transportation-systems/ HTTP/1.1 200 23929 - ELB-HealthChecker/2.0 0.103165
browser requests to http://service.civicpdx.org/transportation-systems/odot-crash-data/participants/ will display a Swagger-schema-prettified response with count: 124691 and detailed JSON objects in the results section
latest CloudWatch logs will display a recent web request to /transportation-systems/odot-crash-data/participants/ with a 200 response code
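
The two log-based criteria above could be checked mechanically. The following is a minimal sketch that parses a line shaped like the sample entry and tests for a 200 response to the ALB health checker. The field layout is inferred from that one sample line, and the group names and helper functions are hypothetical, not taken from the service's logging configuration.

```python
import re
from datetime import datetime

# Field layout inferred from the single sample line in this issue; the group
# names are illustrative and not taken from the service's gunicorn config.
LOG_RE = re.compile(
    r"(?P<ip>\S+) \[(?P<ts>[^\]]+)\] (?P<method>\S+) (?P<path>\S+) "
    r"(?P<proto>\S+) (?P<status>\d{3}) (?P<size>\d+) (?P<referer>\S+) "
    r"(?P<agent>.+) (?P<secs>\d+\.\d+)$"
)


def parse_access_line(line):
    """Return a dict of fields for one access-log line, or None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    entry = m.groupdict()
    entry["status"] = int(entry["status"])
    entry["when"] = datetime.strptime(entry["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def is_passing_health_check(entry, path="/transportation-systems/"):
    """True if the entry is a 200 response to an ELB health check on the given path."""
    return (
        entry is not None
        and entry["status"] == 200
        and entry["path"] == path
        and entry["agent"].startswith("ELB-HealthChecker")
    )


sample = (
    "10.180.9.210 [09/Aug/2019:20:52:04 +0000] GET /transportation-systems/ "
    "HTTP/1.1 200 23929 - ELB-HealthChecker/2.0 0.103165"
)
entry = parse_access_line(sample)
print(is_passing_health_check(entry))
```

The same parser could be pointed at the /transportation-systems/odot-crash-data/participants/ path to confirm the browser request in the last criterion.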

MikeTheCanuck commented 5 years ago

First attempt to migrate to Fargate failed, for reasons that aren't obvious from any of the evidence in AWS:

CloudFormation declared the Update a success and completed it
ECS was able to start up the new container and test it repeatedly for health
All health check requests were met with a 200 response
no errors were seen in the CW logs

Despite what is otherwise a happy path deployment, ECS declared the containers unhealthy at least 7 times in a row and would not schedule them into service - putting the /transportation-systems/ route and APIs in a state of 503.

when I migrated back to the old EC2-based configuration, it stood up a new service with 2 containers pulled from the latest ECR image, and those containers survived the deployment and health check phase
all the same UserWarning messages in the logs are seen in both the EC2 and Fargate deployment scenarios, and no other even trivial warnings or errors are seen in the CloudWatch logs

MikeTheCanuck commented 5 years ago

Theory: the previous failure was specifically because:

we have both the 2017 "Transport" and the 2018 "Transportation-Systems" APIs in the same stack
the 2017 "Transport" service is currently configured with ListenerRules that use priority 58 & 59, and the 2018 "Transportation-Systems" service is configured with priority 38 & 39
the 2017 "Transport" service is configured to receive all requests on the service.civicpdx.org/transport Path, and the 2018 service uses service.civicpdx.org/transportation-systems/
when I perform the migration, I have to temporarily assign a new set of priorities to the new service: I comment out the 2018TS EC2-based service, add a new, temporarily-named 2018TransS Fargate-based service, and assign some unused priorities to the new listeners (otherwise the new service generally comes up faster than the old one gets destroyed, and they collide trying to assign the same priorities to two sets of listeners)
this weekend I've been just incrementing the priorities by 100 since all our priorities are currently numbered < 100
thus, when I deployed 2018TransS with e.g. Priority = 138 (at /transportation-systems/) alongside 2017TS with Priority = 58, all requests by the ALB health checker for /transportation-systems/ are being sent to the 2017 container (since its rule has the lower priority number and is evaluated first, and since "transportation-systems" matches the `/transport` pattern match)

Thus, what I will try next time is to re-specify the Path for the 2017 service as /transport/*, and otherwise repeat what I tried last time. If my hypothesis is correct that the 2018 health checks were being sent to the 2017 service, then this should get us through the migration.
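
The rule-ordering part of this theory can be sketched in a few lines. ALB listener rules are evaluated from the lowest priority number upward and the first matching rule wins. The sketch below assumes the 2017 rule behaves like a wildcard prefix (`/transport*`), which the thread implies but does not show, and uses Python's `fnmatchcase` as a stand-in for ALB path-pattern matching. The target names and the `route` helper are hypothetical.

```python
from fnmatch import fnmatchcase


def route(path, rules):
    """Return the target of the first matching rule, lowest priority number first."""
    for _priority, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):
            return target
    return "default-action"


# Assumed patterns: the thread only says the 2017 rule matched "/transport"
# loosely enough to catch "/transportation-systems/".
overlapping = [
    (58, "/transport*", "2017-transport"),
    (138, "/transportation-systems/*", "2018-transportation-systems"),
]

# Proposed fix from the comment above: narrow the 2017 Path to /transport/*.
separated = [
    (58, "/transport/*", "2017-transport"),
    (138, "/transportation-systems/*", "2018-transportation-systems"),
]

print(route("/transportation-systems/", overlapping))
print(route("/transportation-systems/", separated))
```

Under these assumptions the first call resolves to the 2017 target and the second to the 2018 target. One side effect of the narrower pattern is that a bare `/transport` request (no trailing slash) would no longer match the 2017 rule.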

MikeTheCanuck commented 5 years ago

The migration continued to fail, despite the fact that request routing had been solved with a combination of distinct/non-overlapping Path variables and proper use of Listener Priority values:

I tried 3-4 times last weekend, and every time the container would receive health check requests from ALB, but it was still getting killed off by ALB “due to failed health checks”
today I tried again and one time it succeeded - but of course, the migration sequence I use has to be done twice (to account for shared resources - Service name and Listener Priorities - that I have to temporarily rename/work around), and the second attempt failed the same way

This time I had enough caffeine in my bloodstream that I thought to look very closely at the timestamps:

turns out that the health check requests are fine (which agrees with my earlier observation that I was seeing expected ALB requests logged in associated CloudWatch logs), but that the 2018 TS container takes so long to get started that there isn’t enough time for the health check sequence to obtain its minimum number of successful results before the timeout kicks in
the 2019-fargate-api.yaml Health Check timing configuration was intentionally tightened up to reduce complaints we'd heard in the past that it took so long for newly-deployed containers to become active
it currently uses these values on the HC parameters:
HealthCheckIntervalSeconds: 10
HealthCheckTimeoutSeconds: 5
HealthyThresholdCount: 2
i.e. once the container is ready to receive health checks, the first request fires, then a second one 10 seconds later
it appears (not by any explicit configuration on my part, but rather by just adding up the timings in the logs) that under this configuration, ALB/ECS gives the container ~70-75 seconds to respond to the first health check
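
As a rough illustration of how these values interact: a target needs HealthyThresholdCount consecutive passing checks, spaced HealthCheckIntervalSeconds apart, before it counts as healthy. The sketch below computes the earliest moment that can happen given when the first passing check lands. The function name and the 60-second example input are hypothetical, and response time is ignored.

```python
def earliest_healthy_at(first_pass_seconds, interval_seconds, healthy_threshold):
    """Seconds after task start at which the last required passing check can land.

    Assumes every check from the first pass onward succeeds and ignores the
    time each check takes to respond.
    """
    return first_pass_seconds + (healthy_threshold - 1) * interval_seconds


# Values from 2019-fargate-api.yaml as quoted above.
INTERVAL = 10
THRESHOLD = 2

# Hypothetical: the first passing check lands 60 seconds after task start.
print(earliest_healthy_at(60, INTERVAL, THRESHOLD))
```

With the quoted settings, the target cannot be considered healthy until 10 seconds after its first passing check, so any startup delay eats directly into whatever window ECS allows.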

For example:

according to ECS the Task was started at 15:49:35, and ECS recorded that the task was unhealthy at 15:50:46
however, according to CloudWatch logs the first recorded output (Debug:) was at 15:50:23
then a whole mess of copy and post-processing steps were logged until 15:50:47, when the first gunicorn entry ("Starting gunicorn 19.7.0") was logged
the first ELB-HealthChecker/2.0 request wasn't logged until 15:50:58, by which time it was far too late.

By comparison, for the 2018 HA (Housing Affordability) service:

ECS Task start @ 18:48:12
gunicorn start @ 18:49:17
first ELB request @ 18:49:26, second @ 18:49:35
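
To make the two timelines easier to compare, here is a small sketch that converts the timestamps quoted above into seconds since each task's start. The event names are labels chosen for this example, and the HA first-request time is read as 18:49:26.

```python
from datetime import datetime


def offsets(task_start, events):
    """Map each event name to whole seconds elapsed since task_start (HH:MM:SS strings)."""
    t0 = datetime.strptime(task_start, "%H:%M:%S")
    return {
        name: int((datetime.strptime(stamp, "%H:%M:%S") - t0).total_seconds())
        for name, stamp in events.items()
    }


# Failed 2018 TS attempt, timestamps as quoted in this comment.
ts_failed = offsets("15:49:35", {
    "first_log_output": "15:50:23",
    "marked_unhealthy": "15:50:46",
    "gunicorn_start": "15:50:47",
    "first_elb_request": "15:50:58",
})

# Successful 2018 HA start, timestamps as quoted in this comment.
ha_ok = offsets("18:48:12", {
    "gunicorn_start": "18:49:17",
    "first_elb_request": "18:49:26",
    "second_elb_request": "18:49:35",
})

for label, timeline in (("TS (failed)", ts_failed), ("HA (ok)", ha_ok)):
    print(label, timeline)
```

On these numbers the TS task was marked unhealthy 71 seconds after start, one second before gunicorn began starting (72 s) and 12 seconds before the first health check was logged (83 s). The HA task reached gunicorn at 65 s and logged its first health check at 74 s.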

So what I believe I have to do - at least for now - is to find a way to increase the timeout and/or reduce the minimum number of health checks, so that ECS gives the container enough time to respond to the minimum number of ALB health check requests.

MikeTheCanuck commented 5 years ago

Totally anecdotal back-of-the-envelope math I'm doing by comparing one "failed" sequence with two "good" sequences (one from the TS container, another from the HA container) seems to indicate that the TS container is too slow by all of ~10 seconds in responding to the first ELB health check request.

All we'd need to do is give the container another 10-15 seconds (a little safety margin in case some startups are even slower than I've noticed) before ECS deems the container unhealthy, and we could safely support the 2018TS container in its current form. Unfortunately, I don't see anything we're doing in the master.yaml or 2019-fargate-api.yaml that sounds like it would have any control over this.
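
One setting that may be worth testing is `HealthCheckGracePeriodSeconds`, a property of `AWS::ECS::Service` that tells the ECS scheduler to ignore unhealthy load balancer health checks for a period after a task starts. Whether it is the control that matters here, and whether these templates can pass it through, is not established in this thread. The sketch below only builds the property value from the observed timings. The trimmed `service` dict and the helper name are hypothetical stand-ins for the real resource in 2019-fargate-api.yaml.

```python
import json


def with_grace_period(service_properties, seconds):
    """Return a copy of the service properties with a health check grace period set."""
    updated = dict(service_properties)
    updated["HealthCheckGracePeriodSeconds"] = seconds
    return updated


# Hypothetical, heavily trimmed stand-in for the AWS::ECS::Service properties.
service = {"LaunchType": "FARGATE", "DesiredCount": 1}

# From the failed TS timeline: first health check logged 83 s after task start.
OBSERVED_FIRST_CHECK_SECONDS = 83
# Upper end of the 10-15 second safety margin suggested above.
SAFETY_MARGIN_SECONDS = 15

patched = with_grace_period(service, OBSERVED_FIRST_CHECK_SECONDS + SAFETY_MARGIN_SECONDS)
print(json.dumps(patched, indent=2))
```

The printed fragment is JSON only because that is easy to emit from Python; in the YAML templates the equivalent would be a single `HealthCheckGracePeriodSeconds` line under the service's `Properties`.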

Aside: I had a deeper look at the Dockerfile and pip install sequences for the 2018TS and 2018HA containers, and there's very little extra in the TS container. In fact the HA container has a couple of extra steps as well, so unless there's something terrible about one or another of the unique packages/steps in the TS container, I can't figure out why it takes just enough extra time to get going that it falls just past the ECS healthy timeout.

znmeb commented 5 years ago

What's HA?

MikeTheCanuck commented 5 years ago

HA = Housing Affordability project. Sorry, I added an expansion above

MikeTheCanuck commented 5 years ago

Well, I was able to get it to stabilize once more - it must be just on the threshold and a few seconds' variance seems to make the difference - so I'm going to close out this ticket, knowing that I've left copious notes for our future selves in case we have to "coddle" this container again in the future.

hackoregon / civic-devops

Convert 2018 Transportation Systems API service to Fargate #263
