Closed MikeTheCanuck closed 5 years ago
First attempt to migrate to Fargate failed, for reasons that aren't obvious from any of the evidence in AWS:
Despite what is otherwise a happy path deployment, ECS declared the containers unhealthy at least 7 times in a row and would not schedule them into service - putting the /transportation-systems/ route and APIs in a state of 503.
Theory: the previous failure was specifically because:
- When we keep the existing `2018TS` EC2-based service and add a new, temporarily-named `2018TransS` Fargate-based service, we have to assign some unused priorities for the new listeners (otherwise the new service generally comes up faster than the old one gets destroyed, and they collide trying to assign the same priorities to two sets of listeners).
- With `2018TransS` at e.g. Priority = 138 (at /transportation-systems/) alongside `2017TS` with Priority = 58, all requests by the ALB health checker for /transportation-systems/ are being sent to the 2017 container (since it has the lower priority number, which the ALB evaluates first, and since "transportation-systems" matches the `/transport` pattern match).

Thus, what I will try next time is to re-specify the Path for the 2017 service as `/transport/*`, and otherwise try what I tried last time. If my hypothesis is right that the 2018 health checks were being sent to the 2017 service, then this should get us through the migration.
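To make the routing theory above concrete, here is a minimal sketch of priority-ordered path matching. It is a toy model, not the ALB's actual implementation: the `route` helper and the rule tuples are hypothetical, the trailing `*` on each pattern is an assumption about how the templates expand the Path value, and `fnmatchcase` only approximates ALB wildcard matching.

```python
from fnmatch import fnmatchcase

# Hypothetical rule sets: (priority, path pattern, target service).
# The priorities come from the notes above; the wildcard patterns are assumed.
ORIGINAL_RULES = [
    (138, "/transportation-systems*", "2018TransS"),
    (58, "/transport*", "2017TS"),
]
FIXED_RULES = [
    (138, "/transportation-systems*", "2018TransS"),
    (58, "/transport/*", "2017TS"),
]


def route(path, rules):
    """Return the target of the first matching rule, lowest priority number first."""
    for _priority, pattern, target in sorted(rules):
        if fnmatchcase(path, pattern):
            return target
    return None


print(route("/transportation-systems/", ORIGINAL_RULES))  # 2017TS
print(route("/transportation-systems/", FIXED_RULES))     # 2018TransS
print(route("/transport/anything", FIXED_RULES))          # 2017TS
```

Under these assumptions, narrowing the 2017 pattern to `/transport/*` is enough to stop it from capturing `/transportation-systems/` requests, without changing either priority.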
The migration continued to fail, despite the fact that request routing had been solved with a combination of distinct/non-overlapping `Path` variables and proper use of `Listener Priority` values:
This time I had enough caffeine in my bloodstream that I thought to look very closely at the timestamps:
The `2019-fargate-api.yaml` Health Check timing configuration was intentionally tightened up to reduce complaints we'd heard in the past that it took so long for newly-deployed containers to become active.

For example:
- The `Debug:` log entry was at 15:50:23
- `Starting gunicorn 19.7.0` was logged
- An `ELB-HealthChecker/2.0` request wasn't logged until 15:50:58, by which time it was far too late.

By comparison, for the 2018 HA (Housing Affordability) service:
So what I believe I have to do, at least for now, is find a way to increase the timeout and/or reduce the minimum number of health checks, so that ECS gives the container enough time to respond to the minimum number of ALB health check requests.
Totally anecdotal back-of-the-envelope math, comparing one "failed" sequence with two "good" sequences (one from the TS container, another from the HA container), seems to indicate that the TS container is too slow by all of ~10 seconds in responding to the first ELB health check request.
All we'd need to do is give the container another 10-15 seconds (a little safety margin in case some startups are even slower than I've noticed) before ECS deems the container unhealthy, and we could safely support the 2018TS container in its current form. Unfortunately, I don't see anything we're doing in the `master.yaml` or `2019-fargate-api.yaml` that sounds like it'd have any control over this.
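The back-of-the-envelope timing above can be sketched as follows. This is a deliberately simplified model, not how ECS or the ALB actually decide health: it assumes a container is given up on after a grace period plus `interval * unhealthy_threshold` seconds. The helper names and the interval, threshold and grace values are all hypothetical, since the real settings aren't quoted here; only the two timestamps come from the logs. (ECS services do have a health check grace period setting, `HealthCheckGracePeriodSeconds` in CloudFormation, but whether these templates surface it is not established in this thread.)

```python
from datetime import datetime


def seconds_between(start, end, fmt="%H:%M:%S"):
    """Elapsed seconds between two same-day clock times."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


def unhealthy_deadline(grace, interval, unhealthy_threshold):
    """Simplified model: seconds after start at which the container is given up on."""
    return grace + interval * unhealthy_threshold


def has_enough_time(first_response, grace, interval, unhealthy_threshold):
    """True if the first health check response lands before the modelled deadline."""
    return first_response <= unhealthy_deadline(grace, interval, unhealthy_threshold)


# Timestamps from the failed sequence: the Debug entry and the ELB request.
first_response = seconds_between("15:50:23", "15:50:58")
print(first_response)  # 35.0

# Hypothetical settings: a 25-second window misses by 10 seconds ...
print(has_enough_time(first_response, grace=0, interval=5, unhealthy_threshold=5))   # False
# ... and 15 extra seconds of grace would cover it with a small margin.
print(has_enough_time(first_response, grace=15, interval=5, unhealthy_threshold=5))  # True
```

With numbers like these, a few seconds of startup variance is enough to flip the result, which would be consistent with the container sometimes stabilizing and sometimes not.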
Aside: I had a deeper look at the `DOCKERFILE` and `pip install` sequences for the 2018TS and 2018HA containers, and there's very little extra in the TS container. In fact, the HA container has a couple of extra steps as well, so unless there's something terrible about one or another of the unique packages/steps being done in the TS container, I can't figure out why it takes just enough extra time to get going that it falls just past the ECS healthy timeout.
What's HA?
HA = Housing Affordability project. Sorry, I added an expansion above.
Well, I was able to get it to stabilize once more - it must be just on the threshold and a few seconds' variance seems to make the difference - so I'm going to close out this ticket, knowing that I've left copious notes for our future selves in case we have to "coddle" this container again in the future.
Addresses #244 for the 2018 Transportation Systems API. Mirrors the work in #259 and #260, and implements similar changes as hackoregon/hackoregon-aws-infrastructure#84. Uses the migration procedure documented here.
Acceptance Criteria
Tests that will confirm the container has successfully migrated: