Convert Endpoint Service Catalog to Fargate

MikeTheCanuck commented 4 years ago

This should be a trivial operation - it's a single-page nginx app after all - but of course there are complications.

Related to #244 but will require special work one way or another.

MikeTheCanuck commented 4 years ago

I attempted to start out with the bog-standard 2019-fargate-api.yaml template that all other current API containers use to deploy to ECS.

However, we ran into a neat little catch-22:

this app answers to service.civicpdx.org/ (i.e. the default route at that URL), so it cannot use Path: /* for routing - because that would then override the sub-routing we're doing for each of the API services (e.g. Budget has Path: /budget/*)
but we can't just use Path: / either, because it's a well-structured SPA with a separate directory named __assets for its non-HTML files, which requires that we also implement a separate Listener that routes __assets/* to the same container (or rather, a pair of Listeners, one each for http:// and https:// traffic)
with four Listeners and two Paths, we can't use the standard template
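
To make the catch-22 above concrete, here is a rough Python sketch of how path-based rules get evaluated. It is only an approximation: it assumes rules are checked lowest priority number first, it uses fnmatchcase as a stand-in for ALB's * and ? wildcards, and the priorities and target names ("catalog", "budget") are hypothetical rather than taken from our templates.

```python
from fnmatch import fnmatchcase

NO_MATCH = "default action"  # what the listener does when no rule matches


def route(path, rules):
    """Return the target of the first matching rule, lowest priority number first."""
    for _priority, pattern, target in sorted(rules):
        # fnmatchcase is case-sensitive and treats * and ? as wildcards,
        # which is close enough to ALB path patterns for these examples.
        if fnmatchcase(path, pattern):
            return target
    return NO_MATCH


# Priorities and target names below are hypothetical.
catch_all_first = [(1, "/*", "catalog"), (2, "/budget/*", "budget")]
root_only = [(1, "/budget/*", "budget"), (2, "/", "catalog")]
two_paths = root_only + [(3, "/__assets/*", "catalog")]

for name, rules in [("catch_all_first", catch_all_first),
                    ("root_only", root_only),
                    ("two_paths", two_paths)]:
    for path in ["/", "/budget/api", "/__assets/index.css"]:
        print(f"{name:16} {path:22} -> {route(path, rules)}")
```

In this sketch a /* rule that is evaluated ahead of the API rules swallows /budget/ traffic, a bare / rule leaves the __assets requests unmatched, and only the two-Path variant serves both.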

So I attempted to refactor the standard template into a new one that implemented the additional Listeners and Path, using the lessons of the existing service.yaml that we use currently to configure the Endpoints service on EC2.

I got the new template to the point where everything looks like it should work, but in the end CloudFormation isn't able to consider the deployment successful because the containers continually report "Task has exited" prematurely, with the error CannotPullContainerError: Error response from daemon: Get https://845828040396.dkr.ecr.us-west-2.amazsonaws.com/v2/: Unable to connect.

This error is new to us, so a little google-fu and we run across many explanations similar to this one: https://github.com/aws/amazon-ecs-agent/issues/1128

Which basically tells us that for whatever reason, the Task is not able to reach ECR over the Internet - i.e. there's a routing or SecurityGroup limitation so that all ECR download requests go unanswered.

MikeTheCanuck commented 4 years ago

Further:

I tried using the 2019-fargate-api.yaml template - the one that has already worked fine for a large number of Fargate migrations, and which has never encountered the dreaded CannotPullContainerError, let alone one due to ECR connectivity issues. NO DICE - it produces the exact same error. OK, so then this is not template-specific.
I tried deploying a second container instance to one of our existing Fargate-deployed services. If there was an issue with that template or its dependencies in not being able to reach ECR, then spinning up a second container instance for one of the other services should fail just as hard, yes? Nope, no problem.

MikeTheCanuck commented 4 years ago

It turns out this is a test. Did you see the error above? The answer is up there in black and white, and you'll know it when you see it.

....spoilers ahead....

Did you see it?

CannotPullContainerError: Error response from daemon: Get https://845828040396.dkr.ecr.us-west-2.amazsonaws.com/v2/: Unable to connect

No? Well then you're no worse than me at this.

Look closer: amazsonaws.com

Yep, somehow I slipped another character in, and the only way I noticed it was by pasting, then Cmd-Z'ing, and back and forth, until I noticed the reason why the EcrImage parameter was off by one character in length.
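
A check like the following might have caught this before the deploy. It's a minimal sketch, assuming the usual <12-digit account>.dkr.ecr.<region>.amazonaws.com registry host format; it deliberately ignores the China and FIPS endpoint variants, and the repository name in the example is hypothetical.

```python
import re

# Assumed host shape: <12-digit account>.dkr.ecr.<region>.amazonaws.com
# (does not cover amazonaws.com.cn or ecr-fips endpoints)
ECR_HOST = re.compile(r"\d{12}\.dkr\.ecr\.[a-z0-9-]+\.amazonaws\.com")


def ecr_host_looks_valid(image_uri: str) -> bool:
    """Check only the registry host portion of an image URI."""
    host = image_uri.split("/", 1)[0]
    return ECR_HOST.fullmatch(host) is not None


# "endpoint-service-catalog:latest" is a hypothetical repository and tag.
good = "845828040396.dkr.ecr.us-west-2.amazonaws.com/endpoint-service-catalog:latest"
typo = "845828040396.dkr.ecr.us-west-2.amazsonaws.com/endpoint-service-catalog:latest"

print(ecr_host_looks_valid(good))
print(ecr_host_looks_valid(typo))
```

Running something like this against the EcrImage parameter before kicking off the stack would flag the extra character without needing to eyeball the URL.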

MikeTheCanuck commented 4 years ago

Once this was conquered, I then attempted to deploy a flattened version of the SPA, where all files were housed at the root route, so that rather than requesting service.civicpdx.org/__assets/index.css, we could instead request service.civicpdx.org/index.css - and thus be able to skip the additional unique configuration of a second set of Listeners for /__assets and the associated Path.

But alas, this is not to be. When we route only / to the container rather than /*, that literally does mean we only route service.civicpdx.org/ - any request for service.civicpdx.org/index.css (not to mention even service.civicpdx.org/index.html - which is the only resource that nginx is magically passing back to the requestor) gets "blocked" by ALB with a 503 response.

As I've said before, if we were to try to capture all requests for just the files at the root of the / route, there's no easy way to do that. (It even occurs to me that we could rename all the assets to a variant of /index*, e.g. index.html, index.css, index.svg, and then set up a Path route for /index* to be sent to this container - but even that is just as unnecessarily brittle as the solution we have that works.)
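
For what it's worth, here is a small Python sketch of why the /index* idea feels brittle. Again it uses fnmatchcase as an approximation of ALB wildcard matching, and /favicon.ico and /indexer/api are hypothetical paths that exist only for illustration.

```python
from fnmatch import fnmatchcase

PATTERN = "/index*"


def matches(path):
    """True if the path would be caught by the /index* pattern."""
    return fnmatchcase(path, PATTERN)


# The last two paths are hypothetical, added only for illustration.
paths = ["/index.html", "/index.css", "/index.svg", "/",
         "/favicon.ico", "/indexer/api"]

for path in paths:
    print(f"{path:14} {'matched' if matches(path) else 'not matched'}")
```

Under these assumptions the renamed assets are caught, but the bare / request still needs its own Path, any asset that doesn't follow the naming convention falls through, and an unrelated route that happens to start with /index would be captured too.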

So I'm back to suffering with a one-off template (fargate-endpoints-catalog.yaml) that will likely only ever be used for one resource in all our assets on this HackOregon "stack". And while that works fine, it's definitely going to be a stumbling block for a future maintainer of this project (even future me is likely to get tripped up again).

hackoregon / civic-devops

Convert Endpoint Service Catalog to Fargate #280