hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Convert Endpoint Service Catalog to Fargate #280

Closed: MikeTheCanuck closed this issue 4 years ago

MikeTheCanuck commented 4 years ago

This should be a trivial operation - it's a single-page nginx app after all - but of course there are complications.

Related to #244 but will require special work one way or another.

MikeTheCanuck commented 4 years ago

I attempted to start out with the bog-standard 2019-fargate-api.yaml template that all other current API containers use to deploy to ECS.

However, we ran into a neat little catch-22:

So I attempted to refactor the standard template into a new one that implemented the additional Listeners and Path, using the lessons of the existing service.yaml that we use currently to configure the Endpoints service on EC2.

I got the new template to the point where everything looked like it would work, but in the end CloudFormation isn't able to consider the deployment successful because the containers continually report "Task has exited" prematurely, with the error CannotPullContainerError: Error response from daemon: Get https://845828040396.dkr.ecr.us-west-2.amazsonaws.com/v2/: Unable to connect.

This error is new to us, so a little google-fu and we run across many explanations similar to this one: https://github.com/aws/amazon-ecs-agent/issues/1128

Which basically tells us that for whatever reason, the Task is not able to reach ECR over the Internet - i.e. there's a routing or SecurityGroup limitation so that all ECR download requests go unanswered.

MikeTheCanuck commented 4 years ago

Further:

MikeTheCanuck commented 4 years ago

It turns out this is a test. Did you see the error above? The answer is up there in black and white, and you'll know it when you see it.

....spoilers ahead....

Did you see it?

CannotPullContainerError: Error response from daemon: Get https://845828040396.dkr.ecr.us-west-2.amazsonaws.com/v2/: Unable to connect

No? Well then you're no worse than me at this.

Look closer: amazsonaws.com

Yep, somehow I slipped another character in, and the only way I noticed it was by pasting, then Cmd-Z'ing, and back and forth, until I noticed the reason why the EcrImage parameter was off by one character in length.
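For future me, a pre-deploy sanity check on the EcrImage value would have caught this before CloudFormation ever ran. The following is only a minimal sketch: it assumes the usual private-registry hostname shape of a 12-digit account ID followed by .dkr.ecr.<region>.amazonaws.com (it does not handle other partitions such as amazonaws.com.cn), and both the function name check_ecr_image and the repository name example-repo are hypothetical, not taken from our templates.

```python
import re

# Assumed shape of a private ECR image URI:
#   <12-digit account>.dkr.ecr.<region>.amazonaws.com/<repo>[:<tag>]
# Other AWS partitions (e.g. amazonaws.com.cn) are deliberately not covered.
ECR_IMAGE_RE = re.compile(
    r"^(?P<account>\d{12})\.dkr\.ecr\.(?P<region>[a-z0-9-]+)\.amazonaws\.com"
    r"/(?P<repo>[a-z0-9._/-]+)(?::(?P<tag>[\w.-]+))?$"
)


def check_ecr_image(uri: str) -> bool:
    """Return True if uri looks like a well-formed private ECR image URI."""
    return ECR_IMAGE_RE.match(uri) is not None


if __name__ == "__main__":
    # "example-repo:latest" is a made-up repository and tag for illustration.
    good = "845828040396.dkr.ecr.us-west-2.amazonaws.com/example-repo:latest"
    typo = "845828040396.dkr.ecr.us-west-2.amazsonaws.com/example-repo:latest"
    for uri in (good, typo):
        print(uri, "->", "ok" if check_ecr_image(uri) else "MALFORMED")
```

A check like this only validates the shape of the string; it says nothing about whether the repository or tag actually exists in ECR.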

MikeTheCanuck commented 4 years ago

Once this was conquered, I then attempted to deploy a flattened version of the SPA, where all files were housed at the root route, so that rather than requesting service.civicpdx.org/__assets/index.css, we could instead request service.civicpdx.org/index.css - and thus be able to skip the additional unique configuration of a second set of Listeners for /__assets and the associated Path.

But alas, this is not to be. When we route only / to the container rather than /*, that literally does mean we only route service.civicpdx.org/ - any request for service.civicpdx.org/index.css (not to mention even service.civicpdx.org/index.html - which is the only resource that nginx is magically passing back to the requestor) gets "blocked" by ALB with a 503 response.

As I've said before, if we were to try to capture all requests for just the files at the root of the / route, there's no easy way to do that. (It even occurs to me that we could rename all the assets to a variant of /index* e.g. index.html, index.css, index.svg, and then set up a Path route for /index* to be sent to this container - but even that is just as unnecessarily brittle as the solution we have that works).
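To make the difference between these Path patterns concrete, here is a rough Python approximation of how ALB path-pattern conditions match. It assumes the documented wildcard behaviour (* matches zero or more characters, ? matches exactly one, everything else is literal and case-sensitive); the helper name alb_path_matches is hypothetical and this is not ALB's actual implementation.

```python
import re


def alb_path_matches(pattern: str, path: str) -> bool:
    """Approximate an ALB path-pattern condition against a request path.

    Assumption: '*' matches zero or more characters, '?' matches exactly
    one character, and all other characters match literally.
    """
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, path) is not None


if __name__ == "__main__":
    patterns = ("/", "/*", "/__assets/*", "/index*")
    paths = ("/", "/index.html", "/index.css", "/__assets/index.css")
    for pattern in patterns:
        matched = [p for p in paths if alb_path_matches(pattern, p)]
        print(f"{pattern!r} matches {matched}")
```

Under this approximation the bare / pattern matches only the root path itself, which is consistent with what we saw: requests for /index.css or /index.html match no rule for this container and never reach nginx. The /index* variant would catch the renamed assets, but it would also catch any other path that happens to start with /index.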

So I'm back to suffering with a one-off template (fargate-endpoints-catalog.yaml) that will likely only ever be used for one resource in all our assets on this HackOregon "stack". And while that works fine, it's definitely going to be a stumbling block for a future maintainer of this project (even future me is likely to get tripped up again).