hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

Enable Fargate deployment in HackOregon CF stack for new docker containers #238

Closed MikeTheCanuck closed 4 years ago

MikeTheCanuck commented 5 years ago

Summary

Docker containers for 2019 projects need a means of deployment. We will take this opportunity to invest in adding Fargate to our CloudFormation stack.

Impact

Goals addressed: S2, S4

Tasks

Definition of Done

At least one of the 2019 containers deploys successfully to ECS as a Fargate service.

MikeTheCanuck commented 5 years ago

Response to Nathan's inquiry on Slack https://civicsoftware.slack.com/archives/CFTR2UVAP/p1559403988010600:

We have some work ahead of us to be able to extend on the ECS deployment. Our goal is to migrate new containers to Fargate (which have theoretically infinite CPU & memory resources), as opposed to the current EC2-based approach - so that we can stop spending so much precious time hand-curating the balance of resources, and allocation of containers, between the EC2 hosts (which have finite resources). We spend more and more time each year making sure that a new container “fits in” with the current stack, because if something goes wrong during a deployment and all container instances get allocated to only one of our two hosts, there’s literally not enough room (i.e. RAM) for them all to run at once.

First major goal along the way is to be able to deploy and troubleshoot with a second CloudFormation stack, which I’m working on presently. Next is to figure out how to add a Fargate-hosted container to our CloudFormation “stack” (suite of YAML templates), and then we’ll get the ECR, creds and SSM parameters instantiated for the 2019 containers.
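
For context, our parent template pulls each service in as a nested AWS::CloudFormation::Stack resource, so a Fargate-hosted container would presumably get wired in the same way. A rough sketch of what such a nested-stack entry could look like - the template URL, parameter names, and referenced outputs below are illustrative guesses, not our actual code:

```yaml
# Hypothetical nested-stack entry for a Fargate-hosted 2019 service.
# TemplateURL, parameter names, and referenced outputs are assumptions.
Example2019FargateService:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: https://s3.amazonaws.com/example-bucket/services/example-2019-service/service-fargate.yaml
    Parameters:
      VPC: !Ref VPC
      Cluster: !GetAtt ECS.Outputs.Cluster
      Listener: !GetAtt ALB.Outputs.Listener
      Path: /example-2019-service*
      DesiredCount: 1
```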

I’ve solved the first blocking issue here: https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/62

And now I’m on to the next blocking issue here: https://github.com/hackoregon/hackoregon-aws-infrastructure/issues/63

Which may also require solving this issue, which I’m working on now: https://github.com/hackoregon/hackoregon-aws-infrastructure/issues/64

MikeTheCanuck commented 5 years ago

Status update: I'm running into repeated rollbacks when deploying the proposed stack with the changes from infra issues 62 and 64 included.

So I decided to go back to the current code in master, which is what succeeded a week ago in deploying a 2nd test stack: https://github.com/hackoregon/hackoregon-aws-infrastructure/tree/e82b59880c65adfce767bd6e6ca89ca773416792

Result? Even that failed while creating four of the Services - transportService, EmerreponseService, HomelessService and Civic2018Service - at which point CF cancelled the creation of the other resources (BudgetService, Civic2017Service, 2018DR, 2018TS, HousingService, 2018LE, 2018HA, 2018ND, Endpoints). A typical error: Embedded stack arn:aws:cloudformation:us-west-2:845828040396:stack/miketesting20190601-1433-transportService-5B40OCZ6GYP2/29478ca0-84b6-11e9-a632-02ee71065ed8 was not successfully created: The following resource(s) failed to create: [ServiceRole].

MikeTheCanuck commented 5 years ago

Today I tried again with the set of templates in the master branch. This time it failed on create for four Services again, three of them the same as yesterday's attempt (transportService, EmerresponseService, Civic2018Service), but the fourth this time around was HousingService.

I'm drawing a couple of conclusions from this:

What's confusing is that the same templates - or something very closely related - deployed successfully a week ago. I was able to stand up a second stack in the same account alongside the first stack - and using the same configuration - so the templates themselves can't be the cause of this problem.

I'm a bit stumped, and don't have the time to invest in a thorough scientific examination of every piece of this stack. It has worked in past years, and in the past week, and I can't believe some AWS dependency has suddenly changed such that this whole stack no longer works.

Couple of ideas I'm toying with:

This could be very time-consuming, and it's time I can hardly afford to spend, so I'm going to let these ideas bake in the back of my mind and evaluate where to get the best return on time investment.

MikeTheCanuck commented 5 years ago

I’ve spent some time this morning trying again to get the CF stack to deploy successfully. I've been trying to get this working for the past two weekends, and there's a lot of variability in what fails to get created, which has been frustrating.

The common thread I notice across today's failures/rollbacks is that CF fails to create the ServiceRole for each of the Services (i.e. the ECS objects that run one or more tasks to house the containers of interest).

Looking at the code, ServiceRole creates an AWS object of Type: AWS::IAM::Role. Two possibilities come to mind that could explain why creating the ServiceRole now fails when it wasn't a problem a couple of weeks ago:

  1. My IAM permissions have somehow changed, and I no longer have the ability to create AWS::IAM::Role. This seems unlikely - I’d heard no such change, and generally @michael has only added permissions where required, but I need to call this out as it would certainly cause such failures.
  2. The other thing that occurs to me this morning is name-length constraints - I recall that we'd had problems in the past with the names of the Services themselves and had to shorten them, and I found this doc that outlines the limits: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_iam-limits.html. Role Names are max 64 characters, but there's an unexpected mention that "then the combined Path and RoleName cannot exceed 64 characters".

I don't know how long the Path prefix is here, but I know that the stack names I've been creating for these recent tests are ~24 characters, and the Role Name specified in the CF code is ecs-service-${AWS::StackName} - where ecs-service- is either a variable (i.e. the service name) or the literal string ecs-service- - so possibly another 12 characters or more. So if the Path is somehow 28 characters or more, then we'd be colliding with the name limit.
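
For reference, the naming pattern in question looks roughly like this (a sketch of the pattern described above, not a copy of the actual template):

```yaml
# Sketch of the ServiceRole naming pattern under suspicion; everything other
# than the RoleName expression is omitted or assumed.
ServiceRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub ecs-service-${AWS::StackName}   # grows with the stack name
    # ... assume-role policy and inline policies omitted ...
```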

The shortest stack name whose stack creation rolled back due to a ServiceRole creation failure is miketestingsundayaft (20 characters), and the longest I know for sure has worked all along is hacko-integration (17 characters), so I'll try a test now using the stack name test, just to account for any wildly new constraints since our working stack was instantiated a couple of years ago.

MikeTheCanuck commented 5 years ago

Further research while I'm waiting for the stack to build:

Looked in IAM for the existing services, and I see our existing stack's service roles are named as follows:

arn:aws:iam::845828040396:role/ecs-service-hacko-integration-Civic2018Service-7ORXOJDUQ5MW

(so ecs-service is a string, not a variable - duh)

And if I count the characters in just the name of the role, that comes out to 59 characters - so hacko-integration + 5 more characters is the upper bound of the stack name size. Thus stacks with names such as miketesting20190608-1105 (24 characters) are definitely going to cause problems. Lesson learned.
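
Working the budget out explicitly from that ARN:

```
ecs-service-          12 characters
hacko-integration     17 characters  (stack name)
-Civic2018Service-    18 characters  (service logical ID plus separators)
7ORXOJDUQ5MW          12 characters  (CloudFormation's random suffix)
                     ----
                      59 of the 64-character IAM role-name limit,
                         leaving room for only 5 more stack-name characters
```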

MikeTheCanuck commented 5 years ago

Well damn, score another point against Mike’s propensity for semantically-meaningful trails and clues for others to follow…

Looks like my overly-descriptive stack names were what caused the repeated failures to stand up a test stack with the current code. Stack creation is still a bit finicky, but it can definitely succeed with the current code if I use a short stack name.

Edit: digging deeper into the stack rollbacks and finding a nested "stack" e.g. Emerresponseservice that blew up during these experiments, I discovered this buried error message that explicitly accounts for the problem: 1 validation error detected: Value 'ecs-service-miketesting20190602-1408-EmerreponseService-ANL3Q2UJ8SAB' at 'roleName' failed to satisfy constraint: Member must have length less than or equal to 64 (Service: AmazonIdentityManagement; Status Code: 400; Error Code: ValidationError; Request ID: 9dcc78d5-95aa-42ff-a39d-54eac01a97bf)

Another problem in some stack creation attempts was with the VPC creation, so we're not totally out of the woods yet, but it's time to see about adding a Fargate resource and see if that works.

MikeTheCanuck commented 5 years ago

Progress: I've been able to integrate and properly configure a Fargate service into the CF stack and have the whole stack successfully deploy.
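
For reference, the main Fargate-specific bits are the launch type and the awsvpc network configuration; roughly the shape of the new service looks like this (logical names and property values here are illustrative, not the exact template):

```yaml
# Sketch of a Fargate-launched ECS service; names and values are illustrative.
Service:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref Cluster
    LaunchType: FARGATE            # rather than the EC2-backed default
    DesiredCount: 1
    TaskDefinition: !Ref TaskDefinition
    NetworkConfiguration:          # Fargate tasks require awsvpc networking
      AwsvpcConfiguration:
        Subnets: !Ref Subnets
        SecurityGroups:
          - !Ref ContainerSecurityGroup
```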

This means that from here on out, we're less concerned with "can we get a Fargate container deployed to AWS" and more about two questions:

So far the second issue is more immediate, since we can see from CloudWatch logs that the existing container I chose for this service isn't happy: Unable to locate credentials. You can configure credentials by running "aws configure".

Which leads to django.core.exceptions.ImproperlyConfigured: The SECRET_KEY setting must not be empty.

MikeTheCanuck commented 5 years ago

#241 addressed the stack-wide problem of containers not being authorized to read from SSM.

Unfortunately, that hasn't resolved the problem of "unable to locate credentials" in the Fargate task's container.

The major difference I noticed between the Fargate template we got from our friends at AWS Labs and the existing EC2 task's templates is the lack of a ServiceRole, and I have a feeling that even though the ECS cluster has the necessary access, there's a strong possibility that the individual ECS tasks don't get that security context passed in without some explicit work at the task level.

I notice that the ServiceRole definition in the budget-service task's template is preceded by a comment that states:

This IAM Role grants the service access to register/unregister with the Application Load Balancer (ALB). It is based on the default documented here: http://docs.aws.amazon.com/AmazonECS/latest/developerguide/service_IAM_role.html

I haven't had a chance to dig into this further and see if this is the only authorization granted here, or if there's a whole lot more that no one in HackOregon history has had need to understand. It would be pretty surprising if the ec2:* Actions authorized there weren't needed for something.
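
For reference, the documented default that comment points to is a service role shaped roughly like this (my paraphrase of the documented policy, not a copy of our budget-service template):

```yaml
# Rough shape of the documented ECS service role referenced above; the
# action list paraphrases the AWS-documented default, not our template.
ServiceRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ecs.amazonaws.com
          Action: sts:AssumeRole
    Policies:
      - PolicyName: ecs-service
        PolicyDocument:
          Statement:
            - Effect: Allow
              Resource: "*"
              Action:
                - ec2:AuthorizeSecurityGroupIngress
                - ec2:Describe*
                - elasticloadbalancing:DeregisterInstancesFromLoadBalancer
                - elasticloadbalancing:DeregisterTargets
                - elasticloadbalancing:Describe*
                - elasticloadbalancing:RegisterInstancesWithLoadBalancer
                - elasticloadbalancing:RegisterTargets
```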

Couple of ways to experiment and narrow in on the source of the problem:

If this doesn't surface any leads, then we're going to have to dig further into the differences in security context between an EC2 task and a Fargate task, and figure out what makes it possible for aws ssm get-parameters to work fine in an EC2 task but fail in a Fargate task. There's no way that Fargate requires its users to hard-code security credentials into a container - Fargate's meant to require less overhead than EC2-based ECS, not more - but it's possible that Fargate imposes a different security model, or has constraints that are documented somewhere that mean we have to take a different approach.

MikeTheCanuck commented 5 years ago

Does the Fargate task have any IAM assumed role? #241 shows its error as:

An error occurred (AccessDeniedException) when calling the GetParameters operation: User: arn:aws:sts::845828040396:assumed-role/miketesting20190527-ECSRole-us-west-2/i-081cc40f40abe42cd is not authorized to perform: ssm:GetParameters

...so the fix for this error in https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/66 was intended to ensure that any ECS task's assumed role would inherit that SSM policy.

But what if there's NO role attached to the task, to which AWS could attach the policy?

And what if the Fargate-based task's definition started with a ServiceRole definition that grants no privileges, but is just there as an empty shell that the SSM policy can be attached to? Let's try that.
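
A sketch of that experiment - a role that grants nothing on its own and exists only to give the task an identity the SSM policy can attach to (the logical names and the managed-policy reference are my guesses for illustration):

```yaml
# Hypothetical "empty shell" task role for the experiment described above.
# It grants no permissions itself; the shared SSM read policy is attached to it.
FargateTaskRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service: ecs-tasks.amazonaws.com   # Fargate tasks assume roles via ecs-tasks
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - !Ref SSMReadPolicy    # assumed name for the SSM policy added in PR 66
```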

DingoEatingFuzz commented 5 years ago

We learned last week that the way to handle AWS roles when migrating from EC2 tasks to Fargate tasks is to assign a TaskExecutionRole to the task definition directly (not the service or the cluster). This represents the AWS user that the container gets.
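
In CloudFormation terms that means setting the role ARNs on the AWS::ECS::TaskDefinition itself. One nuance worth noting: ECS distinguishes the execution role (used by ECS to pull the image and ship logs) from the task role (the credentials the application code sees when it calls aws ssm get-parameters), so a sketch of the wiring would set both (logical names below are assumptions, not our template):

```yaml
# Sketch of attaching roles at the task-definition level, per the comment above.
TaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: example-2019-service
    RequiresCompatibilities:
      - FARGATE
    NetworkMode: awsvpc
    Cpu: "256"
    Memory: "512"
    # Used by ECS itself, e.g. to pull the image from ECR and write logs.
    ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn
    # The identity the running container assumes - this is what needs
    # ssm:GetParameters for the credential lookup to succeed.
    TaskRoleArn: !GetAtt FargateTaskRole.Arn
    ContainerDefinitions:
      - Name: example-2019-service
        Image: !Ref ImageUrl
```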

MikeTheCanuck commented 4 years ago

Considering this closed with PR https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/71.

Residual issues to clean up, none of which contradict the goal of this ticket - to create a CloudFormation enhancement that enables successful deployment of containers on Fargate:

#254 #255