Convert 2017 & 2018 API services from EC2 to Fargate

MikeTheCanuck commented 5 years ago

Summary

Migrate all existing containers to Fargate

Definition of Done

All existing containers are running as Fargate tasks in the existing infrastructure.

MikeTheCanuck commented 5 years ago

I've been working the problem of making Fargate work in our existing CF stack (#238) and have it to the point where it appears that the container deploys and starts up as an ECS task, and is able to configure itself using remote resources (e.g. SSM parameters).

Next is to come at it from the opposite angle - can we host a known-good container as a Fargate task and have it successfully respond to outside requests? To weave these two together, the best approach is to try converting an existing EC2-based task to Fargate. I've picked the Budget 2017 API container as (a) I know how it's supposed to work, having been on the team and (b) it's very unlikely to be getting traffic so downtime is at its most tolerable.

I've adapted what I believe to be a working Fargate template to the Budget-Service and at this point it appears that once the container is up and running, ALB's health check tests are returning a 400 error:

service hacko-integration-FargateBudget-1JWOQ5F2TR1R1-Service-5BYKU2RAR8PL (port 8000) is unhealthy in target-group hacko-Targe-1VL2ENM8F80KM due to (reason Health checks failed with these codes: [400]).

MikeTheCanuck commented 5 years ago

I had a theory that this was a 400 error due to the Security Groups configuration only allowing access to ports 80 and 443, but not to the port (8000) on which the containerized app is listening - and to which the Health Check must be able to connect to verify that the container is ready to take requests.

The Security Group was explicitly configured on the Fargate examples I modelled this after (but I didn't know why, just thought "there must be some reason why Fargate requires this"). So I tried commenting out all references in the master and the task templates to the Security Group.

That still didn't result in a container that was deemed "healthy" even if it was deployed into the cluster and even though according to CloudWatch logs the application in the container has completed all its needed startup (i.e. I can see no errors in the logs, and I see that the app is Listening at: http://0.0.0.0:8000 (10)). However this time ECS reports a different error when deregistering the task: service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Request timed out).

So I'm going to dig deeper into the Security Group configuration and ensure it's explicitly allowing incoming traffic on the port(s) that the container is configured to listen on.

MikeTheCanuck commented 5 years ago

Note: I'm currently working from this experimental branch (i.e. I don't expect to PR this or merge to master): https://github.com/hackoregon/hackoregon-aws-infrastructure/tree/civic-devops-244

MikeTheCanuck commented 5 years ago

I've uncommented the Security Group references, generated a new security group, and explicitly granted access to the following port combos (from:to):

80:80
443:443
80:8000
8000:8000

...and we're back to the original issue - i.e. the Security Group is getting created as well as it was last week, but the extra port combos doesn't solve the "400" problem: service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Health checks failed with these codes: [400]).

The very first hit on that error message leads me back to this article that got me started down this road, so I'll take another stab at other possibilities: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

MikeTheCanuck commented 5 years ago

Assumption: "reason Health checks failed with these codes: [400]" is indicating that some listener in the cluster responded with an HTTP 400 response code ("Bad request").

We can eliminate the container application itself, since I can verify from CloudWatch logs that the container never records an incoming HTTP request - the final entries in the CloudWatch log for each instance of this "unhealthy" container are:

126 static files copied to '/code/staticfiles', 126 post-processed.
[2019-06-29 21:02:13 +0000] [10] [INFO] Starting gunicorn 19.7.1
[2019-06-29 21:02:13 +0000] [10] [INFO] Listening at: http://0.0.0.0:8000 (10)
[2019-06-29 21:02:13 +0000] [10] [INFO] Using worker: gevent
[2019-06-29 21:02:13 +0000] [13] [INFO] Booting worker with pid: 13
/code/budget_proj/wsgi.py:19: RuntimeWarning: Patching more than once will result in the union of all True parameters being patched
from gevent import monkey; monkey.patch_all(thread=False)

Which are the same as for a healthy instance of this same container without the addition of any entries like the following:

10.180.21.196 [29/Jun/2019:21:04:45 +0000] GET /disaster-resilience/ HTTP/1.1 200 23930 - ELB-HealthChecker/2.0 0.125068
10.180.13.204 [29/Jun/2019:21:05:07 +0000] GET /disaster-resilience/ HTTP/1.1 200 33349 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 0.071944

MikeTheCanuck commented 5 years ago

SecurityGroup? OK, so I've been messing with the SecurityGroup assigned to the task and seeing if it was too restrictive. After a number of iterations I finally opened it up to all protocols and IPs, and we're still getting:

service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Health checks failed with these codes: [400]).

And I'm still seeing no evidence in CloudWatch that the container app is seeing any requests.

Couple of possibilities:

The Security Group I've been editing isn't the one that's actually attached to the Task (unlikely, but possible)
The 400 error is referencing an issue at some layer of the cluster in front of the containerized app (Django itself, Docker, or the "Fargate abstraction layer" (whatever that is or isn't))
The 400 error is referring to something much different than "the server tells the health check app that there was something wrong with the request".

MikeTheCanuck commented 5 years ago

Subnets? Here's another thought, after trolling through the templates and looking at the stuff that isn't the same as the EC2 services: what about the Subnet to which the Fargate task is deployed? I see that it's specified as Subnets: !GetAtt VPC.Outputs.PrivateSubnets

...in the master yaml, and it's not immediately clear if that's the same subnets as for the EC2 tasks, or if they're on the PublicSubnets.

Digging through the ECS-cluster.yaml and its params from master.yaml, in fact it is the same subnets:

    ECS:
        Type: AWS::CloudFormation::Stack
        Properties:
            TemplateURL: https://s3-us-west-2.amazonaws.com/hacko-infrastructure-cfn/infrastructure/ecs-cluster.yaml
            Parameters:
                ...
                Subnets: !GetAtt VPC.Outputs.PrivateSubnets

MikeTheCanuck commented 5 years ago

Network? Next suspect on the list is this section of the Service definition:

            NetworkConfiguration:
                AwsvpcConfiguration:
                    AssignPublicIp: ENABLED
                    SecurityGroups:
                      - !Ref SecurityGroup
                    Subnets:
                      - !Select [ 0, !Ref Subnets ]
                      - !Select [ 1, !Ref Subnets ]

That bit about AssignPublicIp: ENABLED makes me wonder if we're giving AWS incompatible instructions - "Private Subnet" but assigning the service a Public IP address?

MikeTheCanuck commented 5 years ago

Network Interfaces? Finally in thoroughly crawling the resources EC2, I come across this page, which makes me wonder (a) is there a way to find out what network interface & subnet are connected to the Fargate task, and (b) do we have it hooked up correctly? (Just because things look fine to the eyeball by comparing the template content to the EC2 tasks doesn't make it so.)

MikeTheCanuck commented 5 years ago

Override the health check? This article gave me a crazy idea: https://www.reddit.com/r/aws/comments/8grpgk/what_does_health_checks_failed_with_these_codes/

What if (even just temporarily, to get one level deeper into the cluster & logs) we told the Health Check to accept 400 as "healthy"?

Edit: that was unexpected - seems to be working but not in any explainable way:

Good: the health check passes now, and ECS reports that the task has reached a steady state hours ago
Good: there are now responses to http://service.civicpdx.org/budget (though still getting a 503 on https://service.civicpdx.org/budget)
Bad: I cannot find any trace of the actual requests I'm making (to verify the container is alive) in the associated CloudWatch logs - latest log was created 15 hours ago, and it doesn't show any activity (neither the regular polling from ALB's health check activity, nor my requests)
Bad: I have no way of determining exactly where the API that is responding is running, but I can confirm that the CloudWatch log group for the old EC2-based Budget tasks is definitely not there

By all measures, the cluster is successfully sending the /budget requests to a container that runs the Budget API, but I cannot see any confirmation in CloudWatch that the Fargate task that's healthy and running is the one that's responding to those requests.

In the past we've run into a situation like this, and though I can't remember the details precisely, I do recall that the lesson was "don't trust that you're running the container you think you are until you can track it down and prove it".

MikeTheCanuck commented 5 years ago

Update on the lack of CloudWatch logs for the Budget task via Fargate:

I've redeployed the EC2-based Budget task and confirmed that it also doesn't show any incoming requests in the CloudWatch logs, so this may not be indicative of a problem.

Reviewing which Django apps are showing incoming requests in their CW logs:

logging GETs:
- 2017 Emergency (e.g. 10.180.21.196 [29/Jun/2019:16:07:41 +0000] GET /emergency/ HTTP/1.1 200 10236 - ELB-HealthChecker/2.0 0.173346)
- Endpoints
- 2017 Civic front end (but only in the form Servicing request for /2017/)
- 2017 Transportation (DEBUG logging level, odd)
- 2018 Civic front end (similarly only the string Servicing request for /2018/)
- 2018 Disaster Resilience
- 2018 Housing Affordability
- 2018 Local Elections
- 2018 Neighbourhood Development
- 2018 Transportation Systems
not logging GETs:
- 2017 Budget
- 2017 Homelessness
- 2017 Housing

So next I'll try deploying Emergency Response in Fargate and verify that I'm seeing requests logged in CloudWatch. If good, then it'll be time to start setting up the 2019 API containers (though in a way that will allow Django apps returning 400 to get scheduled into service, which is a piece of Tech Debt we're liable to forget if I don't log it soon).

MikeTheCanuck commented 5 years ago

OK I think we nailed it:

I'm seeing health check entries and external requests logged in the Fargate-deployed Emergency Response 2017 API's CloudWatch logs:
I noticed that I hadn't added the ListenerRuleTls to the Budget Fargate service yaml, so I added that to both Budget and Emergency, and they're both now responding on https://service.civicpdx.org as well as http://service.civicpdx.org

Both the 2017 Budget and 2017 Emergency Response services are now deploying and remaining healthy on Fargate. This PR enabled the whole shebang: https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/69

Migrating the remaining EC2-based ECS services should thus just be a matter of copying the pattern established with these two.

MikeTheCanuck commented 4 years ago

Status of this effort: 2017 Emergency Services container is in good shape. It survived even the refactoring effort that is underway here: https://github.com/hackoregon/hackoregon-aws-infrastructure/issues/76

However, 2017 Budget container was not so lucky. Container is deemed unhealthy by ALB under that refactored deploy.

Referencing @DingoEatingFuzz 's recent document https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-Update-an-API-repo-to-use-Fargate.md, the most likely culprit is some out-of-sync'ness with the recent changes that Michael made to Emergency Services in PR's 124-126 here: https://github.com/hackoregon/emergency-response-backend/pulls?q=is%3Apr+is%3Aclosed

As well as the direct commits on July 1, 2019: https://github.com/hackoregon/emergency-response-backend/commits/master

HOWEVER, also important to note that the Budget container eventually stabilizes and starts answering healthily when deployed on the current master branch configuration - which is this commit: https://github.com/hackoregon/hackoregon-aws-infrastructure/tree/3a0fe7b1f1481c421097ab3335653bbeb26aecda

Thus, there's some difference between the 2017-budget-api.yaml (and its pass-ins from master.yaml) and the 2017-fargate-api.yaml (and its associated pass-ins) that causes the same container image to succeed with the former and fail with the latter.

MikeTheCanuck commented 4 years ago

Further status on updating the Budget container: along with the fixes made to emergency-service, there have been a number of changes made in a branch on the Budget repo to (a) figure out how the Travis sequence works and (b) get the pytest test cases to pass in Travis.

What I'm finding is that the Emergency Services container doesn't actually leverage SSM parameters to pass its build & test, which explains why it still deploys to ECR without actually having permission to read from SSM.

Additional changes I've had to make that weren't recently done for the Emergency container (but may have been present for longer) include:

adding two environment variables to the Travis env vars: DOCKER_REPO_NAMESPACE and PROJECT_NAME
adding a ton of extra environment variables to the relevant travis-docker-compose.yml (to allow docker-compose to pass them into the container)
changing the AWS creds configured in Travis env vars to the IAM user 2018-ecs-ecr-deployer account.
EDIT: update the ECS_SERVICE_NAME variable in Travis to allow Travis to deploy the container image from ECR to ECS

That IAM user will need to be augmented with a policy that allows access to the /production/2017/API/ SSM namespace, like follows:

{
     "Version": "2012-10-17",
     "Statement": [
         {
             "Effect": "Allow",
             "Action": [
                 "ssm:DescribeParameters"
             ],
             "Resource": "*"
         },
         {
             "Sid": "Stmt1482841904000",
             "Effect": "Allow",
             "Action": [
                 "ssm:GetParameters"
             ],
             "Resource": [
                 "arn:aws:ssm:us-west-2:845828040396:parameter/production/2017/*"
             ]
         },
         {
             "Sid": "Stmt1482841948000",
             "Effect": "Allow",
             "Action": [
                 "kms:Decrypt"
             ],
             "Resource": [
                 "arn:aws:kms:us-west-2:845828040396:key/0280a59b-d8f5-44e0-8b51-80aec2f27275"
             ]
         }
     ]
 }

(See https://github.com/hackoregon/civic-devops/issues/114#issuecomment-391201554 for the first time we setup such a policy.)

MikeTheCanuck commented 4 years ago

With PR 173 (https://github.com/hackoregon/team-budget/pull/173) in Team Budget repo, we have successfully refactored to get build and test steps working again (even with the new SSM world view of configuration management), and for the validated container image to be published to ECR.

However there is an error during the deploy stage that prevents Travis from successfully getting the new Team Budget container image deployed to ECS:

An error occurred (ClientException) when calling the DescribeTaskDefinition operation: Unable to describe task definition.

I don't recall ever seeing this error, and while the usual victim to blame is AWS IAM policies, we'll have to do some research to figure out how to fully CD again.

OTOH, with a new container image successfully deployed into ECR, we can manually update the CF cluster and see how well that container image behaves in ECS.

MikeTheCanuck commented 4 years ago

The ClientException error turns out to be a predictable if completely forgotten detail:

every Travis repo needs to know what ECS cluster and service they're attempting to deploy the ECR-uploaded container image to
the cluster name (so far) has remained the same, but the service name changed immediately as soon as we switched from EC2 to Fargate - Amazon generated a new Service name when we defined a new service

I just tried updating the service name, and it appears from ECS that it received a new Task Definition, but boy howdy are there are lot of instances of this warning in the Travis job log before it completed:

/home/travis/.local/lib/python2.7/site-packages/urllib3/util/ssl_.py:365: SNIMissingWarning: An HTTPS request has been made, but the SNI (Server Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
SNIMissingWarning

MikeTheCanuck commented 4 years ago

If things with the next 2017 API containers go really haywire, read up on #158 and see if there's other corrections needed.

hackoregon / civic-devops

Convert 2017 & 2018 API services from EC2 to Fargate #244