MikeTheCanuck opened this issue 5 years ago
I've been working the problem of making Fargate work in our existing CF stack (#238) and have it to the point where it appears that the container deploys and starts up as an ECS task, and is able to configure itself using remote resources (e.g. SSM parameters).
Next is to come at it from the opposite angle - can we host a known-good container as a Fargate task and have it successfully respond to outside requests? To weave these two together, the best approach is to try converting an existing EC2-based task to Fargate. I've picked the Budget 2017 API container as (a) I know how it's supposed to work, having been on the team and (b) it's very unlikely to be getting traffic so downtime is at its most tolerable.
I've adapted what I believe to be a working Fargate template to the Budget-Service and at this point it appears that once the container is up and running, ALB's health check tests are returning a 400 error:
```
service hacko-integration-FargateBudget-1JWOQ5F2TR1R1-Service-5BYKU2RAR8PL (port 8000) is unhealthy in target-group hacko-Targe-1VL2ENM8F80KM due to (reason Health checks failed with these codes: [400]).
```
I had a theory that the 400 was due to the Security Groups configuration allowing access only to ports 80 and 443, but not to the port (8000) on which the containerized app is listening - the port to which the Health Check must be able to connect to verify that the container is ready to take requests.
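If that theory holds, the fix would be an ingress rule opening the container port to the ALB. A minimal sketch of what that rule might look like in CloudFormation - the resource and parameter names here are illustrative, not taken from our actual templates:

```yaml
# Hypothetical sketch: let the ALB's health checker reach the task's
# container port. Resource/parameter names are illustrative.
FargateTaskSecurityGroup:
  Type: AWS::EC2::SecurityGroup
  Properties:
    GroupDescription: Allow the ALB to reach the awsvpc task on its container port
    VpcId: !Ref VPC
    SecurityGroupIngress:
      - IpProtocol: tcp
        FromPort: 8000
        ToPort: 8000
        SourceSecurityGroupId: !Ref LoadBalancerSecurityGroup
```

Scoping the source to the ALB's own security group (rather than 0.0.0.0/0) keeps the container port unreachable from outside the load balancer.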
The Security Group was explicitly configured in the Fargate examples I modelled this after (though I didn't know why - I just thought "there must be some reason why Fargate requires this"). So I tried commenting out all references to the Security Group in the master and task templates.
That still didn't result in a container that was deemed "healthy", even though it was deployed into the cluster, and even though according to CloudWatch logs the application in the container completed all its needed startup (i.e. I can see no errors in the logs, and I see that the app is `Listening at: http://0.0.0.0:8000 (10)`). However, this time ECS reports a different error when deregistering the task:
```
service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Request timed out).
```
So I'm going to dig deeper into the Security Group configuration and ensure it's explicitly allowing incoming traffic on the port(s) that the container is configured to listen on.
Note: I'm currently working from this experimental branch (i.e. I don't expect to PR this or merge to master): https://github.com/hackoregon/hackoregon-aws-infrastructure/tree/civic-devops-244
I've uncommented the Security Group references, generated a new security group, and explicitly granted access to the following port combos (from:to):
...and we're back to the original issue - i.e. the Security Group is getting created just as it was last week, but the extra port combos don't solve the "400" problem:
```
service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Health checks failed with these codes: [400]).
```
The very first hit on that error message leads me back to this article that got me started down this road, so I'll take another stab at other possibilities: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html
Assumption: "reason Health checks failed with these codes: [400]" indicates that some listener in the cluster responded with an HTTP 400 ("Bad Request") response code.
We can eliminate the container application itself, since I can verify from CloudWatch logs that the container never records an incoming HTTP request - the final entries in the CloudWatch log for each instance of this "unhealthy" container are:
```
126 static files copied to '/code/staticfiles', 126 post-processed.
[2019-06-29 21:02:13 +0000] [10] [INFO] Starting gunicorn 19.7.1
[2019-06-29 21:02:13 +0000] [10] [INFO] Listening at: http://0.0.0.0:8000 (10)
[2019-06-29 21:02:13 +0000] [10] [INFO] Using worker: gevent
[2019-06-29 21:02:13 +0000] [13] [INFO] Booting worker with pid: 13
/code/budget_proj/wsgi.py:19: RuntimeWarning: Patching more than once will result in the union of all True parameters being patched
  from gevent import monkey; monkey.patch_all(thread=False)
```
Which are the same as for a healthy instance of this same container, without the addition of any entries like the following:

```
10.180.21.196 [29/Jun/2019:21:04:45 +0000] GET /disaster-resilience/ HTTP/1.1 200 23930 - ELB-HealthChecker/2.0 0.125068
10.180.13.204 [29/Jun/2019:21:05:07 +0000] GET /disaster-resilience/ HTTP/1.1 200 33349 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0 0.071944
```
SecurityGroup? OK, so I've been messing with the SecurityGroup assigned to the task, to see whether it was too restrictive. After a number of iterations I finally opened it up to all protocols and IPs, and we're still getting:
```
service hacko-integration-FargateBudget-V9Y1T0QMZL36-Service-1EAD5R8VV1XAC (port 8000) is unhealthy in target-group hacko-Targe-1WBYRP9UXOBPB due to (reason Health checks failed with these codes: [400]).
```
And I'm still seeing no evidence in CloudWatch that the container app is seeing any requests.
Couple of possibilities:
Subnets? Here's another thought, after trolling through the templates and looking at the stuff that isn't the same as in the EC2 services: what about the Subnet to which the Fargate task is deployed? I see that it's specified as `Subnets: !GetAtt VPC.Outputs.PrivateSubnets` in the master yaml, and it's not immediately clear whether those are the same subnets as for the EC2 tasks, or whether they're on the PublicSubnets.
Digging through the ecs-cluster.yaml and its params from master.yaml, it is in fact the same subnets:

```yaml
ECS:
  Type: AWS::CloudFormation::Stack
  Properties:
    TemplateURL: https://s3-us-west-2.amazonaws.com/hacko-infrastructure-cfn/infrastructure/ecs-cluster.yaml
    Parameters:
      ...
      Subnets: !GetAtt VPC.Outputs.PrivateSubnets
```
Network? Next suspect on the list is this section of the Service definition:

```yaml
NetworkConfiguration:
  AwsvpcConfiguration:
    AssignPublicIp: ENABLED
    SecurityGroups:
      - !Ref SecurityGroup
    Subnets:
      - !Select [ 0, !Ref Subnets ]
      - !Select [ 1, !Ref Subnets ]
```
That bit about `AssignPublicIp: ENABLED` makes me wonder if we're giving AWS incompatible instructions - "Private Subnet", but assigning the service a Public IP address?
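For comparison, a consistent private-subnet setup would presumably look like this - a sketch assuming a NAT gateway already exists on the private subnets for outbound traffic, not our verified configuration:

```yaml
NetworkConfiguration:
  AwsvpcConfiguration:
    # No public IP on the task ENI; outbound traffic (e.g. image pulls)
    # would have to go out through a NAT gateway on the private subnets.
    AssignPublicIp: DISABLED
    SecurityGroups:
      - !Ref SecurityGroup
    Subnets:
      - !Select [ 0, !Ref Subnets ]
      - !Select [ 1, !Ref Subnets ]
```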
Network Interfaces? Finally, in thoroughly crawling the EC2 resources, I came across this page, which makes me wonder (a) is there a way to find out which network interface & subnet are connected to the Fargate task, and (b) do we have it hooked up correctly? (Just because things look fine to the eyeball when comparing the template content to the EC2 tasks doesn't make it so.)
Override the health check? This article gave me a crazy idea: https://www.reddit.com/r/aws/comments/8grpgk/what_does_health_checks_failed_with_these_codes/
What if (even just temporarily, to get one level deeper into the cluster & logs) we told the Health Check to accept 400 as "healthy"?
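On an ALB target group, the set of "healthy" response codes is controlled by the `Matcher` property, which accepts a list or range of HTTP codes. A sketch of that temporary diagnostic change - the resource name and health check path here are illustrative:

```yaml
FargateTargetGroup:
  Type: AWS::ElasticLoadBalancingV2::TargetGroup
  Properties:
    VpcId: !Ref VPC
    Port: 8000
    Protocol: HTTP
    TargetType: ip              # required for awsvpc/Fargate tasks
    HealthCheckPath: /budget    # illustrative path
    Matcher:
      HttpCode: 200,400         # temporarily treat 400 as "healthy"
```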
Edit: that was unexpected - seems to be working but not in any explainable way:
By all measures, the cluster is successfully sending the /budget requests to a container that runs the Budget API, but I cannot see any confirmation in CloudWatch that the Fargate task that's healthy and running is the one that's responding to those requests.
In the past we've run into a situation like this, and though I can't remember the details precisely, I do recall that the lesson was "don't trust that you're running the container you think you are until you can track it down and prove it".
Update on the lack of CloudWatch logs for the Budget task via Fargate:
I've redeployed the EC2-based Budget task and confirmed that it also doesn't show any incoming requests in the CloudWatch logs, so this may not be indicative of a problem.
Reviewing which Django apps are showing incoming requests in their CW logs:

- `Servicing request for /2017/`
- `Servicing request for /2018/`

So next I'll try deploying Emergency Response in Fargate and verify that I'm seeing requests logged in CloudWatch. If good, then it'll be time to start setting up the 2019 API containers (though in a way that will allow Django apps returning 400 to get scheduled into service, which is a piece of Tech Debt we're liable to forget if I don't log it soon).
OK I think we nailed it:
Both the 2017 Budget and 2017 Emergency Response services are now deploying and remaining healthy on Fargate. This PR enabled the whole shebang: https://github.com/hackoregon/hackoregon-aws-infrastructure/pull/69
Migrating the remaining EC2-based ECS services should thus just be a matter of copying the pattern established with these two.
Status of this effort: 2017 Emergency Services container is in good shape. It survived even the refactoring effort that is underway here: https://github.com/hackoregon/hackoregon-aws-infrastructure/issues/76
However, the 2017 Budget container was not so lucky: the container is deemed unhealthy by the ALB under that refactored deploy.
Referencing @DingoEatingFuzz's recent document https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-Update-an-API-repo-to-use-Fargate.md, the most likely culprit is something out of sync with the recent changes that Michael made to Emergency Services in PRs 124-126 here: https://github.com/hackoregon/emergency-response-backend/pulls?q=is%3Apr+is%3Aclosed
As well as the direct commits on July 1, 2019: https://github.com/hackoregon/emergency-response-backend/commits/master
HOWEVER, it's also important to note that the Budget container eventually stabilizes and starts answering healthily when deployed on the current `master` branch configuration - which is this commit: https://github.com/hackoregon/hackoregon-aws-infrastructure/tree/3a0fe7b1f1481c421097ab3335653bbeb26aecda
Thus, there's some difference between the 2017-budget-api.yaml (and its pass-ins from master.yaml) and the 2017-fargate-api.yaml (and its associated pass-ins) that causes the same container image to succeed with the former and fail with the latter.
Further status on updating the Budget container: along with the fixes made to emergency-service, there have been a number of changes made in a branch on the Budget repo to (a) figure out how the Travis sequence works and (b) get the `pytest` test cases to pass in Travis.
What I'm finding is that the Emergency Services container doesn't actually leverage SSM parameters to pass its build & test, which explains why it still deploys to ECR without having permission to read from SSM.
Additional changes I've had to make that weren't recently done for the Emergency container (but may have been present for longer) include:

- `travis-docker-compose.yml` (to allow `docker-compose` to pass them into the container)
- the `2018-ecs-ecr-deployer` account.

That IAM user will need to be augmented with a policy that allows access to the `/production/2017/API/` SSM namespace, like follows:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:DescribeParameters"
      ],
      "Resource": "*"
    },
    {
      "Sid": "Stmt1482841904000",
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters"
      ],
      "Resource": [
        "arn:aws:ssm:us-west-2:845828040396:parameter/production/2017/*"
      ]
    },
    {
      "Sid": "Stmt1482841948000",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt"
      ],
      "Resource": [
        "arn:aws:kms:us-west-2:845828040396:key/0280a59b-d8f5-44e0-8b51-80aec2f27275"
      ]
    }
  ]
}
```
(See https://github.com/hackoregon/civic-devops/issues/114#issuecomment-391201554 for the first time we set up such a policy.)
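For reference, the docker-compose side of the `travis-docker-compose.yml` change above might look roughly like this - the service name and the exact variable list are assumptions on my part, not the actual file contents:

```yaml
version: '3'
services:
  budget-service:          # illustrative service name
    build: .
    environment:
      # Listing variables without values passes Travis's exported
      # environment (e.g. AWS credentials for SSM reads) into the container.
      - AWS_ACCESS_KEY_ID
      - AWS_SECRET_ACCESS_KEY
      - AWS_DEFAULT_REGION
```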
With PR 173 (https://github.com/hackoregon/team-budget/pull/173) in Team Budget repo, we have successfully refactored to get build and test steps working again (even with the new SSM world view of configuration management), and for the validated container image to be published to ECR.
However, there is an error during the deploy stage that prevents Travis from successfully getting the new Team Budget container image deployed to ECS:

```
An error occurred (ClientException) when calling the DescribeTaskDefinition operation: Unable to describe task definition.
```
I don't recall ever seeing this error before, and while the usual suspect is AWS IAM policies, we'll have to do some research to figure out how to fully CD again.
OTOH, with a new container image successfully deployed into ECR, we can manually update the CF cluster and see how well that container image behaves in ECS.
The ClientException error turns out to be a predictable, if completely forgotten, detail:

```
/home/travis/.local/lib/python2.7/site-packages/urllib3/util/ssl_.py:365: SNIMissingWarning: An HTTPS request has been made, but the SNI (Server Name Indication) extension to TLS is not available on this platform. This may cause the server to present an incorrect TLS certificate, which can cause validation failures. You can upgrade to a newer version of Python to solve this. For more information, see https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
SNIMissingWarning
```
If things with the next 2017 API containers go really haywire, read up on #158 and see if there's other corrections needed.
**Summary**

Migrate all existing containers to Fargate.

**Definition of Done**

All existing containers are running as Fargate tasks in the existing infrastructure.