hackoregon / devops-17

deployment tools for Hack Oregon projects

ALB not sending traffic to newly-deployed container instances #51

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 7 years ago

Superset problem we've been attacking from many angles:

Many projects (Housing, Emergency Response, Budget) cannot get their newly-deployed container images to run stably in CloudFormation/ECS. It appears that ALB checks the newly-started container instances with the basic HTTP health check, determines that the container instance is unhealthy, and never introduces it to ALB traffic. Instead, CF/ALB/ECS starts two new instances based on the same "latest" container image and terminates the unhealthy ones. This cycle repeats indefinitely, because the container image never achieves "healthy" status under the ALB Health Check conditions.
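To reproduce what the health check sees, here is a rough Python sketch that sends a single GET to a container and reports whether the answer would count as healthy. The function name `probe`, the 5-second timeout, and the expected status of 200 are assumptions for illustration, not values read from our target group. It uses `http.client` so that redirects are not followed.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, timeout=5.0, expected=200):
    """Send one GET and return (healthy, detail).

    Hypothetical helper: the timeout and expected status are illustrative
    defaults, not the settings of any real target group.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(
        parts.hostname, parts.port or 80, timeout=timeout
    )
    try:
        conn.request("GET", parts.path or "/")
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, timeout, or a malformed/empty response.
        return False, f"no response: {exc!r}"
    finally:
        conn.close()
    # Any status other than the expected one counts as a failed check.
    return status == expected, f"HTTP {status}"
```

Running `probe("http://<container-ip>:<port>/")` from a host inside the same network should show whether the failure is a timeout, a refused connection, or an unexpected status code.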

ALB/ECS reports at least two types of failure, each with one of these errors:

While the current working hypothesis is that the Django app startup is the root cause, we may be able to work around this problem or get just under the "unhealthy" threshold to at least begin rolling new container image deploys into the Integration environment.
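To make the "unhealthy" threshold idea concrete, here is a simplified Python model of consecutive-check counting. It is a sketch only, not ALB's actual implementation: the function name `walk`, the state labels, and the threshold values in the usage lines are all assumptions chosen for illustration. It assumes a target is marked unhealthy after a run of consecutive failed checks and healthy after a run of consecutive passed checks, and that an unhealthy target is replaced immediately.

```python
def walk(results, healthy_threshold, unhealthy_threshold):
    """Return the target state after each check in a simplified model.

    results: iterable of booleans, True meaning the check passed.
    Stops at the first "unhealthy" state, since in this model the
    container would be terminated and replaced at that point.
    """
    state = "initial"
    ok_streak = fail_streak = 0
    states = []
    for ok in results:
        if ok:
            ok_streak, fail_streak = ok_streak + 1, 0
            if ok_streak >= healthy_threshold:
                state = "healthy"
        else:
            ok_streak, fail_streak = 0, fail_streak + 1
            if fail_streak >= unhealthy_threshold:
                state = "unhealthy"
        states.append(state)
        if state == "unhealthy":
            break
    return states


# A slow-starting app that fails its first two checks, then passes.
slow_start = [False, False, True, True, True]
strict = walk(slow_start, healthy_threshold=2, unhealthy_threshold=2)
lenient = walk(slow_start, healthy_threshold=2, unhealthy_threshold=5)
```

In this model the strict setting kills the container before it ever answers, while the lenient one lets it reach "healthy". If the real behaviour is similar, raising the unhealthy threshold or the check interval would buy the Django app more startup time.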

This is a major blocking issue, and all possible fixes/workarounds should be pursued to get container instances online ASAP. Budget is working off old code, Emergency Response is blocked on launching significant new functionality, and Housing cannot even get their first container running.

Here's the best article I've found so far to help us dig deeper into the ALB problem space: http://stackoverflow.com/questions/20684459/how-to-debug-failed-aws-elastic-load-balancer-health-checks

MikeTheCanuck commented 7 years ago

"Request timed out" in the case of the Budget container was a 'simple' case of the load balancer still not configured as an ALLOWED_HOST in Django, because the Docker container wasn't configuring the correct Django settings file. This commit made the necessary update to the Dockerfile: https://github.com/hackoregon/team-budget/pull/99/commits/a99b920062ba498d20a93d7b19deede0cc2d0cfa