hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Django app [CRITICAL] WORKER TIMEOUT in Docker on ECS behind an ALB load balancer #157

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Ah, last year's most mysterious unresolved bug comes back to us on the final 2018 API to deploy.

Here are the logs, cycling through the same “boot some workers, monkey-patch those workers, time out/abort those workers, exit those workers” sequence over and over:

[2018-06-05 02:16:43 +0000] [57] [INFO] Booting worker with pid: 57
[2018-06-05 02:16:43 +0000] [58] [INFO] Booting worker with pid: 58
[2018-06-05 02:16:43 +0000] [59] [INFO] Booting worker with pid: 59
gunicorn_config.py:80: RuntimeWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier.
monkey.patch_all()
[2018-06-05 02:16:43 +0000] [57] [INFO] Made Psycopg2 Green
gunicorn_config.py:80: RuntimeWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier.
monkey.patch_all()
[2018-06-05 02:16:43 +0000] [59] [INFO] Made Psycopg2 Green
gunicorn_config.py:80: RuntimeWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier.
monkey.patch_all()
[2018-06-05 02:16:44 +0000] [58] [INFO] Made Psycopg2 Green
[2018-06-05 02:17:06 +0000] [64] [INFO] Booting worker with pid: 64
gunicorn_config.py:80: RuntimeWarning: Monkey-patching ssl after ssl has already been imported may lead to errors, including RecursionError on Python 3.6. Please monkey-patch earlier.
monkey.patch_all()
[2018-06-05 02:17:06 +0000] [64] [INFO] Made Psycopg2 Green
[2018-06-05 02:17:38 +0000] [53] [CRITICAL] WORKER TIMEOUT (pid:57)
[2018-06-05 02:17:39 +0000] [57] [INFO] worker received SIGABRT signal
[2018-06-05 02:17:39 +0000] [53] [CRITICAL] WORKER TIMEOUT (pid:58)
[2018-06-05 02:17:40 +0000] [58] [INFO] worker received SIGABRT signal
[2018-06-05 02:17:40 +0000] [53] [CRITICAL] WORKER TIMEOUT (pid:59)
[2018-06-05 02:17:41 +0000] [53] [CRITICAL] WORKER TIMEOUT (pid:64)
[2018-06-05 02:17:41 +0000] [59] [INFO] worker received SIGABRT signal
[2018-06-05 02:17:42 +0000] [59] [INFO] Worker exiting (pid: 59)
[2018-06-05 02:17:41 +0000] [57] [INFO] Worker exiting (pid: 57)
[2018-06-05 02:17:42 +0000] [58] [INFO] Worker exiting (pid: 58)

Each worker survives for less than a minute before the master times it out and boots a replacement.
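For context on the timeout mechanism: gunicorn's master process (pid 53 above) kills any worker that fails to notify it within the configured timeout, which produces exactly this boot / WORKER TIMEOUT / SIGABRT / exit loop. Below is a minimal gunicorn_config.py sketch, assuming the gevent worker class and psycogreen (the "Made Psycopg2 Green" lines suggest both are in use); it is not the project's actual config, just an illustration of where the monkey-patching and timeout knobs live and why the RuntimeWarning asks for patching earlier.

# gunicorn_config.py - illustrative sketch only, not the repo's actual file.
# Assumes the gevent worker class and psycogreen, as hinted by the logs above.

# Patch at the very top, before any other import pulls in ssl, to avoid the
# "Monkey-patching ssl after ssl has already been imported" RuntimeWarning.
from gevent import monkey
monkey.patch_all()

bind = "0.0.0.0:8000"
worker_class = "gevent"
workers = 3

# A worker that does not report back to the master within `timeout` seconds
# is killed and replaced - that is the [CRITICAL] WORKER TIMEOUT above.
timeout = 60  # assumed value; gunicorn's default is 30 seconds

def post_fork(server, worker):
    # Make psycopg2 cooperative with gevent inside each worker process.
    from psycogreen.gevent import patch_psycopg
    patch_psycopg()
    worker.log.info("Made Psycopg2 Green")

No amount of timeout tuning helps if the workers themselves are stalled, though - the comments below track the real culprit down to the container running out of memory.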

MikeTheCanuck commented 6 years ago

Cross-reference with: https://github.com/hackoregon/devops-17/issues/49 https://github.com/hackoregon/team-budget/issues/96

MikeTheCanuck commented 6 years ago

Further investigation reveals that the container was consuming its max memory (100 MB):

[screenshot taken 2018-06-04 19:49: ECS memory-utilization graph showing the container pinned at its 100 MB limit]

We've observed this kind of "memory spike and plateau at the allocation limit" behaviour for other containers, and this has always meant that the container needs more memory allocated.

I boosted the allocation to 200 MB and that barely squeaked by - the un-swaggerized container is finally responding. This is what the memory load looks like from within the EC2 host that runs ECS:

docker ps --format "{{.Names}}" | xargs docker stats

CONTAINER ID        NAME                                                                                            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
742b928eb9b2        ecs-housing-affordability-service-9-housing-affordability-service-a692abc4e1a7a1c37900          0.10%               196.3MiB / 200MiB   98.15%              67.4kB / 177kB      158MB / 1.01MB      0

We're boosting the allocation to 300 MB, because the roughly 4 MB of headroom left at 200 MB is an uncomfortably small margin for future application growth.
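For reference, here is a hedged sketch of how that memory bump can be applied programmatically with boto3: register a new revision of the task definition with a larger hard memory limit, then point the service at the new revision so ECS replaces the running task. The family, cluster, and service names below are placeholders, not the actual Hack Oregon values.

# Sketch: raise the hard memory limit on an ECS task definition and roll the
# service onto the new revision. Placeholder names throughout.
import boto3

ecs = boto3.client("ecs")
FAMILY = "housing-affordability-service"   # assumed task definition family
CLUSTER = "example-ecs-cluster"            # placeholder
SERVICE = "housing-affordability-service"  # placeholder

# Grab the latest revision and raise the container-level hard limit (MiB).
current = ecs.describe_task_definition(taskDefinition=FAMILY)["taskDefinition"]
container_defs = current["containerDefinitions"]
container_defs[0]["memory"] = 300  # was 100, then 200, per the comments above

new_arn = ecs.register_task_definition(
    family=current["family"],
    containerDefinitions=container_defs,
)["taskDefinition"]["taskDefinitionArn"]

# Point the service at the new revision; ECS schedules a replacement task.
ecs.update_service(cluster=CLUSTER, service=SERVICE, taskDefinition=new_arn)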

MikeTheCanuck commented 6 years ago

Summary

When a container is logging these [CRITICAL] WORKER TIMEOUT messages, check whether it is maxing out its memory allocation, and if so allocate more memory. Chances are the container will stop cycling its workers, and the ALB will likely deem the target healthy enough to route traffic to it.
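To make that check quick on the EC2 host, here is a small sketch using the Docker SDK for Python (pip install docker) that flags any running container sitting near its memory limit; the 90% threshold is an arbitrary assumption, not an official cut-off.

# Flag containers that are close to their memory limit - the pattern that,
# per this issue, correlates with gunicorn's WORKER TIMEOUT cycling.
import docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)      # one-shot stats snapshot
    mem = stats.get("memory_stats", {})
    usage, limit = mem.get("usage"), mem.get("limit")
    if not usage or not limit:
        continue
    pct = 100.0 * usage / limit
    flag = "  <- near limit; likely worker-timeout culprit" if pct > 90 else ""
    line = f"{container.name:55s} {usage / 2**20:7.1f} MiB / {limit / 2**20:7.1f} MiB ({pct:5.1f}%)"
    print(line + flag)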

MikeTheCanuck commented 6 years ago

Coda

After boosting the container's memory allocation to 300 MB, it seems to have relaxed under load and taken up even more memory:

docker ps --format "{{.Names}}" | xargs docker stats

CONTAINER ID        NAME                                                                                            CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
879d32b309df        ecs-housing-affordability-service-10-housing-affordability-service-a8d1d0e69d888cf22900         0.10%               221.6MiB / 300MiB   73.86%              12.3MB / 895kB      99.7MB / 1.07MB     0

Lesson: never run that close to the wire.