hackoregon / devops-17

deployment tools for Hack Oregon projects

Gunicorn+Django+ALB continuously throwing [CRITICAL] WORKER TIMEOUT errors #49

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 7 years ago

Summary

Django apps are timing out frequently, and we are concerned this is one of the reasons why AWS is continuously terminating and restarting "unhealthy" containers for most of the project backends.

Details

Many of the backend projects are experiencing repeated and troubling WORKER TIMEOUT events.

E.g. a typical CloudWatch log for a Budget backend container instance shows this kind of output:

Running docker-entrypoint.sh...
[2017-03-30 15:43:54 +0000] [10] [INFO] Starting gunicorn 19.7.1
[2017-03-30 15:43:54 +0000] [10] [INFO] Listening at: http://0.0.0.0:8000 (10)
[2017-03-30 15:43:54 +0000] [10] [INFO] Using worker: sync
[2017-03-30 15:43:54 +0000] [13] [INFO] Booting worker with pid: 13
[2017-03-30 15:44:25 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:13)
[2017-03-30 15:44:25 +0000] [13] [INFO] Worker exiting (pid: 13)
[2017-03-30 15:44:29 +0000] [15] [INFO] Booting worker with pid: 15
[2017-03-30 15:44:59 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:15)
[2017-03-30 15:44:59 +0000] [15] [INFO] Worker exiting (pid: 15)
[2017-03-30 15:45:03 +0000] [17] [INFO] Booting worker with pid: 17
[2017-03-30 15:45:35 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:17)
[2017-03-30 15:45:35 +0000] [17] [INFO] Worker exiting (pid: 17)
[2017-03-30 15:45:36 +0000] [19] [INFO] Booting worker with pid: 19
[2017-03-30 15:46:08 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:19)
[2017-03-30 15:46:08 +0000] [19] [INFO] Worker exiting (pid: 19)
[2017-03-30 15:46:09 +0000] [21] [INFO] Booting worker with pid: 21
[2017-03-30 15:46:40 +0000] [10] [CRITICAL] WORKER TIMEOUT (pid:21)
[2017-03-30 15:46:40 +0000] [21] [INFO] Worker exiting (pid: 21)
[2017-03-30 15:46:41 +0000] [23] [INFO] Booting worker with pid: 23

As far as I can tell, not long after the last log entry, AWS deems the container "unhealthy", spins up a replacement, and terminates the unhealthy one.

Some of us have anecdotally observed these same CRITICAL WORKER TIMEOUT messages when running the Django app in daemon mode (e.g. upon running start-proj.sh) on our developer computers as well.

It's unclear at this point whether this occurs when the app is receiving no traffic (idle) or only when the app receives requests.

This seems to be a widespread problem with our backend apps. That said, there's no obvious answer as to how we should address it.

Ideas

Links to discussions

http://stackoverflow.com/questions/37802350/gunicorn-django-critical-worker-timeout?rq=1
http://stackoverflow.com/questions/10855197/gunicorn-worker-timeout-error?rq=1
http://stackoverflow.com/questions/15463067/gunicorn-worker-timeout

Possible related issues

BrianHGrant commented 7 years ago

Building off the research listed above, I think it is worth experimenting with async workers.

As per the Gunicorn documentation on server design: http://docs.gunicorn.org/en/stable/design.html

"The default synchronous workers assume that your application is resource-bound in terms of CPU and network bandwidth. Generally this means that your application shouldn’t do anything that takes an undefined amount of time. An example of something that takes an undefined amount of time is a request to the internet. At some point the external network will fail in such a way that clients will pile up on your servers. So, in this sense, any web application which makes outgoing requests to APIs will benefit from an asynchronous worker."

I think the line about the undefined amount of time might be key here, as we cannot confirm how long a call to the db will take.
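
For reference, a minimal sketch of what that experiment could look like as a Gunicorn config file; the file name, worker count, and the eventlet choice are assumptions for illustration, not what's currently committed:

```python
# gunicorn_conf.py -- hypothetical sketch of switching from sync to async workers.
# Assumes the eventlet package is installed alongside gunicorn.
bind = "0.0.0.0:8000"
workers = 3                # illustrative count, not taken from our entrypoint script
worker_class = "eventlet"  # gunicorn's default is "sync"
timeout = 30               # gunicorn's default worker timeout, shown for clarity
```

The container entrypoint would then start gunicorn with something like `gunicorn -c gunicorn_conf.py <project>.wsgi` instead of passing flags inline.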

BrianHGrant commented 7 years ago

A local machine experiment does not show critical worker timeouts after a few queries on endpoints. Going to try a deploy.

MikeTheCanuck commented 7 years ago

I like that possibility as a short-term band-aid for this problem we're having just stabilizing the containers in ECS. It's certainly something we could test out to see if container stability significantly improves.

However, there's something that bugs me about the idea of treating the Django-to-DB communication as that unpredictable. Async APIs in a setting like this were designed to make external API calls (that have all the usual reliability and latency issues with the Internet) more robust.

Treating the "local"/"neighbour" database server as a far-flung, unreliable foreigner seems like a bad design choice overall. If our database is that unreliable, we should have the power to bring it back in line - networking/VPC issues, engine, database configuration, data model, query structures.

There are so many software deployments like ours that seem to stabilize quickly and stay reliable (at least on the time scale of days or weeks, not the seconds or minutes we're facing).

BrianHGrant commented 7 years ago

I do see the point in what you are saying. We should be able to have reliable communication within our VPC/ecosystem.

A question floating around in my mind is: how are we connecting to the DBs? Is it in a reliable "local" fashion, or are we effectively connecting through the front door, so to speak, when we specify the hostname and other connection details?

The fact that the creds are the same on my local machine as in the AWS cloud makes me think it might be the latter.

MikeTheCanuck commented 7 years ago

I think you're definitely on to something @BrianHGrant - can't believe I've forgotten this challenge for so long.

All the container deploys I've been involved in have just used the standard project_config.py and all have used the external IP address to connect. In the Travis build environment that makes sense, because there's no route to any private IPs or VPCs for Travis.

However, when the Django app calls from our VPC to the database, it's also hitting it via the external IP, and that almost certainly means lots of unnecessary latency.

Damn. And that's not a trivial thing to change, especially because of the Travis dependency.

We could, in theory:

  1. configure Django in the ECS setting to route using the private IP address for the database VM
  2. move some projects to RDS (those that are just using vanilla SQL like Budget, for example)
  3. deploy the database in a container running alongside the associated API container

If we had any experience with it to date, the "right" answer would be to start with (2) for as many projects as can tolerate it (i.e. those without a database-engine-specific dependency, which I worry our GIS environments might have). But since we have no experience with an RDS deployment, we're in danger of sinking days or weeks into figuring that deployment model out, when we have so many other critical tasks between now and Demo Day.

In the absence of (2), (1) sounds like the next-best option (though it adds more complexity to the branching setup we already have; a rough sketch follows below), and (3) seems least-good but might be our last resort.
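
To make (1) concrete, here's a minimal sketch of an environment-driven database host in the Django settings. The environment variable names are hypothetical placeholders, not what project_config.py actually uses; the ECS task definition would supply the private/VPC address while Travis keeps the external one:

```python
# settings.py (sketch) -- hypothetical env-driven DB config so the ECS task
# definition can point at the database VM's private address while the Travis
# build keeps using the external IP it can actually reach.
import os

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": os.environ.get("DB_NAME", "budget"),
        "USER": os.environ.get("DB_USER", ""),
        "PASSWORD": os.environ.get("DB_PASSWORD", ""),
        "HOST": os.environ.get("DB_HOST", ""),   # private IP in ECS, external IP in Travis
        "PORT": os.environ.get("DB_PORT", "5432"),
    }
}
```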

BrianHGrant commented 7 years ago

For (2), on the question of PostGIS: AWS RDS does bake in PostGIS support: https://aws.amazon.com/rds/postgresql/ and http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html#Appendix.PostgreSQL.CommonDBATasks.PostGIS

For Emergency Response, adding the extensions is baked into our DDL (I think that's the correct SQL term) or would be part of a pg_dump if we go that route. As you said, we have no experience with other aspects of the deployment, so I'm not sure what other issues would arise.

On the other note, this build is using the 'eventlet' package for async workers, but it's still not resulting in an updated deploy.

This might be relevant as well, https://github.com/benoitc/gunicorn/issues/1194

MikeTheCanuck commented 7 years ago

That issue 1194 you found is a fantastic resource Brian.

It sounds to me like ELB (and likely ALB as well) keeps a TCP connection open for a "long time" and ties up one or more workers, so that the app is less likely to be able to respond in a timely fashion to additional requests.

This gives us a few options:

Wish I could say we're solving the update issue, but eliminating this TIMEOUT will help everyone and will reduce the number of degrees of freedom we need to chase while resolving the missing updates.
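
For later reference, the interplay described in that gunicorn issue is roughly: the load balancer holds keep-alive TCP connections open (the ALB's idle timeout defaults to 60 seconds), a sync worker stuck servicing one of those connections can't handle anything else, and it eventually trips gunicorn's default 30-second worker timeout. Building on the config sketch earlier in this thread, the extra knob worth noting is gunicorn's keepalive setting (the value below is an untested assumption):

```python
# Addition to the earlier gunicorn_conf.py sketch (assumed value, not a tested setting):
# with an async worker class, hold keep-alive connections slightly longer than the
# ALB's default 60-second idle timeout so the load balancer, not gunicorn, closes them.
keepalive = 65
```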

MikeTheCanuck commented 7 years ago

Update: I've added the configuration to the Budget project that Brian committed to the Emerg project, and I've still encountered repeated CRITICAL WORKER TIMEOUT errors, this time on a local build of the Docker container fired up with start-proj.sh -l:

budget-service_1  | Running docker-entrypoint.sh...
budget-service_1  | [2017-03-30 21:56:46 +0000] [5] [INFO] Starting gunicorn 19.7.1
budget-service_1  | [2017-03-30 21:56:46 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
budget-service_1  | [2017-03-30 21:56:46 +0000] [5] [INFO] Using worker: eventlet
budget-service_1  | [2017-03-30 21:56:46 +0000] [8] [INFO] Booting worker with pid: 8
budget-service_1  | [2017-03-30 21:56:46 +0000] [9] [INFO] Booting worker with pid: 9
budget-service_1  | [2017-03-30 21:56:46 +0000] [10] [INFO] Booting worker with pid: 10
budget-service_1  | [2017-03-30 21:56:46 +0000] [11] [INFO] Booting worker with pid: 11
budget-service_1  | [2017-03-30 21:56:46 +0000] [12] [INFO] Booting worker with pid: 12
budget-service_1  | Not Found: /
budget-service_1  | Not Found: /favicon.ico
budget-service_1  | [2017-03-30 21:59:47 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:9)
budget-service_1  | [2017-03-30 21:59:47 +0000] [9] [INFO] Worker exiting (pid: 9)
budget-service_1  | [2017-03-30 21:59:48 +0000] [18] [INFO] Booting worker with pid: 18
budget-service_1  | [2017-03-30 22:01:23 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
budget-service_1  | [2017-03-30 22:01:23 +0000] [10] [INFO] Worker exiting (pid: 10)
budget-service_1  | [2017-03-30 22:01:24 +0000] [20] [INFO] Booting worker with pid: 20
budget-service_1  | [2017-03-30 22:03:00 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
budget-service_1  | [2017-03-30 22:03:00 +0000] [8] [INFO] Worker exiting (pid: 8)
budget-service_1  | [2017-03-30 22:03:01 +0000] [22] [INFO] Booting worker with pid: 22
budget-service_1  | [2017-03-30 22:04:54 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:18)
budget-service_1  | [2017-03-30 22:04:54 +0000] [18] [INFO] Worker exiting (pid: 18)

These are likely in response to the super-delayed requests I fired to the /ocrb/ and /history/ endpoints. In fact, I left the Chrome browser to retry when it received a non-200 HTTP response, and I watched as the TIMEOUTs continued to pile up on a regular cadence:

budget-service_1  | [2017-03-30 22:23:15 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
budget-service_1  | [2017-03-30 22:23:15 +0000] [12] [INFO] Worker exiting (pid: 12)
budget-service_1  | [2017-03-30 22:23:16 +0000] [26] [INFO] Booting worker with pid: 26
budget-service_1  | [2017-03-30 22:24:51 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:22)
budget-service_1  | [2017-03-30 22:24:51 +0000] [22] [INFO] Worker exiting (pid: 22)
budget-service_1  | [2017-03-30 22:24:52 +0000] [28] [INFO] Booting worker with pid: 28
budget-service_1  | [2017-03-30 22:26:30 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:24)
budget-service_1  | [2017-03-30 22:26:30 +0000] [24] [INFO] Worker exiting (pid: 24)
budget-service_1  | [2017-03-30 22:26:32 +0000] [30] [INFO] Booting worker with pid: 30

I am nearly convinced, after a number of tests and waiting periods, that the [CRITICAL] WORKER TIMEOUT only occurs in response to outside stimuli. There is no inherent behaviour in gunicorn or Django that would cause the workers to spontaneously time out; rather, the times we've observed this in the AWS environment are almost certainly a direct result of the regular requests issued by the ALB to test whether the container is "healthy".

This means I can stop worrying about the internal behaviour of a not-nearly-as-complex-as-I-thought app framework, and concentrate on the deterministic, if numerous, behaviours that we control.

MikeTheCanuck commented 7 years ago

I've applied the "gevent" patch to the Budget backend container and deployed it via PR 94 and Travis build 263.

Now, instead of single TIMEOUT errors one at a time, we get a whole batch of them at once (according to the CloudWatch logs):

Running docker-entrypoint.sh...
[2017-03-30 23:38:37 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2017-03-30 23:38:37 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2017-03-30 23:38:37 +0000] [5] [INFO] Using worker: eventlet
[2017-03-30 23:38:38 +0000] [8] [INFO] Booting worker with pid: 8
[2017-03-30 23:38:40 +0000] [9] [INFO] Booting worker with pid: 9
[2017-03-30 23:38:41 +0000] [11] [INFO] Booting worker with pid: 11
[2017-03-30 23:38:42 +0000] [12] [INFO] Booting worker with pid: 12
[2017-03-30 23:38:42 +0000] [10] [INFO] Booting worker with pid: 10
[2017-03-30 23:40:08 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:9)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:11)
[2017-03-30 23:40:12 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-03-30 23:40:13 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-03-30 23:40:13 +0000] [10] [INFO] Worker exiting (pid: 10)
[2017-03-30 23:40:16 +0000] [16] [INFO] Booting worker with pid: 16
[2017-03-30 23:40:16 +0000] [17] [INFO] Booting worker with pid: 17
[2017-03-30 23:40:18 +0000] [18] [INFO] Booting worker with pid: 18
[2017-03-30 23:40:20 +0000] [19] [INFO] Booting worker with pid: 19
[2017-03-30 23:40:23 +0000] [20] [INFO] Booting worker with pid: 20

Progress?

MikeTheCanuck commented 7 years ago

The use of "eventlet" worker class didn't fix the issue for Budget - it just piled up [CRITICAL] WORKER TIMEOUT errors all in a row - but it seemed to eliminate the CWT's for Emergency Response.

Further, I finally stumbled into the [gunicorn FAQ](http://docs.gunicorn.org/en/latest/faq.html) and spotted this tidbit:

The default Sync workers are designed to run behind Nginx which only uses HTTP/1.0 with its upstream servers. If you want to deploy Gunicorn to handle unbuffered requests (ie, serving requests directly from the internet) you should use one of the async workers.

Looks like Nginx and/or an async worker class such as "eventlet" is definitely recommended here.

And BTW, their design doc recommends we only need 3 workers in our current setup.
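
(For the record, the design doc's heuristic is (2 x num_cores) + 1 workers; assuming our ECS tasks get a single vCPU, that works out to 3, as in this one-liner:)

```python
# Gunicorn design-doc heuristic for worker count: (2 x cores) + 1.
# Assumes one vCPU per ECS task, which yields the 3 workers mentioned above.
import multiprocessing

workers = (2 * multiprocessing.cpu_count()) + 1
```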

MikeTheCanuck commented 7 years ago

New hypothesis: Django startup is pulling in all the data from the database, creating unbearable latency for the ALB Health Check such that it never schedules the new containers for active duty. See Budget issue 96 for details.
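
For anyone following along, the kind of pattern this hypothesis points at is a query that executes at import time (i.e. while gunicorn is still booting the worker) rather than per request. A purely hypothetical illustration, not actual Budget code:

```python
# views.py -- hypothetical illustration of the startup-latency hypothesis;
# the model, fields, and view name below are made up for this sketch.
from django.http import JsonResponse

from .models import BudgetItem  # hypothetical model

# Anti-pattern: list() evaluates the queryset when the module is imported,
# i.e. during worker boot, long before the ALB health check can get a response.
# ALL_ROWS = list(BudgetItem.objects.all())

def ocrb(request):
    # Lazy alternative: the query runs per request, inside the normal
    # request/response cycle where gunicorn's worker timeout applies sensibly.
    rows = BudgetItem.objects.values("fiscal_year", "amount")
    return JsonResponse(list(rows), safe=False)
```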

MikeTheCanuck commented 7 years ago

Per #47, I tried Brian's tests with no further improvement:

I tried the same --verbose and --timeout 300 settings in the ecs-deploy.sh script run, and I got the same "service running" result at the end of the Travis log:

+echo 'Service updated successfully, new task definition running.'
Service updated successfully, new task definition running.
+[[ 0 -gt 0 ]]
+exit 0

And yet I had no luck actually making it past the ALB Health Check for those deployed container instances:

service hacko-integration-BudgetService-16MVULLFXXIDZ-Service-1BKKDDHBU8RU4 (instance i-04e4ff1f307addd28) (port 49740) is unhealthy in target-group hacko-Targe-OPDM70EA36WQ due to (reason Request timed out)

And looking at the associated CloudWatch logs, this is the output we get from inside the container instance:

Running docker-entrypoint.sh...
[2017-04-03 00:08:02 +0000] [5] [INFO] Starting gunicorn 19.7.1
[2017-04-03 00:08:02 +0000] [5] [INFO] Listening at: http://0.0.0.0:8000 (5)
[2017-04-03 00:08:02 +0000] [5] [INFO] Using worker: sync
[2017-04-03 00:08:02 +0000] [8] [INFO] Booting worker with pid: 8
[2017-04-03 00:08:32 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:8)
[2017-04-03 00:08:32 +0000] [8] [INFO] Worker exiting (pid: 8)
[2017-04-03 00:08:35 +0000] [10] [INFO] Booting worker with pid: 10
[2017-04-03 00:09:05 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:10)
[2017-04-03 00:09:05 +0000] [10] [INFO] Worker exiting (pid: 10)
[2017-04-03 00:09:06 +0000] [12] [INFO] Booting worker with pid: 12
[2017-04-03 00:09:37 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:12)
[2017-04-03 00:09:40 +0000] [14] [INFO] Booting worker with pid: 14
[2017-04-03 00:10:10 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:14)
[2017-04-03 00:10:10 +0000] [14] [INFO] Worker exiting (pid: 14)
[2017-04-03 00:10:11 +0000] [16] [INFO] Booting worker with pid: 16
[2017-04-03 00:10:42 +0000] [5] [CRITICAL] WORKER TIMEOUT (pid:16)
[2017-04-03 00:10:42 +0000] [16] [INFO] Worker exiting (pid: 16)
[2017-04-03 00:10:43 +0000] [18] [INFO] Booting worker with pid: 18
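
For correlating the ALB's view with these container logs, one low-risk diagnostic is to dump the target group's health-check settings (path, interval, timeout, thresholds) and compare them against gunicorn's 30-second worker timeout. A sketch using boto3; the target-group name is taken from the error message above, everything else is generic:

```python
# check_health_settings.py (sketch) -- read-only look at the ALB target group's
# health-check configuration, to compare against gunicorn's 30s worker timeout.
import boto3

elbv2 = boto3.client("elbv2")
resp = elbv2.describe_target_groups(Names=["hacko-Targe-OPDM70EA36WQ"])

for tg in resp["TargetGroups"]:
    print(tg["TargetGroupName"])
    print("  path:     ", tg.get("HealthCheckPath"))
    print("  interval: ", tg["HealthCheckIntervalSeconds"], "s")
    print("  timeout:  ", tg["HealthCheckTimeoutSeconds"], "s")
    print("  unhealthy after", tg["UnhealthyThresholdCount"], "failed checks")
```
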
joaoferrao commented 6 years ago

I see this is closed, but did any of the participants have any luck overcoming the timeout issue? I'm facing this very same problem with ECS and ALB, with a container running Django+Gunicorn as well as another container running Nginx.

gdw2 commented 6 years ago

It's still open @tekneee

MikeTheCanuck commented 6 years ago

The issue on the gunicorn repo continues to cough up promising ideas and findings - the most recent couple of comments were especially tantalizing: https://github.com/benoitc/gunicorn/issues/1194#issuecomment-378903667

We haven't had a need to dig into this since our existing containers are especially under-utilized, and we haven't fielded new ones yet this "season" at Hack Oregon, but I'm keeping my eye on this gunicorn issue in case we need to revisit.

MikeTheCanuck commented 6 years ago

Spotted this issue again tonight on a new container - bottom line: allocating more memory to the container coincided with (and was probably the cause of) these errors going away in the ECS/CloudWatch logs. See https://github.com/hackoregon/civic-devops/issues/157 for details.