hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License
11 stars 4 forks source link

Housing 2019 API endpoint throws [CRITICAL] WORKER TIMEOUT every time it's requested #271

Closed MikeTheCanuck closed 4 years ago

MikeTheCanuck commented 4 years ago

When requesting http://service.civicpdx.org/housing2019/v1/api/hmdaorwa/, the browser receives "502 Bad Gateway" response every time, then for a few minutes any request to any http://service.civicpdx.org/housing2019/v1/api/ endpoint receives "temporarily unavailable" response.

Cloudwatch recorded the following around the /hmdaorwa/ request:

10.180.19.255 [07/Sep/2019:17:51:42 +0000] GET /housing2019/v1/schema/ HTTP/1.1 200 68172 - ELB-HealthChecker/2.0 0.020789
10.180.9.27 [07/Sep/2019:17:51:43 +0000] GET /housing2019/v1/api/hmdaorwa HTTP/1.1 301 0 - Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:68.0) Gecko/20100101 Firefox/68.0 0.001224
[2019-09-07 17:52:14 +0000] [52] [CRITICAL] WORKER TIMEOUT (pid:55)
[2019-09-07 17:52:15 +0000] [58] [INFO] Booting worker with pid: 58
10.180.9.27 [07/Sep/2019:17:52:18 +0000] GET /housing2019/v1/schema/ HTTP/1.1 200 68172 - ELB-HealthChecker/2.0 0.091802

In previous investigations we've seen this problem due to insufficient memory available to the Django app. I see no memory spike to 100% when reviewing the ECS service Metrics, but I do see a momentary memory drop implying that the task got killed and recreated.

This is corroborated by the ECS events log:

2019-09-07 10:53:58 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W has reached a steady state.
2019-09-07 10:53:28 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W registered 1 targets in target-group hacko-Targe-1R5TLJX7STB8Q
2019-09-07 10:52:25 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W has started 1 tasks: task 8c763f1c-99c3-4dfe-8dac-766ff2529730.
2019-09-07 10:52:14 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W has stopped 1 running tasks: task 6468e1ec-1941-49a7-942f-f98c3dea6d60.
2019-09-07 10:52:14 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W deregistered 1 targets in target-group hacko-Targe-1R5TLJX7STB8Q
2019-09-07 10:52:14 -0700
service hacko-integration-2019Housing-79U03861DR85-Service-754D64X198W (port 8000) is unhealthy in target-group hacko-Targe-1R5TLJX7STB8Q due to (reason Request timed out).
MikeTheCanuck commented 4 years ago

@BrianHGrant reported the following that led to no further ECS kill/start cycles for the task:

fyi, i just retriggered a build for housing, as there were deploys after the gevent fix was merged to our docker master

was able to just get this endpoint: https://service.civicpdx.org/housing2019/v1/api/ncdbsamplechanges/

and it appears the hmdaorwa is now giving a 504 gateway timeout vs. causing the container to restart

MikeTheCanuck commented 4 years ago

@nickembrey reports for the table backing the /hmdaorwa/ endpoint that:

That endpoint is slow because the table it queries is m a s s i v e... we ended up deciding not to use the endpoint in production a couple months ago and instead we're using different aggregate tables that are built of that dataset, so it seems somewhat unlikely that anyone is calling it / that it could be the root of whatever intermittent problems we're having.

Brian observed that

it appears that endpoint was returning > previous 30 second timeout but < new 60 second timing

So it appears that the CRITICAL WORKER TIMEOUT is self-induced - we configure a 30-second timeout, and the engine is behaving exactly as instructed by throwing a timeout error and restarting the worker.

znmeb commented 4 years ago

Hmmm ... maybe the code is still referencing the table even if the endpoint is unused. Can you safely comment out the references to that table in "models.py", "urls.py", etc?

MikeTheCanuck commented 4 years ago

Brian made the following fix to the baseline container and all requests since then time out safely (i.e. without causing ECS to consider the container unhealthy): https://github.com/hackoregon/2019-backend-docker/pull/21