Closed: MikeTheCanuck closed this issue 4 years ago.
@BrianHGrant reported the following, after which there were no further ECS kill/start cycles for the task:
FYI, I just retriggered a build for housing, as there were deploys after the gevent fix was merged to our docker master.
I was just able to get this endpoint: https://service.civicpdx.org/housing2019/v1/api/ncdbsamplechanges/
and it appears the hmdaorwa endpoint is now giving a 504 Gateway Timeout rather than causing the container to restart.
@nickembrey reports the following about the table backing the /hmdaorwa/ endpoint:
That endpoint is slow because the table it queries is massive... we decided not to use the endpoint in production a couple of months ago, and instead we're using different aggregate tables built from that dataset, so it seems somewhat unlikely that anyone is calling it or that it could be the root of whatever intermittent problems we're having.
Brian observed that:
it appears that endpoint was taking longer than the previous 30-second timeout but less than the new 60-second timeout
So it appears that the CRITICAL WORKER TIMEOUT is self-induced: we configure a 30-second timeout, and the engine is behaving exactly as instructed by throwing a timeout error and restarting the worker.
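For reference, here is a minimal sketch of the relevant knob, assuming the container runs the app under gunicorn (the server whose master process logs CRITICAL WORKER TIMEOUT). The actual settings live in the 2019-backend-docker baseline image and may be passed on the command line rather than in a config file:

```python
# gunicorn.conf.py -- hypothetical sketch, not the project's actual config.

# `timeout` is how many seconds a worker may go without responding to the
# master before the master kills it (logging "CRITICAL WORKER TIMEOUT") and
# starts a replacement. With the old value of 30, the slow /hmdaorwa/ query
# tripped that path; PR #21 appears to have raised the ceiling so long
# requests surface as a gateway 504 instead of a container restart.
timeout = 60

# The gevent worker class mentioned in the earlier fix.
worker_class = "gevent"
workers = 2
bind = "0.0.0.0:8000"
```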
Hmmm ... maybe the code is still referencing the table even though the endpoint is unused. Can you safely comment out the references to that table in models.py, urls.py, etc.?
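As a rough illustration of what that could look like, assuming the housing API wires its endpoints through a DRF router in urls.py (the module path and viewset names below are hypothetical, not the project's actual ones):

```python
# urls.py -- hypothetical sketch; real module and viewset names may differ.
from django.urls import include, path
from rest_framework import routers

from api import views  # hypothetical import path

router = routers.DefaultRouter()
router.register(r"ncdbsamplechanges", views.NcdbSampleChangesViewSet)
# Commenting out the registration removes the /hmdaorwa/ route entirely,
# so nothing can reach the massive backing table through the API:
# router.register(r"hmdaorwa", views.HmdaOrWaViewSet)

urlpatterns = [
    path("api/", include(router.urls)),  # prefix depends on project routing
]
```

Commenting out the route should be enough to stop traffic to the table; the model definition itself doesn't issue queries unless a view or serializer uses it.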
Brian made the following fix to the baseline container, and all requests since then have timed out safely (i.e. without causing ECS to consider the container unhealthy): https://github.com/hackoregon/2019-backend-docker/pull/21
When requesting http://service.civicpdx.org/housing2019/v1/api/hmdaorwa/, the browser receives a "502 Bad Gateway" response every time; then, for a few minutes, any request to any http://service.civicpdx.org/housing2019/v1/api/ endpoint receives a "temporarily unavailable" response.
CloudWatch recorded the following around the time of the /hmdaorwa/ request:
In previous investigations we've seen this problem when insufficient memory was available to the Django app. Reviewing the ECS service metrics, I see no memory spike to 100%, but I do see a momentary memory drop, implying that the task was killed and recreated.
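For anyone repeating this check from a script rather than the console, here is a minimal boto3 sketch, assuming standard AWS credentials; the cluster and service names below are placeholders, not the real ones:

```python
# Hypothetical sketch: pull the ECS memory metric and recent service events.
from datetime import datetime, timedelta, timezone

import boto3

CLUSTER = "example-cluster"    # placeholder
SERVICE = "housing2019-service"  # placeholder

end = datetime.now(timezone.utc)
start = end - timedelta(hours=2)

# Service-level memory utilization at one-minute resolution.
cw = boto3.client("cloudwatch")
stats = cw.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": CLUSTER},
        {"Name": "ServiceName", "Value": SERVICE},
    ],
    StartTime=start,
    EndTime=end,
    Period=60,
    Statistics=["Average", "Maximum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])

# Recent ECS service events; task stopped/started messages show up here.
ecs = boto3.client("ecs")
service = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])["services"][0]
for event in service["events"][:10]:
    print(event["createdAt"], event["message"])
```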
The task being killed and recreated is corroborated by the ECS events log: