hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Service Request: `/housing2019/v1/api/hmdaorwa/` returning 504 error #272

Closed: bhgrant8 closed this issue 4 years ago

bhgrant8 commented 5 years ago

TEAM NAME: Housing 2019
PRIORITY (1-5): 4

DO NOT INCLUDE ANY SECRETS IN THIS REQUEST. IT IS PUBLICLY ACCESSIBLE

Description of issue

Attempting to access this endpoint without any query parameters set returns a 504 timeout error.

URL: https://service.civicpdx.org/housing2019/v1/api/hmdaorwa/

Error Message/Logs

Here is the error message returned via the Swagger UI:

<html>
<head><title>504 Gateway Time-out</title></head>
<body bgcolor="white">
<center><h1>504 Gateway Time-out</h1></center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I also looked in the CloudWatch logs for this request, and interestingly a 200 response is logged there, with what looks like a ~122-second request duration in the final field:

10.180.9.27 [07/Sep/2019:22:23:19 +0000] GET /housing2019/v1/api/hmdaorwa/ HTTP/1.1 200 10757 - curl/7.54.0 122.341009

Reproduction Steps

  1. I made a request to https://service.civicpdx.org/housing2019/v1/api/hmdaorwa/
  2. I waited for a response

Expected: a valid 2xx JSON response with data.
Actual: the 504 HTML error.

Code Snippets

Here is an example cURL request:

curl -X GET "https://service.civicpdx.org/housing2019/v1/api/hmdaorwa/" -H "accept: application/json" 

Screenshots/GIF

Here is the Swagger UI:

Screen Shot 2019-09-07 at 3 16 49 PM

I also attempted it via the terminal:

Screen Shot 2019-09-07 at 3 17 37 PM

Priority/Impact

After speaking with @nickembrey, it appears this endpoint is not used in production, so a fix will have minimal impact and is low priority. I am opening the issue anyway, as it may be a good test case for a longer-run investigation of service performance.

bhgrant8 commented 5 years ago

Running this on my development machine, I was able to get a successful 200 response:

Screen Shot 2019-09-07 at 4 33 30 PM

Though it took ~2 minutes for the response to arrive:

Screen Shot 2019-09-07 at 4 08 45 PM

vs. the production server, where we see the expected timeout after 1 minute of waiting (TTFB, time to first byte):

Screen Shot 2019-09-07 at 4 35 14 PM

which then results in the 504.

Now, interestingly, the logged response size is roughly the same as in the CloudWatch logs:

api_1  | 172.20.0.1 [07/Sep/2019:23:06:20 +0000] GET /housing2019/v1/api/hmdaorwa/  HTTP/1.1 200 10750 http://localhost:8000/housing2019/v1/schema/ Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36 122.777203

10750 vs. 10757 bytes, which leads me to assume the backend database is returning the full response, but the Django server is taking too long to process/serialize it, so the request times out at the load balancer.
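As a rough way to confirm that split, something like the following could be run in the Django shell inside the dev container to time the database query and the DRF serialization separately. The model and serializer names here (`HmdaOrWa`, `HmdaOrWaSerializer`) are hypothetical placeholders, not the actual names in the housing-2019 project:

```python
# Sketch for `python manage.py shell`: separate DB time from serialization time.
# Model/serializer names are hypothetical stand-ins for whatever backs /api/hmdaorwa/.
import time

from api.models import HmdaOrWa                  # hypothetical model
from api.serializers import HmdaOrWaSerializer   # hypothetical serializer

start = time.time()
rows = list(HmdaOrWa.objects.all())              # force the DB query to execute
db_seconds = time.time() - start

start = time.time()
data = HmdaOrWaSerializer(rows, many=True).data  # force DRF serialization
serialize_seconds = time.time() - start

print(f"query: {db_seconds:.1f}s  serialization: {serialize_seconds:.1f}s")
```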

bhgrant8 commented 5 years ago

Continuing to chase this down,

I noticed the `server` response header on the failed request was as follows:

 server: awselb/2.0 

which indicates that the 504 is being returned at the AWS load balancer level, vs. on a successful connection:

server: gunicorn/19.9.0 

Given this, and the fact that the current timeout settings produce a successful call locally, it appears the drop is happening in the connection between the backend server and the ELB.
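For reference, a quick way to check which layer answered is to print the status code and `Server` header; this sketch assumes the `requests` library, and the long read timeout is only so the client itself doesn't give up first:

```python
# Print the status code and Server header to see whether the ELB (awselb)
# or the backend (gunicorn) produced the response.
import requests

url = "https://service.civicpdx.org/housing2019/v1/api/hmdaorwa/"
resp = requests.get(url, headers={"accept": "application/json"}, timeout=180)
print(resp.status_code, resp.headers.get("Server"))
```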

As per AWS Docs:

https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/ts-elb-error-message.html#ts-elb-errorcodes-http504

This may be affected by the current keepalive settings in the backend.

We are not currently configuring this directly, so we are accepting the Gunicorn default of 2 seconds. However, the Gunicorn docs themselves state that when running behind a load balancer you will want to set it higher:

http://docs.gunicorn.org/en/stable/settings.html#keepalive

Going to attempt to set it to 75, as per https://serverfault.com/questions/782022/keepalive-setting-for-gunicorn-behind-elb-without-nginx,

and check whether we still get a 504 (going to connect the housing project to the dev Docker setup for a canary test).
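A minimal sketch of what that change could look like, assuming Gunicorn is started with a Python config file (the file name is hypothetical, and 75 is the value from the serverfault answer, chosen to exceed the ELB's default 60-second idle timeout):

```python
# gunicorn_config.py -- sketch only, not the project's actual config.
# keepalive controls how long Gunicorn holds an idle keep-alive connection open;
# behind an ELB it should exceed the ELB idle timeout (60s by default) so the
# load balancer, not the backend, is the side that closes idle connections.
keepalive = 75  # seconds; Gunicorn's default is 2
```

This would be picked up with something like `gunicorn -c gunicorn_config.py <wsgi module>`, or equivalently via `--keep-alive 75` on the command line.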

bhgrant8 commented 4 years ago

At this time, my feeling is that the indexing of the related tables in the underlying Postgres database is causing the slowdown in the API response, since there are two API calls:

  1. a call to return the full index
  2. a call to pull all of the data

Since we are not using this endpoint in production, and are using individual lookups instead, providing the index is the more useful implementation at this time.

Going to close this out as wont-fix for now, but let's reopen if needed.