hackoregon / transportation-backend

Transportation backend repo - 2016 and 2017
8 stars 7 forks source link

502 Bad Gateway Error at http://service.civicpdx.org/transport/features/ (timeout issue) #47

Open MikeTheCanuck opened 7 years ago

MikeTheCanuck commented 7 years ago

In the AWS environment at least, the response from /transport/features endpoint is too slow - the ALB load balancer kills the request after 30 seconds.

When running this request via a local Docker container (configured to query the AWS-hosted transportation database), this request times out after ~35 seconds and the browser reports:

This page isn’t working

192.168.99.100 didn’t send any data.
ERR_EMPTY_RESPONSE
znmeb commented 6 years ago

@MikeTheCanuck Is this still open?

MikeTheCanuck commented 6 years ago

A variant of this problem appears to still exist... now I'm getting a 504 timeout error (I just tried the /features route now). I suspect we may still be suffering some performance-related issue.

znmeb commented 6 years ago

Blergh - "sysstat" is installed. Do you know if it's collecting logs?

MikeTheCanuck commented 6 years ago

Redeployed the container at the updated database location. Testing out the endpoints, both /features/ and /nearby/ continue to have trouble:

/features/

Looking in the CloudWatch logs, I see the following log entries corresponding to the GET /features/ request:

[2018-06-03 16:24:32 +0000] [29] [DEBUG] GET /transport/features/
featurecount 18966
[2018-06-03 16:25:02 +0000] [25] [CRITICAL] WORKER TIMEOUT (pid:29)

This tells me the thread (pid:29) working on that request timed out whatever internal work it was doing.

Compare that to a request the /conflicts/ endpoint (which generally returns data within a few seconds):

[2018-06-03 16:25:13 +0000] [43] [DEBUG] GET /transport/conflicts/
featurecount 261

/nearby/

Sadly, the only info logged in Cloudwatch for the immediate 500 response from /nearby/ is this:

[2018-06-03 16:24:11 +0000] [28] [DEBUG] GET /transport/nearby/

Related (?) exception

On another container instance, we're only seeing repeat after repeat of the following log spew:

Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/django/contrib/gis/geos/geometry.py", line 129, in __del__
capi.destroy_geom(self._ptr)
File "/usr/local/lib/python3.5/site-packages/django/contrib/gis/geos/libgeos.py", line 157, in __call__
return self.func(*args, **kwargs)
File "/usr/local/lib/python3.5/site-packages/django/contrib/gis/geos/prototypes/threadsafe.py", line 52, in __call__
if not self.thread_context.handle:
File "/usr/local/lib/python3.5/site-packages/gevent/local.py", line 256, in __getattribute__
with _patch(self):
File "/usr/local/lib/python3.5/contextlib.py", line 59, in __enter__
return next(self.gen)
File "/usr/local/lib/python3.5/site-packages/gevent/local.py", line 228, in _patch
with impl.locallock:
File "/usr/local/lib/python3.5/site-packages/gevent/lock.py", line 230, in __enter__
return self.acquire()
File "/usr/local/lib/python3.5/site-packages/gevent/lock.py", line 223, in acquire
rc = self._block.acquire(blocking)
File "src/gevent/_semaphore.py", line 198, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4541)
File "src/gevent/_semaphore.py", line 226, in gevent._semaphore.Semaphore.acquire (src/gevent/gevent._semaphore.c:4367)
File "src/gevent/_semaphore.py", line 166, in gevent._semaphore.Semaphore._do_wait (src/gevent/gevent._semaphore.c:3562)
File "/usr/local/lib/python3.5/site-packages/gevent/hub.py", line 630, in switch
return RawGreenlet.switch(self)
gevent.hub.LoopExit: ('This operation would block forever', <Hub at 0x7f64c9f62cc0 epoll default pending=0 ref=0 fileno=8 resolver=<gevent.resolver_thread.Resolver at 0x7f64c9eeb198 pool=<ThreadPool at 0x7f64ca840f60 0/1/10>> threadpool=<ThreadPool at 0x7f64ca840f60 0/1/10>>)

No idea so far if these are related or just coincidental, but the fact that gevent is suffering while gis-related functions are running (and perhaps ones that aren't thread-safe, but perhaps are just having generalized trouble) is at least noteworthy.