hackoregon / devops-17

deployment tools for Hack Oregon projects

Reduce latency between EC2-hosted database and Django containers #50

Open MikeTheCanuck opened 7 years ago

MikeTheCanuck commented 7 years ago

Summary

Backend Django API containers deployed to ECS are routinely and rapidly deemed "unhealthy" by the ALB and replaced with a new container, which also doesn't work, ad infinitum.

Details

Generally speaking, the backend Django-hosting containers are not a healthy lot. While some will respond to HTTP requests (either to the Swagger root or to the API endpoints themselves), nearly all of them are in some state of disrepair/inability to service client requests consistently.

Potential Issue: database latency

Requests to the Budget database are incredibly slow for non-trivial endpoints, even when running via a local container and talking to the EC2-hosted PostgreSQL.

Oddly, parameterized (i.e. filtered) requests to these endpoints receive super-quick responses.

In the ECS environment, the containers aren't faring any better. In ECS, however, the database is effectively "across the Internet": the container app is configured to look for the DB on its external IP address, which loses all the benefits of hosting both app and DB in the same AWS region.

Hell, submitting this request (/budget/history/?fiscal_year=2015-16) via the ECS container still 502'd, but when submitted through a local container it responded after ~10 seconds.
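To put numbers on the comparison above, here is a minimal timing sketch using only the Python standard library. The function name `time_request` and its `opener` parameter are made up for illustration; nothing here comes from the project's code.

```python
import time
import urllib.error
import urllib.request


def time_request(url, timeout=30, opener=urllib.request.urlopen):
    """Return (status, seconds) for a single GET request.

    status is the HTTP status code, or None if the request timed out
    or the connection failed before any response arrived.
    """
    start = time.perf_counter()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()  # include body transfer in the measurement
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # e.g. a 502 returned by the load balancer
    except OSError:
        status = None  # timeout, refused connection, DNS failure
    return status, time.perf_counter() - start
```

Calling it once against the local container and once against the ECS endpoint with the same path (for example /budget/history/?fiscal_year=2015-16) would give a like-for-like pair of timings; the base URLs are whatever each environment exposes.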

Possible fixes (discussed in #49)

  1. Move to RDS
  2. Route from app to DB via private IP addresses in a single VPC
  3. Host the PostgreSQL database in an adjacent container

If we had any experience with it to date, the "right" (though likely more costly) answer would be to start with (1) for as many projects as can tolerate it. Because we have no experience with an RDS deployment, we're in danger of sinking days or weeks into figuring that deployment model out, when we have so many other critical tasks between now and Demo Day.

In the absence of (1), (2) sounds like the next-best option (though it adds more complexity to the branching setup we already have), and (3) seems least-good but might be our last resort.
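For option (2), the Django side of the change could be as small as reading the database host from the environment, so the ECS task definition can supply the EC2 instance's private IP while local containers keep using whatever address they use today. This is only a sketch: the environment variable names, the defaults, and the `database_settings` helper are all hypothetical and not taken from the project's settings.

```python
import os


def database_settings(environ=os.environ):
    """Build a Django DATABASES dict from environment variables.

    Every variable name and default below is a placeholder.
    """
    return {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": environ.get("POSTGRES_NAME", "budget"),
            "USER": environ.get("POSTGRES_USER", "postgres"),
            "PASSWORD": environ.get("POSTGRES_PASSWORD", ""),
            # Inside the VPC, set this to the database instance's private
            # IP (or private DNS name) rather than its external address.
            "HOST": environ.get("POSTGRES_HOST", "localhost"),
            "PORT": environ.get("POSTGRES_PORT", "5432"),
            # Passed through to the PostgreSQL driver: give up on a
            # connection attempt after 5 seconds instead of hanging.
            "OPTIONS": {"connect_timeout": 5},
        }
    }


# In settings.py this would be assigned at module level.
DATABASES = database_settings()
```

The private address only works if the ECS container instances and the database instance sit in the same VPC and the database's security group allows port 5432 from them.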

MikeTheCanuck commented 7 years ago

Idea: dig into psycopg2, thread safety, "library-friendly lock"

Interesting information: in this gunicorn bug report I spotted the following about the psycopg2 adapter, and I wonder if it is related:

Following your pointer, I had a look at the psycopg2 adapter - which we use to connect our Django app to Postgres - and discovered this section of the documentation which states:

Warning: Psycopg connections are not green thread safe and can’t be used concurrently by different green threads. Trying to execute more than one command at time using one cursor per thread will result in an error (or a deadlock on versions before 2.4.2).

Therefore, programmers are advised to either avoid sharing connections between coroutines or to use a library-friendly lock to synchronize shared connections, e.g. for pooling.

In other words - psycopg2 doesn't like green threads. Based on the behaviour we encountered, I would guess that this is the source of the error. The suggested way to deal with this issue, according to the psycopg docs, is to use a library which enables psycopg support for coroutines.

The recommended library is psycogreen.

I don't know squat about "green threads", so I'm hoping one of you fine folks recognizes whether this is relevant.
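If it does turn out to be relevant, the psycogreen route might look roughly like the gunicorn config file below. This is a sketch under a big assumption: that the containers run gunicorn with the gevent worker class, which this thread has not confirmed. The file name and worker count are placeholders; `post_fork` is gunicorn's server hook and `patch_psycopg` is psycogreen's gevent entry point.

```python
# gunicorn.conf.py (hypothetical file name and values)

# Assumption: workers use gevent green threads.
worker_class = "gevent"
workers = 2


def post_fork(server, worker):
    """Gunicorn server hook, run in each worker right after it is forked."""
    # Imported inside the hook so this config module still loads on
    # machines where psycogreen is not installed.
    from psycogreen.gevent import patch_psycopg

    # Make psycopg2 yield to other green threads while waiting on Postgres.
    patch_psycopg()
    worker.log.info("psycopg2 patched for gevent via psycogreen")
```

If the containers use gunicorn's default sync workers, no green threads are involved and this patch would not change anything.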

MikeTheCanuck commented 7 years ago

Idea: reduce the ALB Health Check timeout

This comment about a conceptually-similar timeout in Heroku makes this approach seem very promising.
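For reference, the health check knobs live on the ALB target group. A sketch of changing them with boto3 follows; the helper names and default values are hypothetical, while the keyword names are the ones `modify_target_group` accepts. The check that the timeout is shorter than the interval reflects my understanding of the ALB constraint, so treat it as an assumption.

```python
def health_check_settings(target_group_arn, path="/", interval=30, timeout=5,
                          healthy=2, unhealthy=5):
    """Build keyword arguments for the elbv2 modify_target_group call."""
    # Assumption: the ALB rejects a timeout that is not below the interval.
    if timeout >= interval:
        raise ValueError("health check timeout must be shorter than the interval")
    return {
        "TargetGroupArn": target_group_arn,
        "HealthCheckPath": path,
        "HealthCheckIntervalSeconds": interval,
        "HealthCheckTimeoutSeconds": timeout,
        "HealthyThresholdCount": healthy,
        "UnhealthyThresholdCount": unhealthy,
    }


def apply_health_check(settings):
    """Send the settings to AWS; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    return boto3.client("elbv2").modify_target_group(**settings)
```

Keeping the settings in a plain dict makes it easy to review the values before anything is sent to AWS.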

MikeTheCanuck commented 7 years ago

Idea: investigate the use of uWSGI

This comment is one anecdote to give us hope?