hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

Answer specific questions about traffic and system load across our sites #196

Open MikeTheCanuck opened 6 years ago

MikeTheCanuck commented 6 years ago

There are a number of questions we need to answer in order to adjust our infrastructure and ensure that - come the day we start receiving non-trivial traffic - the systems provide an adequate experience (reasonable response times, not a lot of queued requests).

Asking questions about "how much traffic" and "what kinds of performance" we are seeing can often lead to a bloated effort to swallow, digest and synthesize All The Data.

I am not interested in falling into that black hole.

What I'm interested in is identifying pieces of our architecture that need an upgrade.

So here's what I'm most immediately worried about:

  1. disk space consumption on our PostgreSQL server's /data or root volumes, as developers upload more data - we've experienced multiple occasions when the database disk has been filled so completely that incoming requests have no place to offload in-memory data to service particularly complex requests.
  2. sustained CPU or memory consumption on the PostgreSQL server - acute, temporary loads are a fact of life no matter how big your server, but sustained loads indicate it's time to upsize or scale out
    • memory exhaustion for individual containers (Tasks) - some containers just took more memory than we originally allocated to them. Some would immediately consume more than was available to them, and ECS/ALB would never schedule them into service because they weren't responding to health check requests; others would survive the initial validation, but would later run out of memory and start throwing 5xx/4xx errors to outside requests. It took one of us proactively, manually investigating to find out where the problem was - until then, the APIs were either out of date or mostly/completely out of commission.
  3. container deploys that get trapped in a loop. During the development run-up in May of 2018 (and similarly during the 2017 season), we experienced a number of occasions when badly-configured containers were being unsuccessfully deployed - ECS would launch a new container instance, which would either fail to start correctly or never become healthy enough to be scheduled into service. These containers would be destroyed and recreated over and over (DOA containers cycled multiple times a minute, merely-unhealthy ones many times per hour) until we intervened, and this would exhaust the unmounted storage volume on our EC2 hosts #149, as well as frustrate developers who saw Travis deploy successfully while nothing changed in their online APIs.
  4. database connection exhaustion - there are many ways for us to manage the connection limits to the database, and many ways to configure the APIs and/or the database server itself, that lead/led to completely-consumed connection limits #177, #178. (Rough monitoring sketches for several of these items follow this list.)
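
For items 1 and 2, a minimal sketch of CloudWatch alarms, assuming the PostgreSQL server is an EC2 instance and the CloudWatch agent is publishing disk metrics to the CWAgent namespace (the instance ID, region, thresholds, and the /data path are placeholders, not our actual values):

```python
import boto3

# Placeholder instance ID and region - substitute the real PostgreSQL host values.
PG_INSTANCE_ID = "i-0123456789abcdef0"
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# Item 2: sustained CPU. CPUUtilization is a built-in EC2 metric.
cloudwatch.put_metric_alarm(
    AlarmName="pg-sustained-cpu",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": PG_INSTANCE_ID}],
    Statistic="Average",
    Period=300,                # 5-minute samples...
    EvaluationPeriods=6,       # ...sustained for 30 minutes
    Threshold=80.0,
    ComparisonOperator="GreaterThanThreshold",
    ActionsEnabled=False,      # wire up an SNS topic for notifications later
)

# Item 1: disk usage on the /data volume. disk_used_percent only exists if the
# CloudWatch agent is installed on the host; the exact dimensions depend on its config.
cloudwatch.put_metric_alarm(
    AlarmName="pg-data-volume-filling",
    Namespace="CWAgent",
    MetricName="disk_used_percent",
    Dimensions=[
        {"Name": "InstanceId", "Value": PG_INSTANCE_ID},
        {"Name": "path", "Value": "/data"},
    ],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=85.0,
    ComparisonOperator="GreaterThanThreshold",
    ActionsEnabled=False,
)
```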
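
For item 3, a rough sketch of spotting a crash-looping deploy by scanning recent ECS service events for repeated task-start messages (the cluster and service names are made up):

```python
import boto3

ecs = boto3.client("ecs", region_name="us-west-2")

# Placeholder names - substitute the real cluster/service.
CLUSTER = "hacko-integration"
SERVICE = "example-api-service"

resp = ecs.describe_services(cluster=CLUSTER, services=[SERVICE])
service = resp["services"][0]

# A service in steady state generates few events; a container stuck in a
# start/stop loop generates a burst of "has started ... tasks" messages.
recent = service["events"][:20]
start_events = [e for e in recent if "has started" in e["message"]]
if len(start_events) > 5:
    print(f"{SERVICE}: {len(start_events)} task-start events recently - possible crash loop")
for e in recent[:5]:
    print(e["createdAt"], e["message"])
```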
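
For item 4, connection usage can be sampled straight from PostgreSQL and cron'd every few minutes; a minimal sketch (the connection parameters are placeholders):

```python
import psycopg2

# Placeholder connection settings - use the real host and a read-only monitoring role.
conn = psycopg2.connect(host="pg.example.internal", dbname="postgres",
                        user="monitor", password="...")
with conn.cursor() as cur:
    cur.execute("SHOW max_connections;")
    max_conn = int(cur.fetchone()[0])
    cur.execute("SELECT count(*) FROM pg_stat_activity;")
    in_use = cur.fetchone()[0]
print(f"{in_use}/{max_conn} connections in use ({100.0 * in_use / max_conn:.0f}%)")
conn.close()
```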

There's also curiosity about how many "hits" each site (and each API service) receives per day, so that we have some idea whether these sites & services might require additional enhancements, and whether there's a reason to look into performance issues for possible scale-up/scale-out. This "curiosity"-driven work is a lower priority from an engineering/DoS-protection point of view, but may be of more interest from a marketing point of view.
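
One low-effort way to answer the "hits per day" question, assuming each site sits behind an Application Load Balancer, is to sum the ALB's RequestCount metric in daily buckets; a sketch (the LoadBalancer dimension value is a placeholder):

```python
from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

# The dimension value is the ALB's "LoadBalancer" suffix, e.g. "app/<name>/<hash>" (placeholder).
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="RequestCount",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/hacko-alb/50dc6c495c0c9188"}],
    StartTime=datetime.utcnow() - timedelta(days=7),
    EndTime=datetime.utcnow(),
    Period=86400,          # one bucket per day
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"].date(), int(point["Sum"]), "requests")
```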

znmeb commented 6 years ago

Both kinds of monitoring (capacity planning and marketing) are subjects near and dear to my heart. ;-)

  1. I suspect the capacity planning is adequately covered by the AWS tools - people wouldn't deploy there if they didn't have the tools to manage and plan capacity. I know a lot about capacity planning, but nearly all of it is useful only in a bare metal environment, not in containers running inside virtual machines. ;-)

  2. On the marketing side, pretty much everyone I know starts with the free tier of Google Analytics. There are other ways to do it but Google Analytics is one that everyone understands.

nam20485 commented 6 years ago

I think the first step, and the way to avoid the black hole, is to gather measurements to characterize our performance. Too often performance tuning or optimization is requested without truly understanding the load on each facet of the system. If we can set up some kind of benchmark or measurement for each of the items you listed, I think we could better understand where our bottlenecks are, why they are there, and which pieces actually need (or don't need) additional resources. With empirical data we can make informed decisions about where to spend time or money to improve the performance of our system, and justify doing so.
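
One cheap way to start gathering those measurements, without committing to a full monitoring stack, would be a scheduled probe that records status and latency for each endpoint; a rough sketch (the endpoint URLs are placeholders):

```python
import csv
import time
from datetime import datetime, timezone

import requests

# Placeholder endpoints - swap in the real API URLs we care about.
ENDPOINTS = [
    "https://service.civicpdx.org/example-api/",
    "https://service.civicpdx.org/another-api/",
]

with open("probe_log.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for url in ENDPOINTS:
        start = time.monotonic()
        try:
            resp = requests.get(url, timeout=10)
            status = resp.status_code
        except requests.RequestException as exc:
            status = f"error: {exc.__class__.__name__}"
        elapsed_ms = (time.monotonic() - start) * 1000
        # One row per probe: timestamp, endpoint, status, latency in ms
        writer.writerow([datetime.now(timezone.utc).isoformat(), url, status, round(elapsed_ms, 1)])
```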

Website usage is a perfect example: if we don't know the characteristics of its access, we have no idea whether we even have a scaling problem.

I would advocate that the first step be researching the various ways we could gather usage and performance data for each of the components you listed, and then implementing those to measure each piece. This way we don't have to spend time optimizing or scaling any one piece until we know which pieces are not performing adequately.

It sounds like Ed may have some insight or experience in this area that we could leverage, if we can figure out how to apply it to dockerized containers. Or at least the services we run inside them.

znmeb commented 6 years ago

I would start with the usage monitoring - Google Analytics for the front end and whatever logging Django does by default on API usage. But this requires a conversation inside the organization, not just an issue discussion on GitHub.
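
Out of the box Django only logs request errors rather than every API hit, so per-request usage data would need a small addition; a minimal sketch of a timing middleware that could be added to MIDDLEWARE in settings.py (the logger name is arbitrary):

```python
import logging
import time

# Route this logger to a file or CloudWatch via the LOGGING settings.
logger = logging.getLogger("api.access")


class RequestLogMiddleware:
    """Log method, path, status and duration for every request."""

    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        start = time.monotonic()
        response = self.get_response(request)
        duration_ms = (time.monotonic() - start) * 1000
        logger.info("%s %s -> %s (%.1f ms)",
                    request.method, request.get_full_path(),
                    response.status_code, duration_ms)
        return response
```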

nam20485 commented 6 years ago

You mean because it would cost $, or why?