hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Create/implement task definition for 2018 Transportation-Systems-Service project #104

Closed: iant01 closed this issue 6 years ago

iant01 commented 6 years ago

Create the service subdirectory and service.yaml file used to get the service task definition into ECS.

iant01 commented 6 years ago

Created PR 16 with the changes needed in master.yaml to add the transportation-systems service, plus a service.yaml file to define the task definition and the load-balancer listener rule for the service.

Right now, the following items have arbitrarily set values:

Host: staging-2018.civicpdx.org
Path: /transportation-systems
Port: 3000
Priority: 40 (needs to sort before the civic-2018 service and the civic-lab service)
Memory: 2048 (2 GB; last year's service was a memory pig, hopefully this year's will use less, so setting it high to start)

znmeb commented 6 years ago

@iant01 How much memory did we use last year? And how do you measure it? Is there some way we can test this locally before deploying?

iant01 commented 6 years ago

We can possibly use docker stats on a running host to get memory info for the running containers. Either a container developer would need to run the command on their local system, or we run it on another ECS instance. Since we can't SSH into the hacko's container instance to run the command, we might be able to run the transportation-systems container on another ECS instance, but I have not had any success running the 2017 container in my AWS account, so I may not have success with the 2018 container either. I will give it a try.

There may be a Docker API that would work against the hacko ECS instance, but again we would likely need an access key to get in.

znmeb commented 6 years ago

This is the API containers, right? If those look like this year's API images from the backend-examplar, either there's an AWS way to monitor their usage or we'd need console access to the Docker host. :-(

MikeTheCanuck commented 6 years ago

@znmeb , is there any chance of running the container locally, performing a few operations through the API (to load up some in-memory data), and running the docker stats command as Ian suggested above?

MikeTheCanuck commented 6 years ago

There is no way we're going to throw 1/4 of our available memory at a new container "just in case" - this was only done last year as a last-minute, last-resort fix, and no one's had time to go back and characterize that pig since then.

znmeb commented 6 years ago

Yeah, I can spin it up locally but this isn't the full API. Should I just use the Docker host default settings for container resource usage?

It would be really nice if we could build resource limiting into the images - interpreted languages like Python tend to take up all the RAM they can find even if they're sharing it with a dozen other containers / VMs they don't know about.

MikeTheCanuck commented 6 years ago

I'm confused - why isn't the Docker image you'd spin up locally "the full API"? Isn't that one of the benefits of Docker, that the app you run locally and the one you deploy into production are identical?

znmeb commented 6 years ago

It's the full API for the one database we had when we built the image. We have more data now, which will mean more models and more API endpoints and probably more RAM used.

BrianHGrant commented 6 years ago

So we have some options to profile Python and Django behavior, including running with DEBUG true under the gunicorn server (the -p flag) while connecting to the AWS DB, some usage of the Django Debug Toolbar (not currently installed), or maybe New Relic if we need more advanced info than docker stats provides.

That said, there were some complexities to the transportation project last year that didn't exist when I left a bit ago; I'll catch up this weekend, but I'm not sure if this will be an issue.

Good data on usage is great, though.

MikeTheCanuck commented 6 years ago

Let's not go overboard here - the most significant thing we need to know is roughly how much RAM the Django app(s) in the container will consume, so that we can allocate a sufficient amount of RAM in the AWS CloudFormation template for this container. We generally start out with 100MB and bump it in increments of 100 from there, and we spent a lot of time last year debugging containers that wouldn't stay running because we had no idea what kind of memory load they would carry.

However, we're not just going to throw RAM at these - this isn't an unlimited resource - so if there's some risk that they'll need more than 100MB, let's get a rough number based on some rough characterization. Thanks!

znmeb commented 6 years ago

By the way, wouldn't DEBUG=True use more RAM?

iant01 commented 6 years ago

Silly question... was the transportation container last year running a database rather than connecting to a database server, or was it a hybrid of both (keeping large amounts of data local after grabbing it from a remote DB server)?

bhgrant8 commented 6 years ago

Yes DEBUG would use more RAM.

But here is where I've got to:

docker stats gives streaming output, so a point-in-time snapshot of memory usage and a few other stats:

CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS

I then ran this against the Docker container transportation-system-backend_api_production_1 on my host machine, using the current API.

During startup of the container using the prod flag (./bin/start.sh -p) and connecting to the AWS-hosted DB, we see CPU % maxing out around 85%, with memory usage going to around 152 MiB.

[screenshot: docker stats output for transportation-system-backend_api_production_1]

The thing I was seeing, though, is that MEM USAGE did not seem to drop by more than a few MiB; after a few queries using filters on the crash data, I made it to ~225 MiB. So I started looking into what this figure actually includes.
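For anyone comparing snapshots like this, note that the MEM USAGE column prints human-readable binary units, so a tiny parser helps when turning them into numbers. A sketch (the parse_mem helper is hypothetical, not part of any tool mentioned here; the sample cell reuses the ~225 MiB figure above):

```python
def parse_mem(s: str) -> float:
    """Convert a docker stats memory string like '152MiB' or '1.952GiB' to bytes.

    docker stats prints binary (1024-based) units; plain 'B' is checked last
    so it does not shadow the 'KiB'/'MiB'/'GiB' suffixes.
    """
    for suffix, factor in (("KiB", 1024), ("MiB", 1024**2), ("GiB", 1024**3), ("B", 1)):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * factor
    raise ValueError(f"unrecognized memory string: {s!r}")

# Example: split a 'MEM USAGE / LIMIT' cell into numeric bytes.
usage, limit = (parse_mem(p.strip()) for p in "225MiB / 1.952GiB".split("/"))
print(round(usage / 1024**2))  # 225
```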

First, I found Google's cAdvisor (https://github.com/google/cadvisor). It provides a GUI and 60 seconds of historical data, so it's a bit more useful than docker stats.

[screenshot: cAdvisor memory usage graph, 2018-05-19]

Looking into the MEM usage, I came across this issue, which documents the different types of memory being recorded:

https://github.com/google/cadvisor/issues/638

tl;dr:

Hot is the working set - pages that have been recently touched, as calculated by the kernel.

Total includes hot + cold memory - where cold are the pages that have not been touched in a while and can be reclaimed if there was global memory pressure.

or another way:

Total (memory.usage_in_bytes) = rss + cache
Working set = Total - inactive, where inactive (not recently accessed memory) = inactive_anon + inactive_file
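In concrete terms, the cAdvisor arithmetic above works out like this (the byte values below are made-up illustrative numbers, not measurements from this container):

```python
MiB = 1024**2

# Hypothetical cgroup memory.stat values, for illustration only.
rss = 150 * MiB            # anonymous memory held by the process
cache = 80 * MiB           # page cache attributed to the container
inactive_anon = 10 * MiB   # anon pages not recently touched
inactive_file = 60 * MiB   # file pages not recently touched

total = rss + cache                                     # memory.usage_in_bytes
working_set = total - (inactive_anon + inactive_file)   # the "hot" figure

print(total // MiB, working_set // MiB)  # 230 160
```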

So the question becomes: which number matters most?

bhgrant8 commented 6 years ago

@iant01 I feel like there was some type of hybrid data store going on, but I was not directly on the project last year and am not completely sure of the full magic that was happening.

MikeTheCanuck commented 6 years ago

Awesome data Brian, thank you.

When we allocate memory to each container, there’s no memory management to worry about - as in, the “cold” memory that could be reclaimed probably wouldn’t be, because there’s nothing else in the container that would appreciably request contended memory (it’d all be consumed by one process - gunicorn, Python, whatever the runtime host is).

So given we’re doing hard allocations per container, I’m going to conservatively assume that we should use the Total - and then round up to the nearest 100 (just to give us a little breathing room for edge cases and future API enhancements).

Based on this data, I’m inclined to allocate 300 MB to this transportation-systems container.
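The "Total, then round up to the nearest 100" rule is easy to sketch (the helper name is hypothetical; the 225 input is the observed Total from above):

```python
def round_up_to_step(mib: int, step: int = 100) -> int:
    """Round an observed memory figure up to the next multiple of `step` MiB."""
    return -(-mib // step) * step  # ceiling division via negated floor division

print(round_up_to_step(225))  # 300 - matches the proposed allocation
```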

znmeb commented 6 years ago

I've got the merged database ready for testing - I'm planning to build a local development environment from it at the May 20 build session so we can see what we have.

iant01 commented 6 years ago

All of the discussion on memory use should be moved to its own new issue; this issue was intended for creating the service task to get things going in ECS.

This issue can be closed once all the memory discussion is in its own issue and PR 16 has been merged.

iant01 commented 6 years ago

On the question of which memory figure is relevant: it would be the Total memory size.