hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform
MIT License

Create new EC2 VM for hosting the consolidated Development database #19

Closed MikeTheCanuck closed 6 years ago

MikeTheCanuck commented 6 years ago

ToDo List - EC2 Instance

This issue fulfils part of #18.

MikeTheCanuck commented 6 years ago

Paging @iant01 - could use some advice on most of the above decisions:

znmeb commented 6 years ago
  1. Which AMI? CentOS / RHEL? Ubuntu 16.04 LTS? It really matters, because everything we've done in Transportation so far has been on Debian in containers. Porting to Ubuntu is not a tough job; going to CentOS / RHEL is going to be more challenging.

  2. If this machine is just going to be running PostgreSQL, what's the cost difference for this versus a dedicated PostgreSQL server like RDS?

  3. Ultimately we want to do what's easiest for operations - is one big PostgreSQL server the best case for that?

As to sizing, for those teams that have databases built, I can help plan that. If we host the full TriMet congestion database, that's 30 GB right there, and that doesn't leave room for working space. We can always add virtual disks and tablespaces, however. That's a tricky manual process but I've done it and it's well-documented.
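
For reference, here's a minimal sketch of that manual process, assuming a new volume is already formatted and mounted at /pgdata/extra (the mount point and database name below are placeholders):

# give PostgreSQL ownership of the new mount point
sudo chown postgres:postgres /pgdata/extra

# create a tablespace on the new volume, then move a database onto it
sudo -u postgres psql -c "CREATE TABLESPACE extra LOCATION '/pgdata/extra';"
sudo -u postgres psql -c "ALTER DATABASE somedb SET TABLESPACE extra;"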

iant01 commented 6 years ago

OK, first we need to answer some questions to determine an instance size (at least to start with):

  1. During development, does it need to be the same size as what may be needed in production? (I would say no, if cost is a concern.)
  2. Unlike GCP, where memory and CPU can be scaled independently, do you want to pay for unused CPU and network performance just to have enough memory to hold the entire table space in memory (either in development or production)?
  3. Does PostgreSQL require all table space to be in memory? (Most relational databases have tuning parameters to indicate how much memory to allocate to table and query space.)
  4. If, for dev and even for production, instantaneous query response is not a requirement (and we're not looking at thousands or even hundreds of queries a second), go for something smaller that just exceeds the current requirements and scale the instance up later.
  5. Don't confuse the on-disk size of the database with the size of the root disk needed for the instance; the database itself should be on a separate EBS volume, not on the root disk.

Also, the AMI decision is somewhat separate from the instance size (type) decision. (There are some crossover items to consider, such as the root disk size of the AMI, since a smaller instance may not have a large enough root disk to host it, and the root disk type, EBS vs. instance store.)

Based on what Ed has indicated, Debian is currently being used, and there may be work involved to get things moved to and running on a different flavor of Linux (Ubuntu, RHEL/SUSE, Amazon Linux 2). Ed also indicated PostgreSQL 10 and PostGIS 2.4; I need to look to see if there are any AWS AMIs that have this. Community AMIs can carry risks, and ones from the marketplace cost money. AWS RDS PostgreSQL is 9.6.x and PostGIS is 2.3 as of this week; people are reporting they can spin up RDS with PostgreSQL 10.1 successfully, but AWS has not officially indicated that anything above 9.6.x is available/supported, and there is no indication whether PostGIS 2.4 is there yet.

Given this, I suspect that short term we will need to roll our own AMI with a version of PostgreSQL/PostGIS we can get by on (i.e. not the bleeding-edge 10.2 / 2.4, but 10.1 / 2.4.0 or 2.3.x), or a lower, currently AWS-supported version. We should also build it off a current base AWS AMI with a supported Linux version/distribution we can work with.

As we get closer to demo, all of this may (and probably will) change as AWS moves forward with their supported AMIs.

MikeTheCanuck commented 6 years ago

AMI requirements

Instance type minimum requirements for database server for development (not production)

External storage minimum requirements

Later considerations

BrianHGrant commented 6 years ago

So, just going to leave this thought out there - maybe it's too much overhead for the moment if having an initial development database is a blocking issue for many teams.

https://github.com/hackoregon/hackoregon-aws-infrastructure

https://github.com/hackoregon/backend-service-pattern/blob/master/.travis.yml
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/push
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/ecs-deploy.sh

https://github.com/hackoregon/hackoregon-aws-infrastructure/blob/master/infrastructure/ecs-cluster.yaml#L80

https://github.com/hackoregon/hackoregon-aws-infrastructure/blob/master/services/transport-service/service.yaml

The question I am leading to is:

Once the base AMI is figured out for compatibility (we did run the current flavor of PostGIS on emergency-response, and I think other teams did too, if that helps as a starting point), could we handle further configuration of the actual environment through a model similar to the APIs, i.e. a Dockerized GitHub repo, deployed via Travis to a preconfigured service within the CloudFormation template?

We have the benefits of:

BrianHGrant commented 6 years ago

oops, forgot references, will edit my last comment in a minute with links

MikeTheCanuck commented 6 years ago

Thanks @BrianHGrant - this seems like a good thing to consider for the future. At the moment, standing up a central database to get all five projects running is more important than standing up a reproducible database.

(The design of the ECS, Docker, Travis infrastructure for the AWS API layer is definitely still in scope this year, but we are not precluding a different design for our AWS data layer, which is more "durable infrastructure, iterative data loading" than the "iterative infrastructure and iterative code loading" that the API and front-end layers follow.)

We are not interested in adding the Docker complexity to the AWS data layer at this time - the benefits to our central infrastructure are unclear, and won't help us one bit if we decide to migrate to RDS (to reduce our operations overhead, for example). We have enough moving parts to worry about already, and taking the incremental improvement of one DB server over the proliferation of bespoke DB servers last year is my primary goal at this time (alongside "getting a database up and running for each 2018 project").

We plan to explore writing as many (bash) scripts as needed to support migration of data from S3 to the central AWS data layer, so that this is less "how the fsck did we do that?" each time we touch data. Any help in constructing those would be welcome.
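
As a starting point, a minimal sketch of one such script, assuming the AWS CLI is configured on the database host and that the bucket, key and database names below are placeholders:

#!/bin/bash
# restore-from-s3.sh - pull a pg_dump custom-format backup down from S3 and load it
set -euo pipefail
BUCKET=s3://hacko-data-archive             # placeholder bucket name
KEY=2018/transportation/somedb.backup      # placeholder object key
DB=somedb                                  # placeholder database name

aws s3 cp "$BUCKET/$KEY" /tmp/restore.backup
createdb "$DB" || true                     # ignore the error if the database already exists
pg_restore --no-owner --clean --if-exists -d "$DB" /tmp/restore.backup
rm /tmp/restore.backup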

znmeb commented 6 years ago

To be fair, I don't know that we absolutely need PostgreSQL 10 / PostGIS 2.4 - I just build on those because they're what's currently on the desktop. So my fundamental question is whether the cost for an RDS instance is more than if we use the same-sized CPU, RAM and magnetic / SSD disks and spin it up ourselves.

If the numbers pencil out for RDS over rolling our own, I'll do the "downgrade" testing to make things work with 9.6 / 2.3.

znmeb commented 6 years ago

@iant01 Once we spin up an RDBMS, upgrading is not a viable option. If we spin up 9.6 / 2.3 we are committed to that through demo day!

IMHO this is a classic build vs. buy decision - are the long-term costs with an RDS instance lower than the long-term costs of Hack Oregon custom code and operations for a roll-our-own? My bias is towards buy, but that's my corporate IT background.

Josh Berkus did a paper on this a few years back. RDS turned out to be very expensive but Josh has been a PostgreSQL developer for a long long time. What's easy for him would not necessarily be easy for us in production.

https://www.youtube.com/watch?v=WV5P2DgxPoI

MikeTheCanuck commented 6 years ago

@znmeb yes, RDS is (apples-to-apples) more costly than running PostgreSQL on our own EC2 instance. How much more is an open question, as costs keep changing while Amazon balances maximizing profits against keeping up with the competition.

We are in the unenviable position of both having "minimize budget" as a requirement and the reality that part-time volunteers are a thin line of defense for keeping up with cloud operations. There is no good choice, only slightly-better-of-the-bad choices.

That said, we've got this - additional debate on this isn't necessary here. What our development teams need to focus on is their core infrastructure requirements, and how they will deliver an installable data package that we can consume into a PostgreSQL-compatible service layer. Sounds like pg_dump is your weapon of choice, Ed, so we should be able to wrangle that no matter which Postgres implementation we choose.
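
For illustration, the hand-off could be as small as this (a sketch; the database and bucket names are placeholders):

# on the project team's machine: custom format (-Fc) is compressed and pg_restore-friendly
pg_dump -Fc -d somedb -f somedb.backup

# drop the file somewhere we can reach it, e.g. an S3 bucket
aws s3 cp somedb.backup s3://hacko-data-archive/2018/somedb.backup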

And it looks like Amazon's package infrastructure only supports Postgres through 9.6, so that's what we'll be deploying.

znmeb commented 6 years ago

@MikeTheCanuck Sounds good

  1. Do you know if we'll have PostGIS 2.3 or 2.4? I don't see any problems with PostgreSQL 9.6 - it's well-maintained by postgresql.org. But I had a lot of problems with earlier PostGIS versions last year and I'd much rather have 2.4 if at all possible.
  2. For pg_restore, having the input dump file on somebody's laptop or workstation is a non-starter - it takes hours to get our 6 GB dump of the congestion database up to "the cloud" on Comcast's puny 6 megabit per second uploads. How do we get the .backup files / directories up to AWS S3 faster?

znmeb commented 6 years ago

A note on PostgreSQL capacity planning:

For most of our projects, the major limiting factor is going to be disk space, not RAM or cores. PostgreSQL's default settings rarely need to be changed, and when they do, you'll usually get a warning that tells you what's happening and how to fix it. And most modern app stacks like Django use the RDBMS as a dumb persistence engine, not as a place where application logic lives.
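
For example, a quick way to see where the disk is actually going (a sketch; the mount point is an assumption):

# per-database sizes, largest first
sudo -u postgres psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database ORDER BY pg_database_size(datname) DESC;"

# free space remaining on the data volume
df -h /pgdata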

znmeb commented 6 years ago

Well, folks, as you may have heard, it turns out that binary RPM packages for PostGIS are not available in the repositories that ship on Amazon Linux 2 as of 2018-02-24. I've posted a question in their support forum asking when PostGIS will be there: https://forums.aws.amazon.com/message.jspa?messageID=832523#832523.

I can build PostGIS and its numerous dependencies from source, but it will take some time, and it will be a support and deployment nightmare. So I'm officially putting one of my opposable thumbs down on Amazon Linux 2. :-(

MikeTheCanuck commented 6 years ago

Thanks for the update @znmeb. Yes it looks like the Amazon repositories (both the default and amazon-linux-extras) don't include PostGIS packages.

However, I've pulled together some fragments of info that point to a potential avenue: others have succeeded in the past by adding an "Amazon Linux AMI" repository link to their machine and then installing the PostGIS packages. See the Amazon Linux AMI 2015.03 - x86_64 link here, which others have configured on an Amazon Linux machine like so:

sudo rpm -Uvh https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm

This results in the following output, which I think we can work with:

$ sudo yum list | grep postgis
postgis22_96.x86_64                     2.2.6-2.rhel6                 pgdg96    
postgis22_96-client.x86_64              2.2.6-2.rhel6                 pgdg96    
postgis22_96-debuginfo.x86_64           2.2.6-1.rhel6                 pgdg96    
postgis22_96-devel.x86_64               2.2.6-2.rhel6                 pgdg96    
postgis22_96-docs.x86_64                2.2.6-2.rhel6                 pgdg96    
postgis22_96-utils.x86_64               2.2.6-2.rhel6                 pgdg96    
postgis23_96.x86_64                     2.3.6-2.rhel6                 pgdg96    
postgis23_96-client.x86_64              2.3.6-2.rhel6                 pgdg96    
postgis23_96-debuginfo.x86_64           2.3.4-1.rhel6                 pgdg96    
postgis23_96-devel.x86_64               2.3.6-2.rhel6                 pgdg96    
postgis23_96-docs.x86_64                2.3.6-2.rhel6                 pgdg96    
postgis23_96-utils.x86_64               2.3.6-2.rhel6                 pgdg96    
postgis24_96.x86_64                     2.4.3-1.rhel6                 pgdg96    
postgis24_96-client.x86_64              2.4.3-1.rhel6                 pgdg96    
postgis24_96-debuginfo.x86_64           2.4.1-1.rhel6                 pgdg96    
postgis24_96-devel.x86_64               2.4.3-1.rhel6                 pgdg96    
postgis24_96-docs.x86_64                2.4.3-1.rhel6                 pgdg96    
postgis24_96-utils.x86_64               2.4.3-1.rhel6                 pgdg96    
postgis2_96.x86_64                      2.3.2-1.rhel6                 pgdg96    
postgis2_96-client.x86_64               2.3.2-1.rhel6                 pgdg96    
postgis2_96-debuginfo.x86_64            2.3.2-1.rhel6                 pgdg96    
postgis2_96-devel.x86_64                2.3.2-1.rhel6                 pgdg96    
postgis2_96-docs.x86_64                 2.3.2-1.rhel6                 pgdg96    
postgis2_96-utils.x86_64                2.3.2-1.rhel6                 pgdg96

If this works, we'll still follow the Amazon Linux playbook as much as possible, only diverging where absolutely necessary (and where it works).
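
If the packages do resolve, the install step would presumably look something like this (a sketch, not yet verified on Amazon Linux; the database name is a placeholder):

# pull PostGIS 2.4 for PostgreSQL 9.6 from the pgdg96 repo listed above
sudo yum install -y postgis24_96 postgis24_96-client

# once the cluster is initialized and running, enable the extension per database
sudo -u postgres psql -d somedb -c "CREATE EXTENSION postgis;"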

znmeb commented 6 years ago

That's the PGDG repo - the standard repo where the PostgreSQL and PostGIS projects put their RHEL-compatible RPMs. They're compiled for RHEL 6, not RHEL 7 / CentOS 7, and not Amazon Linux 2. Compatible libraries may or may not exist in Amazon Linux 2.

What you'd have to do is set up a build server running Amazon Linux, download the source RPMs, then recompile them. And there are dependencies from EPEL - you'll need those too. This is a risky path; I've been there and I've got the scars. This is the vendor's job - Amazon's - not the customer's (Hack Oregon's).

MikeTheCanuck commented 6 years ago

VM instance type:

Based on the budget guidance I've been hearing so far, we're going to start out with a t2.large and scale up as demand increases. Scaling up requires only momentary downtime - shut down the instance, change the instance type, and start the instance, which generally takes less than a minute.
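
For reference, the resize itself is only a few AWS CLI calls (a sketch; the instance ID and target type are placeholders):

aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 --instance-type "{\"Value\": \"t2.xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0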

MikeTheCanuck commented 6 years ago

Network selection:

MikeTheCanuck commented 6 years ago

Elastic IP allocation:

This implies that, as a result of standing up an EC2 instance, it is automatically allocated an Internet-accessible IP address. So let's assume this point is moot.

MikeTheCanuck commented 6 years ago

EBS Volume for database storage:

Thus, we have to predict the ultimate size of our data and pre-allocate that as an EBS volume up front, or risk having to migrate the database server and its data later to another EC2 instance.

(Theoretically we ought to have robust data backup & restore procedures, and be able to migrate to a new server at any time - making this a trivial risk. That said, this is neither something we have extensive experience with, nor the time to plan out ahead of standing up our PostgreSQL server.)

As a wild-ass guess of how much data volume we'll need in development this season, it seems like 500GB should more than cover our development needs - how could we possibly see data grow beyond that? [tongue planted firmly in cheek]
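
For what it's worth, pre-allocating and attaching that volume comes down to something like this (a sketch; the availability zone, IDs, device name and mount point are placeholders):

aws ec2 create-volume --size 500 --volume-type gp2 --availability-zone us-west-2a
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/xvdf

# on the instance: format and mount the new volume for PostgreSQL's use
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /pgdata
sudo mount /dev/xvdf /pgdata
echo '/dev/xvdf /pgdata ext4 defaults,nofail 0 2' | sudo tee -a /etc/fstab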

znmeb commented 6 years ago

Check with @BrianHGrant to be sure but I think we are dropping our large (4 GB raw, 30 GB of PostgreSQL tablespace) dataset, if that's included in your 85 GB.

And even if you can't grow a volume, as long as you can add new volumes you can put PostgreSQL databases on them via tablespaces.

There are standard operating procedures for migrating PostgreSQL databases to bigger boxes as well. If you want to go that route let me go book hunting.

MikeTheCanuck commented 6 years ago

We've got a strategy in place to work with this:

Once Ian's script drops and is tested, we'll be able to go live with the database server.

MikeTheCanuck commented 6 years ago

As of last night, the EC2 VM hosting the development PostgreSQL instance is live, and one project team has their databases and credentials available.

Issues #60 and #61 address the two outstanding concerns, so this item can now be closed.