Paging @iant01 - could use some advice on most of the above decisions:
Amazon Linux 2 LTS Candidate AMI 2017.12.0 (HVM), SSD Volume Type - ami-7f43f307
because it's "tuned for EC2 performance and security" and because selecting an Amazon AMI seems to give us more chance of supportability longer-term than some Community AMIWhat is AMI? CentOS / RHEL? Ubuntu 16.04LTS? It really matters because everything we've done in Transportation so far has been on Debian in containers. It's not a tough job to port to Ubuntu. Going to CentOS / RHEL is going to be more challenging.
If this machine is just going to be running PostgreSQL, what's the cost difference for this versus a dedicated PostgreSQL server like RDS?
Ultimately we want to do what's easiest for operations - is one big PostgreSQL server the best case for that?
As to sizing, for those teams that have databases built, I can help plan that. If we host the full TriMet congestion database, that's 30 GB right there, and that doesn't leave room for working space. We can always add virtual disks and tablespaces, however. That's a tricky manual process but I've done it and it's well-documented.
OK, first we need to answer some questions to determine an instance size (at least to start with):
1) During development, does it need to be the same size as what may be needed in production? (I would say no if we're concerned with cost.)
2) Unlike GCP, where memory and CPU can be scaled independently, do you want to pay for unused CPU and network performance just to have enough memory to hold the entire table space in RAM (either in development or production)?
3) Does Postgres require all table space to be in memory? (Most relational databases have tuning parameters that indicate how much memory to allocate to table and query space - see the sketch below.)
4) If, for dev and even production, instantaneous query response is not a requirement (and we're not looking at thousands or even hundreds of queries a second), go for something smaller that just exceeds the current requirements and scale the instance up later.
5) Don't confuse the on-disk size of the database with the size of the root disk needed for the instance; the database itself should be on a separate EBS volume, not on the root disk.
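On question 3: no - PostgreSQL keeps hot pages in its shared buffers and relies on the OS page cache for the rest, so the whole table space does not need to fit in RAM. The knobs below are standard PostgreSQL settings; the values are illustrative guesses for a mid-size instance, not recommendations:

# postgresql.conf - illustrative memory settings only
shared_buffers = 2GB            # dedicated page cache; ~25% of RAM is a common rule of thumb
work_mem = 32MB                 # per sort/hash operation, so multiply by concurrent queries
maintenance_work_mem = 256MB    # used by CREATE INDEX, VACUUM, and restores
effective_cache_size = 6GB      # planner hint about OS cache size; affects plans, not allocation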
Also, the AMI decision is somewhat separate from the instance size (type) decision, though there are some crossover items to consider, such as the root disk size of the AMI (a smaller instance may not have a large enough root disk to host it) and the root disk type (EBS vs. instance store).
Based on what Ed has indicated, Debian is currently being used, and there may be work involved to get things moved to and running on a different flavor of Linux (Ubuntu, RHEL/SUSE, Amazon Linux 2). Ed also indicated PostgreSQL 10 and PostGIS 2.4; I need to look to see if there are any AWS AMIs that have this. Community AMIs can carry risks, and ones from the Marketplace cost money. AWS RDS Postgres is 9.6.x and PostGIS is 2.3 as of this week; people are reporting they can spin up RDS with PostgreSQL 10.1 successfully, but AWS has not officially indicated that anything above 9.6.x is available/supported, and there's no indication whether PostGIS 2.4 is there yet.
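Rather than relying on forum reports, one way to check what RDS will actually let us launch is the AWS CLI (assumes credentials are configured; the command and flags are standard aws-cli):

# list the PostgreSQL engine versions RDS currently offers
aws rds describe-db-engine-versions --engine postgres \
    --query 'DBEngineVersions[].EngineVersion' --output text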
Given this, I suspect that short term we will need to roll our own AMI with a version of PostgreSQL/PostGIS we can get by on (i.e. not the bleeding-edge 10.2 / 2.4.x, but 10.1 / 2.4.0, or 2.3.x) or lower, currently AWS-supported versions, and build it off a current base AWS AMI with a supported Linux version/brand we can work with.
As we get closer to demo, all this may, and probably will, change as AWS moves forward with their supported AMIs.
So I'm just going to leave this thought out there - it may be too much overhead for the moment if having an initial development database is a blocking issue for many teams.
https://github.com/hackoregon/hackoregon-aws-infrastructure
https://github.com/hackoregon/backend-service-pattern/blob/master/.travis.yml
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/push
https://github.com/hackoregon/backend-service-pattern/blob/master/bin/ecs-deploy.sh
The question I am leading to is:
Once the base AMI is figured out for compatibility (we did run the current flavor of PostGIS on emergency-response, and I think other teams did too, if that helps as a starting point), could we handle further configuration of the actual environment through a model similar to the APIs, i.e. a Dockerized GitHub repo, built and deployed through Travis to a preconfigured service within the CloudFormation template?
We have the benefits of:
oops, forgot references, will edit my last comment in a minute with links
Thanks @BrianHGrant - this seems like a good thing to consider for the future. At the moment, standing up a central database to get all five projects running is more important than standing up a reproducible database.
(The design of the ECS, Docker, Travis infrastructure of the AWS API layer is definitely still in scope this year, but we are not precluding a different design for our AWS data layer, which is more "durable infrastructure, iterative data loading" than "iterative infrastructure and iterative code loading" that the API and front-end layers follow.)
We are not interested in adding the Docker complexity to the AWS data layer at this time - the benefits to our central infrastructure are unclear, and won't help us one bit if we decide to migrate to RDS (to reduce our operations overhead, for example). We have enough moving parts to worry about already, and taking the incremental improvement of one DB server over the proliferation of bespoke DB servers last year is my primary goal at this time (alongside "getting a database up and running for each 2018 project").
We plan to explore writing as many (bash) scripts as needed to support migration of data from S3 to the central AWS data layer, so that this is less "how the fsck did we do that?" each time we touch data. Any help in constructing those would be welcome.
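As a starting point, here's a minimal sketch of the shape one of those scripts could take. The bucket, file, and database names are placeholders, not our real ones; it assumes the AWS CLI and the Postgres client tools are installed on the EC2 host:

#!/bin/bash
# load-from-s3.sh - pull a pg_dump custom-format backup from S3 and restore it
set -euo pipefail

BUCKET="s3://hacko-data-archive"      # placeholder bucket
DUMP="congestion.backup"              # placeholder dump file
DB="congestion"                       # placeholder target database

# fetch the dump from S3 onto the database host
aws s3 cp "$BUCKET/$DUMP" "/tmp/$DUMP"

# create the target database if it doesn't exist, then restore into it
sudo -u postgres createdb "$DB" 2>/dev/null || true
sudo -u postgres pg_restore --no-owner --dbname "$DB" "/tmp/$DUMP"

# clean up the local copy
rm -f "/tmp/$DUMP"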
To be fair, I don't know that we absolutely need PostgreSQL 10 / PostGIS 2.4 - I just build on those because they're what's currently on the desktop. So my fundamental question is whether the costing for an RDS instance is more than if we use the same sized CPU, RAM and magnetic / SSD disks and spin it up ourselves.
If the numbers pencil out for RDS over rolling our own, I'll do the "downgrade" testing to make things work with 9.6 / 2.3.
@iant01 Once we spin up an RDBMS, upgrading is not a viable option. If we spin up 9.6 / 2.3 we are committed to that through demo day!
IMHO this is a classic build vs. buy decision - are the long-term costs with an RDS instance lower than the long-term costs of Hack Oregon custom code and operations for a roll-our-own? My bias is towards buy, but that's my corporate IT background.
Josh Berkus did a paper on this a few years back. RDS turned out to be very expensive but Josh has been a PostgreSQL developer for a long long time. What's easy for him would not necessarily be easy for us in production.
@znmeb yes, RDS is (apples-to-apples) more costly than running PostgreSQL on our own EC2 instance. How much more is a question, as costs keep changing while Amazon balances between maximizing profits and keeping up with the competition.
We are in the unenviable position of having both "minimize budget" as a requirement and the reality that part-time volunteers are a thin line of defense for keeping up with cloud operations. There is no good choice, only slightly-less-bad choices.
That said, we've got this - additional debate on this isn't necessary here. What our development teams need to focus on is their core infrastructure requirements, and how they will be able to deliver an installable data package that we can consume into a PostgreSQL-compatible service layer. Sounds like pg_dump is your weapon of choice, Ed, so we should be able to wrangle that no matter what Postgres implementation we choose.
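For reference, a minimal sketch of producing a dump that should restore cleanly across implementations (the database name is a placeholder; -Fc produces the compressed custom format pg_restore expects, and --no-owner avoids role mismatches between our laptops and the server):

# produce a compressed, custom-format dump of a project database
pg_dump -Fc --no-owner --dbname congestion --file congestion.backup

# optionally push it to S3 for the load script to pick up
aws s3 cp congestion.backup s3://hacko-data-archive/congestion.backup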
And it looks like Amazon's package infrastructure only supports Postgres through 9.6, so that's what we'll be deploying.
@MikeTheCanuck Sounds good
As for pg_restore, having the input dump file on somebody's laptop or workstation is a non-starter - it takes hours to get our 6 GB dump of the congestion database up to "the cloud" on Comcast's puny 6-megabit-per-second uploads. How do we get the .backup files / directories up to AWS S3 faster?

A note on PostgreSQL capacity planning:
For most of our projects, the major limiting factor is going to be disk space, not RAM or cores. PostgreSQL's default settings rarely need to be changed, and you'll usually get a warning that tells you what's happening and how to fix it. And most modern app stacks like Django use the RDBMS as a dumb persistence engine, not as a place where application logic lives.
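Since disk space is the thing to watch, here's a quick way to see where it's going (pg_database_size and pg_size_pretty are standard Postgres catalog functions; run this on the server):

# list databases by on-disk size, biggest first
sudo -u postgres psql -c "
  SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
  FROM pg_database
  ORDER BY pg_database_size(datname) DESC;"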
Well, folks, as you may have heard, it turns out that binary RPM packages for PostGIS are not available in the repositories that ship on Amazon Linux 2 as of 2018-02-24. I've posted a question in their support forum asking when PostGIS will be there: https://forums.aws.amazon.com/message.jspa?messageID=832523#832523.
I can build PostGIS and its numerous dependencies from source, but it will take some time, and it will be a support and deployment nightmare. So I'm officially putting one of my opposable thumbs down on Amazon Linux 2. :-(
Thanks for the update @znmeb. Yes, it looks like the Amazon repositories (both the default and amazon-linux-extras) don't include PostGIS packages.
However, I've pulled together some fragments of info that look like a potential avenue: others have succeeded in the past by adding an "Amazon Linux AMI" repository link to their machine and then installing the PostGIS packages. See the Amazon Linux AMI 2015.03 - x86_64 link here, which others have configured on an Amazon Linux machine like so:
sudo rpm -Uvh https://download.postgresql.org/pub/repos/yum/9.6/redhat/rhel-7-x86_64/pgdg-redhat96-9.6-3.noarch.rpm
This results in the following output, which I think we can work with:
$ sudo yum list | grep postgis
postgis22_96.x86_64 2.2.6-2.rhel6 pgdg96
postgis22_96-client.x86_64 2.2.6-2.rhel6 pgdg96
postgis22_96-debuginfo.x86_64 2.2.6-1.rhel6 pgdg96
postgis22_96-devel.x86_64 2.2.6-2.rhel6 pgdg96
postgis22_96-docs.x86_64 2.2.6-2.rhel6 pgdg96
postgis22_96-utils.x86_64 2.2.6-2.rhel6 pgdg96
postgis23_96.x86_64 2.3.6-2.rhel6 pgdg96
postgis23_96-client.x86_64 2.3.6-2.rhel6 pgdg96
postgis23_96-debuginfo.x86_64 2.3.4-1.rhel6 pgdg96
postgis23_96-devel.x86_64 2.3.6-2.rhel6 pgdg96
postgis23_96-docs.x86_64 2.3.6-2.rhel6 pgdg96
postgis23_96-utils.x86_64 2.3.6-2.rhel6 pgdg96
postgis24_96.x86_64 2.4.3-1.rhel6 pgdg96
postgis24_96-client.x86_64 2.4.3-1.rhel6 pgdg96
postgis24_96-debuginfo.x86_64 2.4.1-1.rhel6 pgdg96
postgis24_96-devel.x86_64 2.4.3-1.rhel6 pgdg96
postgis24_96-docs.x86_64 2.4.3-1.rhel6 pgdg96
postgis24_96-utils.x86_64 2.4.3-1.rhel6 pgdg96
postgis2_96.x86_64 2.3.2-1.rhel6 pgdg96
postgis2_96-client.x86_64 2.3.2-1.rhel6 pgdg96
postgis2_96-debuginfo.x86_64 2.3.2-1.rhel6 pgdg96
postgis2_96-devel.x86_64 2.3.2-1.rhel6 pgdg96
postgis2_96-docs.x86_64 2.3.2-1.rhel6 pgdg96
postgis2_96-utils.x86_64 2.3.2-1.rhel6 pgdg96
If this works, we'll still follow the Amazon Linux playbook as much as possible, only diverting where absolutely necessary (and when it works).
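If those packages do install cleanly, the remaining steps would presumably look something like the following (package names are taken from the listing above; whether the RHEL6-built RPMs actually resolve their dependencies on Amazon Linux 2 is exactly the open question):

# install PostgreSQL 9.6 server plus PostGIS 2.4 from the PGDG repo
sudo yum install -y postgresql96-server postgis24_96

# initialize the cluster and start the service
sudo /usr/pgsql-9.6/bin/postgresql96-setup initdb
sudo systemctl enable --now postgresql-9.6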
That's the PGDG repo - the standard repo where the PostgreSQL and PostGIS projects put their RHEL-compatible RPMs. They're compiled for RHEL 6 - not RHEL 7 / CentOS 7, and not Amazon Linux 2. Compatible libraries may or may not exist in Amazon Linux 2.
What you'd have to do is set up a build server running Amazon Linux, download the source RPMs, and recompile them. And there are dependencies from EPEL - you'll need those too. This is a risky path; I've been there and I've got the scars. This is the vendor's job - Amazon's - not the customer's (Hack Oregon's).
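For the record, the rebuild path being warned about here would look roughly like this (yumdownloader comes from yum-utils; the exact SRPM names and the EPEL dependency list are the parts that bite):

# on a throwaway Amazon Linux build box
sudo yum install -y yum-utils rpm-build gcc make

# grab the source RPM for the PostGIS package from the PGDG repo
yumdownloader --source postgis24_96

# rebuild it against the local libraries - this is where missing
# EPEL dependencies (geos, proj, gdal, ...) surface as build failures
rpmbuild --rebuild postgis24_96-*.src.rpm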
VM instance type:
Based on the budget guidance I've been hearing so far, we're going to start out with a t2.large and scale up as demand increases. Scaling up requires only momentary downtime - shut down the instance, change the instance type, and start the instance, which generally takes less than a minute.
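The resize itself is scriptable via the AWS CLI; a sketch (the instance ID and target type are placeholders):

# resize a stopped EC2 instance - roughly a minute of downtime
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type '{"Value": "t2.xlarge"}'
aws ec2 start-instances --instance-ids i-0123456789abcdef0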
Network selection:
vpc-b841bedf | public database
and vpc-a38acdc4 | hacko-integration
Elastic IP allocation:
It appears that standing up an EC2 instance automatically allocates it an Internet-accessible IP address, so let's assume this point is moot.
EBS Volume for database storage:
Thus, we have to predict the ultimate size of our data and pre-allocate that as an EBS volume up front. Or risk having to migrate the database server and its data later to another EC2 instance.
(Theoretically we ought to have robust data backup & restore procedures, and be able to migrate to a new server at any time - making this a trivial risk. That said, this is neither something we have extensive experience with, nor the time to plan out ahead of standing up our PostgreSQL server.)
As a wild-ass guess of how much data volume we'll need in development this season, it seems like 500GB should more than cover our development needs - how could we possibly see data grow beyond that? [tongue planted firmly in cheek]
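A sketch of pre-allocating that volume and attaching it to the instance (the availability zone and IDs are placeholders; the volume must live in the same AZ as the instance):

# create a 500 GB general-purpose SSD volume
aws ec2 create-volume --size 500 --volume-type gp2 \
    --availability-zone us-west-2a

# attach it to the database instance as /dev/xvdf
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf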
Check with @BrianHGrant to be sure but I think we are dropping our large (4 GB raw, 30 GB of PostgreSQL tablespace) dataset, if that's included in your 85 GB.
And even if you can't grow a volume, as long as you can add new volumes you can put PostgreSQL databases on them via tablespaces.
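Concretely, that looks something like this (device name, mount point, and database name are illustrative; assumes a freshly attached volume like the one above):

# format and mount the newly attached volume
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /data/pgvol2
sudo mount /dev/xvdf /data/pgvol2
sudo chown postgres:postgres /data/pgvol2

# create a tablespace on it and put a database there
sudo -u postgres psql -c "CREATE TABLESPACE vol2 LOCATION '/data/pgvol2';"
sudo -u postgres psql -c "CREATE DATABASE bigproject TABLESPACE vol2;"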
There are standard operating procedures for migrating PostgreSQL databases to bigger boxes as well. If you want to go that route let me go book hunting.
We've got a strategy in place to work with this:
Once Ian's script drops and is tested, we'll be able to go live with the database server.
As of last night, the EC2 VM hosting the development PostgreSQL instance is live, and one project team has their databases and credentials available.
Issues #60 and #61 address the two outstanding concerns, so this item can now be closed.
ToDo List - EC2 Instance
This issue fulfils part of #18.