hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

Define script that initializes new database from PostgreSQL dump file in S3 #17

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Scenario

We have a pg_dump backup file already stored in S3, and it needs to become a running PostgreSQL database instance to which an API can connect.

Problem

We need a script that performs the following steps (a rough sketch follows the list):

  1. SSHes into the EC2 machine running the PostgreSQL server (or shunts the following commands over an SSH connection)
  2. Reads a PostgreSQL dump file from a specified path in a specified S3 bucket (authenticating via a role assigned to the machine instance or via AWS keys in the shell)
  3. Connects to the local PostgreSQL database instance as the postgres database user and verifies that the target database is empty (if it is not empty, the script stops and prints a message to the operator)
  4. Restores the data in the backup file to the specified database instance
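
A minimal sketch of such a script, run from an operator's machine over a single SSH connection; the host, bucket, key, and database names are placeholders, not actual project values:

```bash
#!/usr/bin/env bash
# restore-from-s3.sh -- sketch only; all names below are hypothetical
set -euo pipefail

EC2_HOST="ec2-user@db.example.com"                 # placeholder EC2 host
S3_URI="s3://example-bucket/backups/example.dump"  # placeholder S3 path
DB_NAME="example_db"                               # placeholder target database

# Steps 2-4 run on the EC2 instance over one SSH connection (step 1).
ssh "$EC2_HOST" bash -s <<EOF
set -euo pipefail

# 2. Fetch the dump from S3 (instance role, or AWS keys in the shell).
aws s3 cp "$S3_URI" /tmp/restore.dump

# 3. Verify the target database is empty before restoring.
TABLE_COUNT=\$(psql -U postgres -d "$DB_NAME" -tAc \
  "SELECT count(*) FROM pg_tables WHERE schemaname = 'public';")
if [ "\$TABLE_COUNT" -ne 0 ]; then
  echo "Database $DB_NAME is not empty; aborting restore." >&2
  exit 1
fi

# 4. Restore the dump into the database.
pg_restore -U postgres -d "$DB_NAME" /tmp/restore.dump
EOF
```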

Prerequisites

Extra credit

Out of Scope

MikeTheCanuck commented 6 years ago

This is a single task broken out from #3, as we don't seem to be making much progress.

znmeb commented 6 years ago

Is there a specific project that needs this way of making databases? We've made them from CSVs for Transportation Systems one at a time. I can talk you through it. Here's a short how-to:

  1. You need a PostgreSQL database server somewhere.
  2. You need the command line psql client. This can be the same machine as the server or a different one.
  3. You need to define the DDL for the table you are making!! This generally requires human interaction with the source of the data. You can guess at the column types, but you shouldn't.

Example: the transportation ridership data: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql
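
For a sense of the shape, a single-CSV ingest of that kind might look like this; a generic sketch, not the contents of the linked file, with invented table, column, and file names:

```bash
#!/usr/bin/env bash
# Single-CSV ingest sketch; every name here is hypothetical.
set -euo pipefail

DB_NAME="example_db"   # placeholder database

# The DDL declares column names and types up front instead of guessing;
# \copy then loads the CSV client-side, from wherever psql is running.
psql -U postgres -d "$DB_NAME" <<'EOF'
DROP TABLE IF EXISTS ridership;
CREATE TABLE ridership (
    stop_id   integer,
    stop_name text,
    ride_date date,
    ons       integer,
    offs      integer
);
\copy ridership FROM 'ridership.csv' WITH (FORMAT csv, HEADER true)
EOF
```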

How many of these do you need to build? The code in https://github.com/hackoregon/ingest-ridership-data/ is Dockerized; all you'd need to change is the DDL and the CSV file names for any single-CSV database.

Scaling to multiple CSV files is simply a matter of writing an outer loop over all of them (see the sketch after this list). Each CSV file takes three steps:

  1. The DDL - create the table with the correct column names and types
  2. Do the \copy to load the CSV data into the table.
  3. Do any post-processing.
    • For Django APIs the tables all have to have a primary key.
    • You may or may not want to add a geometry column.
    • The \copy parser may not work for some date fields; you'll have to read them as text and convert them to date / time stamps. I had to do that for our big "congestion" dataset.
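
A minimal sketch of that outer loop, assuming one DDL file per CSV with matching basenames; the layout and all names below are hypothetical:

```bash
#!/usr/bin/env bash
# Outer-loop sketch: one DDL file per CSV, matched by basename.
set -euo pipefail

DB_NAME="example_db"   # placeholder database

for csv in data/*.csv; do
  table=$(basename "$csv" .csv)

  # 1. The DDL: create the table with the correct column names and types.
  psql -U postgres -d "$DB_NAME" -f "ddl/${table}.sql"

  # 2. \copy the CSV data into the table.
  psql -U postgres -d "$DB_NAME" \
    -c "\copy ${table} FROM '${csv}' WITH (FORMAT csv, HEADER true)"

  # 3. Post-processing, e.g. the primary key that Django requires.
  psql -U postgres -d "$DB_NAME" \
    -c "ALTER TABLE ${table} ADD COLUMN id serial PRIMARY KEY;"
done
```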

Work in progress for multiple (12) CSVs: https://github.com/hackoregon/transportation-congestion-analysis/blob/master/src/data/create-database.bash

MikeTheCanuck commented 6 years ago

Rewrote requirements to focus on pg_dump rather than CSV input.

znmeb commented 6 years ago

It turns out the AWS command-line client is available on the Amazon Linux 2 server: yum install -y awscli groff. groff is needed only if you want to run aws help at the command line.

So all you'd need to do is write scripts that fetch the pg_dump backup files from S3 using the aws command and restore them with pg_restore. No need to download them to, or worse, upload them from, a laptop / workstation.
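
Something like the following, presumably, with placeholder bucket, key, and database names:

```bash
# Fetch a custom-format dump from S3 and restore it, all on the server.
aws s3 cp s3://example-bucket/backups/example.dump /tmp/example.dump
pg_restore -U postgres -d example_db /tmp/example.dump
```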

MikeTheCanuck commented 6 years ago

Updated problem statement based on our research, to wit:

znmeb commented 6 years ago

Also, from what I'm finding on the Amazon Linux 2 Docker image, we will probably need to allow for teams creating compressed SQL dumps rather than the pg_restore custom format we've been using. For the creator, it's still a pg_dump command, optionally piped to gzip -c for compression. For the restore script, however, it's a psql command for an uncompressed dump and gzip -dc piped to psql for a compressed one.

I have this all automated in the next release of data-science-pet-containers. I can give you a test dump now for passenger_census. My backup-creation script now makes both versions.
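
Concretely, creating and restoring the two formats would look roughly like this (database and file names are placeholders):

```bash
# Create: plain SQL dump, optionally piped through gzip for compression.
pg_dump -U postgres example_db > example_db.sql
pg_dump -U postgres example_db | gzip -c > example_db.sql.gz

# Restore: psql for the uncompressed dump, gzip -dc piped to psql for the compressed one.
psql -U postgres -d example_db -f example_db.sql
gzip -dc example_db.sql.gz | psql -U postgres -d example_db
```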

MikeTheCanuck commented 6 years ago

Thanks Ed. Let me know where to find the test dump, and please type out as explicit a command line as you can for me to implement the restore operation. I'm following your thinking at the conceptual level, but it'd save me hours of brute-forcing to see exactly how you assemble that restore operation.

MikeTheCanuck commented 6 years ago

This work has evolved into a set of instructions for using compressed SQL to back up and restore an existing database. Since they're one-liner commands, there's no point now in generating a script to streamline the operation:
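
The one-liners are presumably of this shape, streaming through S3 so nothing touches a laptop (all names are placeholders; aws s3 cp accepts - for stdin/stdout):

```bash
# Backup: dump, compress, and stream straight to S3.
pg_dump -U postgres example_db | gzip -c | aws s3 cp - s3://example-bucket/backups/example_db.sql.gz

# Restore: stream from S3, decompress, and pipe into psql.
aws s3 cp s3://example-bucket/backups/example_db.sql.gz - | gzip -dc | psql -U postgres -d example_db
```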