hackoregon / civic-devops

Master collection point for issues, procedures, and code to manage the HackOregon Civic platform

Define script that initializes new database from PostgreSQL dump file in S3 #17

Closed: MikeTheCanuck closed this issue 6 years ago

MikeTheCanuck commented 6 years ago

Scenario

We have a pg_dump backup file already stored in S3, and it needs to become a running PostgreSQL database instance to which an API can connect.

Problem

We need a script that performs the following steps (a rough sketch follows the list):

  1. SSHes into the EC2 machine running the PostgreSQL server (or shunts the following commands over an SSH connection)
  2. Reads a PostgreSQL dump file from a specified path in a specified S3 bucket (authenticating via a role assigned to the machine instance or via AWS keys in the shell)
  3. Connects to the local PostgreSQL database instance as the postgres database user and verifies that the target database is empty (if it is not empty, the script stops and prints a message to the operator)
  4. Restores the data in the backup file to the specified database instance
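
A minimal sketch of such a script, run from an operator's machine over a single SSH connection; the host, bucket, key, and database names are placeholders, not actual project values:

```bash
#!/usr/bin/env bash
# restore-from-s3.sh -- sketch only; all names below are hypothetical
set -euo pipefail

EC2_HOST="ec2-user@db.example.com"                 # placeholder EC2 host
S3_URI="s3://example-bucket/backups/example.dump"  # placeholder S3 path
DB_NAME="example_db"                               # placeholder target database

# Steps 2-4 run on the EC2 instance over one SSH connection (step 1).
ssh "$EC2_HOST" bash -s <<EOF
set -euo pipefail

# 2. Fetch the dump from S3 (instance role, or AWS keys in the shell).
aws s3 cp "$S3_URI" /tmp/restore.dump

# 3. Verify the target database is empty before restoring.
TABLE_COUNT=\$(psql -U postgres -d "$DB_NAME" -tAc \
  "SELECT count(*) FROM pg_tables WHERE schemaname = 'public';")
if [ "\$TABLE_COUNT" -ne 0 ]; then
  echo "Database $DB_NAME is not empty; aborting restore." >&2
  exit 1
fi

# 4. Restore the dump into the database.
pg_restore -U postgres -d "$DB_NAME" /tmp/restore.dump
EOF
```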

Prerequisites

Extra credit

Out of Scope

MikeTheCanuck commented 6 years ago

This is a single task broken out from #3, as we don't seem to be making much progress.

znmeb commented 6 years ago

Is there a specific project that needs this way of making databases? We've made them from CSVs for Transportation Systems one at a time. I can talk you through it. Here's a short how-to:

  1. You need a PostgreSQL database server somewhere.
  2. You need the command line psql client. This can be the same machine as the server or a different one.
  3. You need to define the DDL for the table you are making!! This generally requires human interaction with the source of the data. You can guess at the column types, but you shouldn't.

Example: the transportation ridership data: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql
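
For a sense of the shape, a single-CSV ingest of that kind might look like this; a generic sketch, not the contents of the linked file, with invented table, column, and file names:

```bash
#!/usr/bin/env bash
# Single-CSV ingest sketch; every name here is hypothetical.
set -euo pipefail

DB_NAME="example_db"   # placeholder database

# The DDL declares column names and types up front instead of guessing;
# \copy then loads the CSV client-side, from wherever psql is running.
psql -U postgres -d "$DB_NAME" <<'EOF'
DROP TABLE IF EXISTS ridership;
CREATE TABLE ridership (
    stop_id   integer,
    stop_name text,
    ride_date date,
    ons       integer,
    offs      integer
);
\copy ridership FROM 'ridership.csv' WITH (FORMAT csv, HEADER true)
EOF
```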

How many of these do you need to build? The code in https://github.com/hackoregon/ingest-ridership-data/ is Dockerized; all you'd need to change is the DDL and the CSV file names for any single-CSV database.

Scaling to multiple CSV files is simply a matter of writing an outer loop over all of them (see the sketch after this list). Each CSV file takes three steps:

  1. The DDL - create the table with the correct column names and types
  2. Do the \copy to load the CSV data into the table.
  3. Do any post-processing.
    • For Django APIs the tables all have to have a primary key.
    • You may or may not want to add a geometry column.
    • The \copy parser may not work for some date fields; you'll have to read them as text and convert them to date / time stamps. I had to do that for our big "congestion" dataset.
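
A minimal sketch of that outer loop, assuming one DDL file per CSV with matching basenames; the layout and all names below are hypothetical:

```bash
#!/usr/bin/env bash
# Outer-loop sketch: one DDL file per CSV, matched by basename.
set -euo pipefail

DB_NAME="example_db"   # placeholder database

for csv in data/*.csv; do
  table=$(basename "$csv" .csv)

  # 1. The DDL: create the table with the correct column names and types.
  psql -U postgres -d "$DB_NAME" -f "ddl/${table}.sql"

  # 2. \copy the CSV data into the table.
  psql -U postgres -d "$DB_NAME" \
    -c "\copy ${table} FROM '${csv}' WITH (FORMAT csv, HEADER true)"

  # 3. Post-processing, e.g. the primary key that Django requires.
  psql -U postgres -d "$DB_NAME" \
    -c "ALTER TABLE ${table} ADD COLUMN id serial PRIMARY KEY;"
done
```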

Work in progress for multiple (12) CSVs: https://github.com/hackoregon/transportation-congestion-analysis/blob/master/src/data/create-database.bash

MikeTheCanuck commented 6 years ago

Rewrote requirements to focus on pg_dump rather than CSV input.

znmeb commented 6 years ago

It turns out the AWS command-line client is available on the Amazon Linux 2 server: yum install -y awscli groff. groff is needed only if you want to run aws help at the command line.

So all you'd need to do is write scripts that fetch the pg_dump backup files from S3 using the aws command and restore them with pg_restore. No need to download them to, or worse, upload them from, a laptop / workstation.
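
Something like the following, presumably, with placeholder bucket, key, and database names:

```bash
# Fetch a custom-format dump from S3 and restore it, all on the server.
aws s3 cp s3://example-bucket/backups/example.dump /tmp/example.dump
pg_restore -U postgres -d example_db /tmp/example.dump
```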

MikeTheCanuck commented 6 years ago

Updated problem statement based on our research, to wit:

znmeb commented 6 years ago

Also, from what I'm finding on the Amazon Linux 2 Docker image, we will probably need to allow for teams creating compressed SQL dumps rather than the pg_restore custom format we've been using. For the creator, it's still a pg_dump command, optionally piped to gzip -c for compression. For the restore script, however, it's a psql command for an uncompressed dump and gzip -dc piped to psql for a compressed one.

I have this all automated in the next release of data-science-pet-containers. I can give you a test dump now for passenger_census. My backup-creation script now makes both versions.
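
Concretely, creating and restoring the two formats would look roughly like this (database and file names are placeholders):

```bash
# Create: plain SQL dump, optionally piped through gzip for compression.
pg_dump -U postgres example_db > example_db.sql
pg_dump -U postgres example_db | gzip -c > example_db.sql.gz

# Restore: psql for the uncompressed dump, gzip -dc piped to psql for the compressed one.
psql -U postgres -d example_db -f example_db.sql
gzip -dc example_db.sql.gz | psql -U postgres -d example_db
```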

MikeTheCanuck commented 6 years ago

Thanks Ed. Let me know where to find the test dump, and please type out as explicit a command line as you can for me to implement the restore operation. I'm following your thinking at the conceptual level, but it'd save me hours of brute-forcing to see exactly how you assemble that restore operation.

MikeTheCanuck commented 6 years ago

This work has evolved into a set of instructions for using compressed SQL to back up and restore an existing database. Since they're one-liner commands, there's no point now in generating a script to streamline the operation:
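
The one-liners are presumably of this shape, streaming through S3 so nothing touches a laptop (all names are placeholders; aws s3 cp accepts - for stdin/stdout):

```bash
# Backup: dump, compress, and stream straight to S3.
pg_dump -U postgres example_db | gzip -c | aws s3 cp - s3://example-bucket/backups/example_db.sql.gz

# Restore: stream from S3, decompress, and pipe into psql.
aws s3 cp s3://example-bucket/backups/example_db.sql.gz - | gzip -dc | psql -U postgres -d example_db
```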