This is a single task broken out from #3, as we don't seem to be making much progress.
Is there a specific project that needs this way of making databases? We've made them from CSVs for Transportation Systems one at a time. I can talk you through it. Here's a short how-to: you need a machine with the `psql` client. This can be the same machine as the server or a different one. Example: the transportation ridership data: https://github.com/hackoregon/ingest-ridership-data/blob/master/ingest.psql
How many of these do you need to build? The code in https://github.com/hackoregon/ingest-ridership-data/ is Dockerized; all you'd need to change is the DDL and the CSV file names for any single-CSV database.
Scaling to multiple CSV files is simply a matter of writing an outer loop to cover all of them. Each CSV file has three steps:

1. Create the table with its DDL.
2. Use `\copy` to load the CSV data into the table.
3. Populate the `geometry` column.

Note that the `\copy` parser may not work for some date fields; you'll have to read them as text and convert them to date / time stamps. I had to do that for our big "congestion" dataset.

Work in progress for multiple (12) CSVs: https://github.com/hackoregon/transportation-congestion-analysis/blob/master/src/data/create-database.bash
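As a sketch, the loop and the three per-CSV steps might look like this (the database name, `ddl/` layout, and the PostGIS `geom` / `lon` / `lat` columns are hypothetical placeholders, not the actual Transportation Systems schema):

```bash
#!/bin/bash
# Sketch only: ridership_db, ddl/, and the lon/lat/geom columns are made-up names.
set -euo pipefail
DB=ridership_db

for csv in data/*.csv; do
  table="$(basename "$csv" .csv)"
  # 1. Create the table from its DDL file
  psql "$DB" -f "ddl/${table}.sql"
  # 2. Load the CSV; \copy runs client-side, so the file stays on the psql client
  psql "$DB" -c "\copy ${table} FROM '${csv}' WITH (FORMAT csv, HEADER true)"
  # 3. Populate the geometry column (assumes PostGIS and lon/lat columns in the DDL)
  psql "$DB" -c "UPDATE ${table} SET geom = ST_SetSRID(ST_MakePoint(lon, lat), 4326);"
done
```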
Rewrote requirements to focus on `pg_dump` rather than CSV input.
It turns out the AWS command-line client is available on the Amazon Linux 2 server: `yum install -y awscli groff`. `groff` is necessary only if you want to run `aws help` at the command line.
So all you'd need to do is write scripts that fetch the `pg_dump` backup files from S3 using the `aws` command and restore them with `pg_restore`. No need to download them to, or worse, upload them from, a laptop / workstation.
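A minimal sketch of such a script, assuming the EC2 machine already has S3 read access (via keys or a Role); the bucket, file, and database names are placeholders:

```bash
#!/bin/bash
# Placeholders throughout: adjust the bucket, file, and database names.
set -euo pipefail

# Fetch the custom-format dump from S3 straight to the EC2 instance
aws s3 cp s3://example-backups/passenger_census.dump /tmp/passenger_census.dump

# Restore it into an existing (empty) database
pg_restore --username postgres --dbname passenger_census --no-owner /tmp/passenger_census.dump

# Or skip the temp file: "aws s3 cp ... -" streams the object to stdout,
# and pg_restore reads from stdin when no file is given
aws s3 cp s3://example-backups/passenger_census.dump - \
  | pg_restore --username postgres --dbname passenger_census --no-owner
```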
Updated problem statement based on our research, to wit:
- `pg_restore` can't read directly from an `s3://` or `https://` protocol - only traditional file descriptors will do
- the `aws` CLI tools need to inherit permissions either explicitly, through AWS keys that have been uploaded to the EC2 machine, or through a Role assigned to the EC2 machine (which Role has been assigned at least the 'AmazonS3ReadOnlyAccess' Policy)

Also, from what I'm finding on the Amazon Linux 2 Docker image, we will probably need to allow for teams creating compressed SQL dumps rather than the `pg_restore`
custom format we've been using. For the creator, it's still a `pg_dump` command, optionally piped to `gzip -c` for compression. For the restore script, however, it's a `psql` command for an uncompressed dump and `gzip -dc` piped to `psql` for a compressed one.
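A minimal sketch of both sides, with `mydb` as a placeholder database name:

```bash
# Creator: plain-SQL dump, optionally compressed with gzip
pg_dump --username postgres mydb > mydb.sql
pg_dump --username postgres mydb | gzip -c > mydb.sql.gz

# Restore, uncompressed dump:
psql --username postgres mydb < mydb.sql

# Restore, compressed dump:
gzip -dc mydb.sql.gz | psql --username postgres mydb
```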
I have this all automated in the next release of `data-science-pet-containers` - I can give you a test dump now for `passenger_census`. My backup-creation script now makes both versions.
Thanks Ed. Let me know where to find the test dump, and please type out as explicit a command line as you can for me to implement the restore operation. I’m following your thinking at the conceptual level, but it’d save me hours of brute-forcing to see exactly how you assemble that restore operation.
This work has evolved into a set of instructions for using compressed SQL to back up and restore an existing database. Since they're one-liner commands, there's no point now in generating a script to streamline the operation:
- Backup databases into compressed SQL format: https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-create-backup-for-new-database-creation.md
- Restore databases from compressed SQL format: https://github.com/hackoregon/civic-devops/blob/master/docs/HOWTO-rebuild-the-centralized-database-service.md#restore-databases-from-backup
Scenario

We have a `pg_dump` backup file already stored in S3, which needs to become a running PostgreSQL database instance to which an API can connect.

Problem
We need a script that performs the following steps:

- connects as the `postgres` database user and verifies the database is empty (if the database is not empty, stops the script and prints a message to the operator); a sketch of that check follows below
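A sketch of that emptiness check, with `mydb` as a placeholder database name (it counts the tables in the `public` schema):

```bash
# Stop if the target database already contains tables ("mydb" is a placeholder)
tables=$(psql --username postgres --dbname mydb -tAc \
  "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public';")
if [ "$tables" -ne 0 ]; then
  echo "Database is not empty (${tables} tables found); stopping." >&2
  exit 1
fi
```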
Prerequisites

Extra credit
It would be ideal for the `pg_dump` file to be restored directly over the network between S3 and the EC2 instance, without having to copy the `pg_dump` file to the local hard drive of the computer used by the person running this script (i.e. slow download from S3 to laptop, then slow `pg_restore` from laptop to EC2). This is not required, but if there's an obvious way to force a direct AWS-to-AWS network connection, that'd save hours of time waiting for the download or restore operations to complete.

Out of Scope