sfbrigade / sf-openreferral-data

Archive of data for the SF OpenReferral project
https://github.com/sfbrigade/sf-openreferral

We can import/export through pg_dump #6

Closed: jasonlally closed this issue 9 years ago

jasonlally commented 10 years ago

It appears we can run imports and exports through scripts that just use the pg_dump command to import to and export from the database. cc @fishmanadam @JesseAldridge @dana11235 @benjamin0

script/export
script/import
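
For reference, here is a minimal sketch of what such wrapper scripts typically look like, assuming they are thin wrappers around pg_dump/pg_restore and that DATABASE_URL points at the target database; the file path and flags are illustrative, and the actual scripts in this repo may differ.

    # script/export (sketch): dump the current database to an archive file
    pg_dump --format=custom --no-owner --no-acl "$DATABASE_URL" > db/archive.dump

    # script/import (sketch): restore a previously exported archive
    pg_restore --clean --no-owner --no-acl --dbname "$DATABASE_URL" db/archive.dump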

This could allow us to systematically load data and make changes like adding categories in an iterative fashion without losing any work.

The process would be to load data locally through rake load_data until all errors for a load file are addressed, then load to Heroku, then export the data. People can then make updates, verifications, etc. on Heroku, and when we're ready to load new data, we can do so without fear of overwriting work. That will let us do some work in parallel and bring additional help on to verify, update, and add categories while we clean up the load files piece by piece.
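
A rough outline of that process in commands, assuming the rake load_data task mentioned above and the Heroku toolbelt of the era; the steps are illustrative rather than the exact repo tooling:

    # 1. Iterate locally until a load file imports with no errors
    bundle exec rake load_data

    # 2. Run the same load against the Heroku app
    git push heroku master
    heroku run rake load_data

    # 3. Snapshot the database so later loads can't clobber manual work
    heroku pgbackups:capture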

Want to verify with Sameer on Wednesday a number of things, but my understanding is that future migrations to new schemas will be automated or at least well documented enough that we can go from one db schema to the new one without losing all this work. But let's see if it's already in the plan, because if it's not, it should be.

francisli commented 10 years ago

Just a heads-up that you'll probably need to modify those scripts a bit to get the behaviour you describe. pg_dump is typically used for backup/restore, so the default behaviour is to drop and re-create the database on import (so you'll lose and effectively overwrite the existing data).
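
Whether an import wipes existing data comes down to the flags the script passes on the restore side; a quick sketch of the two modes, assuming custom-format dumps (the repo's scripts may use different options):

    # Destructive restore: drop existing objects, then re-create and load them
    pg_restore --clean --dbname "$DATABASE_URL" db/archive.dump

    # Data-only restore: append rows into the existing schema instead
    pg_restore --data-only --dbname "$DATABASE_URL" db/archive.dump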

dana11235 commented 10 years ago

pg_dump creates a full backup of the database. It's good for creating an archival copy, but not so good for doing an export that can be easily manipulated or incrementally loaded. Heroku actually has functionality that makes it easy to store snapshots of the DB (heroku pgbackups).
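
A sketch of that snapshot workflow with the pgbackups add-on of the time (the app name is a placeholder; newer CLIs expose the same thing as heroku pg:backups):

    # Capture a snapshot of the app's database
    heroku pgbackups:capture --app sf-openreferral

    # List snapshots and get a download URL for the most recent one
    heroku pgbackups --app sf-openreferral
    heroku pgbackups:url --app sf-openreferral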

jasonlally commented 10 years ago

Right, understood, that's what I'm getting at. We would blow out the database and restart with the archived database before loading new data into the live (Heroku) version. Any new data loaded would not include data already in the database; it would be a completely new set.

For example, after loading the SS4Women datasets to our satisfaction, we wouldn't touch those anymore and would move on to other datasets; meanwhile someone can be verifying data on the server, adding categories, etc. We can keep loading and testing new datasets locally until we've gotten the automated bit done to our satisfaction, and then get that up on the server after running an archive, delete, and import.

It'd be easier for me to sketch out what I mean, but we could set up a workflow that would enable some "multi-threading" of the data work.
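
A rough command-level sketch of that archive, delete, import cycle on Heroku (the app name is a placeholder, and the exact commands depend on the toolbelt and scripts in use):

    # Archive: snapshot the live database before touching it
    heroku pgbackups:capture --app sf-openreferral

    # Delete: wipe the live database
    heroku pg:reset DATABASE_URL --app sf-openreferral --confirm sf-openreferral

    # Import: load the newly cleaned datasets from scratch
    heroku run rake load_data --app sf-openreferral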


dana11235 commented 10 years ago

We can also set up multiple apps on Heroku (for free), so if you want a staging server to play around with, that's fine. It's easy to transfer the DB between these apps.
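
For the transfer between apps, one option is Heroku's pg:copy (a sketch; app names are placeholders, and heroku fork was another way to spin up a copy at the time):

    # Copy the production app's database into a staging app
    heroku pg:copy sf-openreferral::DATABASE_URL DATABASE_URL --app sf-openreferral-staging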