unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Backup process to the Internet Archive #184

Closed konklone closed 9 years ago

konklone commented 9 years ago

This adds a backup script to the root of the repository, to back up reports and bulk data to the Internet Archive.

It's mildly intended to be generalizable (e.g. backing up to one's own S3 account), but not much: it's hooked pretty tightly into scripts/backup/ia.py.

More details about how the Internet Archive's uploading system works can be found in #63. That issue won't be complete until the collection is fully uploaded to IA and syncing on a regular basis. This PR covers the scripts that will be used in those processes, and that have already been used to upload what's in the collection right now.

Backing up individual reports

The primary usage looks like:

./backup [--ig] [--year] [--report_id] [--force] [--meta]

If you want to specify a particular --report_id, you have to specify an --ig and a --year too. By default, the backup script marks archived reports with a little ia.done file in that report's directory, and won't upload reports that already have one. The --force flag overrides this behavior and always uploads the specified reports. The --meta flag uploads only a report's .json metadata, and not the report itself, which is really only useful for testing.
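For example, a couple of invocations might look like this (the IG name and report ID below are just illustrative placeholders, and the exact flag syntax may differ slightly from what the script accepts):

# back up all of one IG's reports for a given year
./backup --ig=opm --year=2014

# re-upload a single report, even if it's already marked with ia.done
./backup --ig=opm --year=2014 --report_id=example-report-id --force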

So a cronjob that wanted to keep the report archive in sync with IA could just run ./backup every X hours or days.
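As a rough sketch, a crontab entry for that could look like the following, assuming the repository lives at /path/to/inspectors-general and the cron user's environment already has the credentials the script needs (the schedule and log path are just examples):

# sync newly scraped reports to the Internet Archive every night at 3am
0 3 * * * cd /path/to/inspectors-general && ./backup >> /var/log/ig-ia-backup.log 2>&1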

Reports will be uploaded to the usinspectorsgeneral collection at https://archive.org/details/usinspectorsgeneral, which the Internet Archive created for the project after we uploaded ~90-100 reports using this script. The reports will be submitted to the Archive's "derivation queue", which should give each PDF-based report a pleasant little report reading interface, like this one.

Backing up a giant bulk data file

An alternate use is:

./backup --bulk=us-inspectors-general.bulk.zip

If given a --bulk flag with a path to a zip file, the script will upload that file directly to the Archive, at https://archive.org/details/us-inspectors-general.bulk. This item is marked as part of the usinspectorsgeneral collection, and it's the permanent link I'm giving people (and linking to from the collection's description) as a quick way to download all ~40GB of the reports we have so far.
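For anyone just fetching the bulk file, the Archive's standard per-item download path should work as a direct link (this URL is inferred from archive.org's usual download layout and assumes the uploaded filename matches the zip name used below):

wget https://archive.org/download/us-inspectors-general.bulk/us-inspectors-general.bulk.zip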

Using the bulk data backup means creating a .zip file of everything, excluding the .done files:

cd /path/to/inspectors-general/data
zip -r ../us-inspectors-general.bulk.zip * -x "*.done"

Then uploading that zip file to the Internet Archive with:

cd /path/to/inspectors-general
./backup --bulk=us-inspectors-general.bulk.zip

This would also be suitable to do via cron, but maybe monthly or weekly. It's an expensive operation for everyone.
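One way that might be wired up is to wrap the two steps above in a small script and call it from a monthly crontab entry; the script name, paths, and schedule here are hypothetical:

#!/bin/bash
# bulk-backup.sh (hypothetical name): rebuild the bulk zip and push it to the Archive.
set -e
cd /path/to/inspectors-general

# start from a fresh zip so deleted or renamed reports don't linger in the archive
rm -f us-inspectors-general.bulk.zip
(cd data && zip -r ../us-inspectors-general.bulk.zip * -x "*.done")

./backup --bulk=us-inspectors-general.bulk.zip

# crontab entry: run at 4am on the first of each month
# 0 4 1 * * /path/to/inspectors-general/bulk-backup.sh >> /var/log/ig-bulk-backup.log 2>&1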

Next steps

The next step is a full re-download of reports. @divergentdave has done yeoman's work in detecting and removing duplicate IDs, and integrating duplicate ID checking into the main report fetching process. @spulec did similarly rigorous work going back and adding report type detection to all the scrapers. These efforts won't apply to the full archive I have on my server, or the one Sunlight has on theirs, unless we re-download everything. It'd be more efficient to migrate the collection in place, but that's not really feasible.

I've initiated that re-download on my servers. Once that's done, I'll do a full backup to the Internet Archive, using the methods above.

Finally, once those are done, I'll set up a couple of cronjobs to keep things in sync automatically. At that point, #63 will be fixed, and we'll have a pretty nice system on our hands.

parkr commented 9 years ago

It all looks sane to me. :+1: