unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Upload to the Internet Archive #63

konklone opened this issue 10 years ago

konklone commented 10 years ago

Using their S3-compatible API: http://archive.org/help/abouts3.txt

I have an Archive account, under eric@konklone.com, and I generated my S3(-like) credentials. I'm not actually sure whether the code to do this upload belongs in this repository -- it could just as easily be a script in a public repo on my own account that runs as a cron on the same box -- but I'm including it here to solicit discussion, and to publicize that I want to get this stuff into the Archive.
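
For context, a minimal sketch of what a single-file upload through that S3-compatible API could look like. This is not this repo's code: the bucket name, file path, and metadata values are illustrative, and the header conventions come from the abouts3.txt doc linked above.

import requests

# Credentials generated on archive.org; placeholders here.
IA_ACCESS_KEY = "..."
IA_SECRET_KEY = "..."

bucket = "unitedstates-data"  # hypothetical bucket (item) name
key = "inspectors-general/data/va/2014/14-02603-267/report.pdf"

with open("data/va/2014/14-02603-267/report.pdf", "rb") as f:
    response = requests.put(
        "http://s3.us.archive.org/%s/%s" % (bucket, key),
        data=f,
        headers={
            # IA's S3 variant authenticates with "LOW accesskey:secret"
            # rather than AWS-style request signing.
            "Authorization": "LOW %s:%s" % (IA_ACCESS_KEY, IA_SECRET_KEY),
            # create the bucket (item) on first upload if it doesn't exist
            "x-amz-auto-make-bucket": "1",
            # item-level metadata travels as x-archive-meta-* headers
            "x-archive-meta-mediatype": "texts",
        },
    )
response.raise_for_status()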

I'll also be contacting the Archive directly to see if they have any above-and-beyond interest in this collection.

/cc @waldoj @spulec

Resources:

Todos:

konklone commented 10 years ago

I've gotten a unitedstates-data bucket going, which made a predictable URL, and auto-created a bunch of metadata files:

https://archive.org/download/unitedstates-data/

To test it out, I uploaded the big VA report from earlier this year.

$ s3cmd put data/va/2014/14-02603-267/* s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
14-02603-267/report.json -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.json  [1 of 3]
 6800 of 6800   100% in    2s     2.69 kB/s  done
14-02603-267/report.pdf -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.pdf  [2 of 3]
 1574313 of 1574313   100% in    3s   470.92 kB/s  done
14-02603-267/report.txt -> s3://unitedstates-data/inspectors-general/data/va/2014/14-02603-267/report.txt  [3 of 3]
 474149 of 474149   100% in    2s   203.02 kB/s  done

Which made this:

https://archive.org/download/unitedstates-data/inspectors-general/data/va/2014/14-02603-267/

Interestingly, 10 minutes after upload, the Internet Archive auto-produced a report_jp2.zip (~70MB) that contains JPEG 2000 (JPG-like) images of each page of the original PDF.

I've sent an email to the Archive asking for guidance or documentation on how we can best structure the collection. In the meantime, I may just upload everything now, once, and worry about creating a sophisticated script for managing cost-effective sync and re-uploading of metadata later.
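
For the record, a rough sketch of what that one-time "upload everything" pass could look like, assuming the same data/ layout and bucket prefix as the s3cmd example above. This isn't a real sync script, just the shape of one; the s3cmd flags used here would need checking against the installed version.

import os
import subprocess

BUCKET_PREFIX = "s3://unitedstates-data/inspectors-general/data"

for ig in sorted(os.listdir("data")):
    ig_dir = os.path.join("data", ig)
    if not os.path.isdir(ig_dir):
        continue
    for year in sorted(os.listdir(ig_dir)):
        year_dir = os.path.join(ig_dir, year)
        if not os.path.isdir(year_dir):
            continue
        for report_id in sorted(os.listdir(year_dir)):
            local_dir = os.path.join(year_dir, report_id)
            remote_dir = "%s/%s/%s/%s/" % (BUCKET_PREFIX, ig, year, report_id)
            # --skip-existing keeps a re-run from re-uploading files
            # that already made it into the bucket.
            subprocess.check_call(["s3cmd", "put", "--recursive", "--skip-existing",
                                   local_dir + "/", remote_dir])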

konklone commented 10 years ago

Also, the Internet Archive has absolutely insane public logging for all this.

konklone commented 10 years ago

Some more pages relevant to our collection:

manager

editor

okay

The Internet Archive is extremely cool.

waldoj commented 10 years ago

I'm glad that storing this stuff on the Archive is going so well. It's really the perfect home for this stuff.

Well, I mean, a .gov site is the perfect home for this. But, short of that, the Archive is the best home for this.

konklone commented 10 years ago

The plan now, after talking with IA, is to store each report as an "item" in the collection, rather than putting them in one bucket. An "item" (bucket) is supposed to have the same metadata for everything. Currently, the unitedstates-data bucket is considered one "item":

https://archive.org/details/unitedstates-data

The Archive is willing to make a Collection for the items, but needs at least 50 "items" uploaded.

So each item would be its own bucket, with an ID something like unitedstates-inspector-general-EPA-2004-[report-id]? And then I would ask for those to be categorized. The IDs seem unwieldy, but I think that's the only way, at least to start.

FWIW, the unresolved issue @divergentdave found with duplicate IDs across years wouldn't come into play here, if I put the year in the ID. The year is a more brittle piece of data than I'd prefer to put in the item's ID, though.
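
To make that concrete, here's roughly how an identifier along those lines could be derived from a scraped report.json. The field names (inspector, year, report_id) are assumptions based on this project's report metadata, and the exact format is just the proposal above, not a settled scheme.

import json

def item_identifier(report_json_path):
    with open(report_json_path) as f:
        report = json.load(f)
    # e.g. "unitedstates-inspector-general-EPA-2004-<report_id>"
    return "unitedstates-inspector-general-%s-%s-%s" % (
        report["inspector"], report["year"], report["report_id"])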

divergentdave commented 10 years ago

It seems like Internet Archive item identifiers can't easily be changed (or deleted, naturally) once uploaded. If we use our report_id in the item identifier, we'll need to make sure we've fixed all our outstanding QA issues before we start uploading (particularly same-year duplicate IDs and bad 404 pages, and maybe duplicate files).

It might also be a good idea to manually review new reports going forward before sending them to IA, in case one of the scrapers starts emitting spurious reports.
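
A sketch of the cross-year duplicate-ID check in question, working straight off the data/ directory layout (illustrative only, not the project's actual QA script; same-year duplicates would have to be caught at scrape time instead):

import os
from collections import defaultdict

# (inspector, report_id) -> set of years the ID appears under
seen = defaultdict(set)

for ig in os.listdir("data"):
    ig_dir = os.path.join("data", ig)
    if not os.path.isdir(ig_dir):
        continue
    for year in os.listdir(ig_dir):
        year_dir = os.path.join(ig_dir, year)
        if not os.path.isdir(year_dir):
            continue
        for report_id in os.listdir(year_dir):
            seen[(ig, report_id)].add(year)

for (ig, report_id), years in sorted(seen.items()):
    if len(years) > 1:
        print("%s/%s appears in multiple years: %s" % (ig, report_id, ", ".join(sorted(years))))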

konklone commented 10 years ago

I agree that we should get our ID QA in order before submitting everything...and you've basically done that, which is outstanding. I do need to regenerate my archive.

But I think the downside of uploading duplicate or wrongly ID'd content to the Archive is low. It'll happen; we'll make a good-faith effort to keep it in order (and automating the running of the QA script will make that possible), but it's ultimately just not a big deal to have duplicate reports under different IDs. Everything else is overwritable, I think.

konklone commented 10 years ago

For anyone watching this thread, I'm doing some work that will build into a general-purpose Internet Archive uploader, at https://github.com/konklone/bit-voyage.

konklone commented 10 years ago

So Harvard's Perma.cc automatically uploads to the Internet Archive, using ia-wrapper.

def upload_to_internet_archive(self, link_guid):
    # setup
    asset = Asset.objects.get(link_id=link_guid)
    link = asset.link
    identifier = settings.INTERNET_ARCHIVE_IDENTIFIER_PREFIX+link_guid
    warc_path = os.path.join(asset.base_storage_path, asset.warc_capture)

    # create IA item for this capture
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection':settings.INTERNET_ARCHIVE_COLLECTION,
        'mediatype':'web',
        'date':link.creation_timestamp,
        'title':'Perma Capture %s' % link_guid,
        'creator':'Perma.cc',

        # custom metadata
        'submitted_url':link.submitted_url,
        'perma_url':"http://%s/%s" % (settings.HOST, link_guid)
    }

    # upload
    with default_storage.open(warc_path, 'rb') as warc_file:
        success = item.upload(warc_file,
                              metadata=metadata,
                              access_key=settings.INTERNET_ARCHIVE_ACCESS_KEY,
                              secret_key=settings.INTERNET_ARCHIVE_SECRET_KEY,
                              verbose=True,
                              debug=True)
    if success:
        print "Succeeded."
    else:
        print "Failed."
        self.retry(exc=Exception("Internet Archive reported upload failure."))
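
The same ia-wrapper calls could presumably carry one IG report per Archive item. A hedged sketch along those lines, where the identifier scheme, collection name, and metadata field names are assumptions rather than anything we've settled on:

import os
import internetarchive

def upload_report(identifier, report_dir, report, access_key, secret_key):
    # one Archive item per report, holding its JSON, PDF, and extracted text
    item = internetarchive.get_item(identifier)
    metadata = {
        'collection': 'us-inspectors-general',   # hypothetical collection name
        'mediatype': 'texts',
        'title': report['title'],
        'date': report['published_on'],
        'creator': report['inspector'],
    }
    # upload() also accepts a list of file paths
    files = [os.path.join(report_dir, name)
             for name in ('report.json', 'report.pdf', 'report.txt')]
    return item.upload(files, metadata=metadata,
                       access_key=access_key, secret_key=secret_key)
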
konklone commented 10 years ago

A zip viewer for the contents of the bulk data zip I just uploaded: https://ia902205.us.archive.org/zipview.php?zip=/25/items/us-inspectors-general.bulk/us-inspectors-general.bulk.zip

Intended landing page for the bulk data file: https://archive.org/details/us-inspectors-general.bulk

There's no automatic download link for an entire collection, so I'll plan to upload every item in the collection individually, and then upload a bulk file separately.

I have an individual report uploaded and successfully rendering in the Archive's book viewer here: https://archive.org/details/us-inspectors-general.treasury-2014-OIG-14-023