Audit WARC files

@andrewjbtw and @apchan have reason to believe that there is WARC data at Archive-It that has not been transferred to Stanford.

It would be useful to create a report that uses the WASAPI and/or Archive-It APIs to determine whether all of the available WARCs have been retrieved and accessioned into SDR. It should be possible to get the list of WARC files from Archive-It, but it is not immediately clear how to look up these filenames in SDR without direct access to the database.
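The comparison itself could be sketched roughly as below. This is only an illustration, not existing was-registrar-app code: the function names are hypothetical, and it assumes each WASAPI response page is a JSON object with a `files` list (each entry carrying a `filename`) and a `next` URL that is null on the last page. The HTTP call is passed in as `fetch_page` so that authentication and the actual endpoint stay out of the sketch.

```python
from typing import Callable, Iterable, Iterator, List, Optional


def wasapi_filenames(fetch_page: Callable[[str], dict], first_url: str) -> Iterator[str]:
    """Yield every WARC filename, following `next` links page by page.

    Assumption: each page looks like {"files": [{"filename": ...}, ...], "next": url-or-None}.
    `fetch_page` is a hypothetical callable that returns the parsed JSON for a URL.
    """
    url: Optional[str] = first_url
    while url:
        page = fetch_page(url)
        for entry in page.get("files", []):
            yield entry["filename"]
        url = page.get("next")


def missing_from_sdr(archive_it_names: Iterable[str], sdr_names: Iterable[str]) -> List[str]:
    """Return the filenames Archive-It lists that are not known to SDR, sorted."""
    return sorted(set(archive_it_names) - set(sdr_names))
```

The report would then be the output of `missing_from_sdr`, given the Archive-It listing and whatever set of filenames can be gathered from SDR.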

The filenames appear to be in the structural Cocina metadata. For example, look for CDL-20141101134736-00003-grebe.ucop.edu-00536873.arc.gz in https://argo.stanford.edu/items/druid:bb077hj4590.json

One option might be to use Argo as an API to page through all the objects that belong to the Web Archive Crawl Object APO, fetch the Cocina for each item, and then build a lookup table (warc file name -> druid). But Argo requires Shibboleth authentication, which could be problematic.
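A minimal sketch of the lookup-table step, assuming the Cocina JSON for each item has already been fetched somehow (the Argo paging and Shibboleth question is left open here). It assumes the usual Cocina layout, where `structural.contains` holds file sets, each file set's `structural.contains` holds files with a `filename`, and the druid is in `externalIdentifier`. The function names are hypothetical.

```python
from typing import Dict, Iterable, List


def filenames_in_cocina(cocina: dict) -> List[str]:
    """Collect every filename listed in an object's structural metadata.

    Assumption: files sit at structural.contains[].structural.contains[].filename.
    """
    names = []
    for file_set in cocina.get("structural", {}).get("contains", []):
        for file_entry in file_set.get("structural", {}).get("contains", []):
            if "filename" in file_entry:
                names.append(file_entry["filename"])
    return names


def build_lookup(cocina_objects: Iterable[dict]) -> Dict[str, str]:
    """Map each WARC/ARC filename to the druid of the object that contains it."""
    lookup = {}
    for cocina in cocina_objects:
        druid = cocina["externalIdentifier"]
        for name in filenames_in_cocina(cocina):
            lookup[name] = druid
    return lookup
```

If the same filename turned up in more than one object, this sketch would keep only the last druid seen; an audit might want to record such collisions instead.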

This could be a command-line utility, or it could be something that was-registrar-app provides an interface to.

sul-dlss / was-registrar-app

Audit WARC files #438