Accession missing Archive-It content

edsu commented 2 years ago

Based on a conversation between @peterchanws, @mjgiarlo and @edsu about the results of #438, we would like to fetch the missing WARC data and perform a one-time accessioning of the files as a set. Based on our analysis, this task involves:

Write a rake task or command line tool that fetches specific WARC data from Archive-It, using the audit results as input, and only from the collections listed below in a comment from Peter
Copy the resulting data to was-registrar-app.stanford.edu:/was_unaccessioned_data/{item-dir-name}

Then Peter will do the following:

Kick off one-time registration using a new SDR collection for the whole set
Ensure that the job finishes and data is available in swap.stanford.edu

If, in the process, it is easy to determine why the content wasn't downloaded the first time, that would be a big bonus.
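
A minimal sketch of the fetch-and-copy planning step described above, in Python rather than as a rake task. The CSV column names (`collection_id`, `filename`), the function name `plan_copies`, and the example item directory name are assumptions for illustration, not the actual audit output format. It only works out which missing files belong to the allowed collections and where each would land under `/was_unaccessioned_data/{item-dir-name}`; the actual download from Archive-It is left out.

```python
import csv
import io
from pathlib import PurePosixPath

# Share named in the issue; the item directory beneath it is chosen per run.
UNACCESSIONED_ROOT = PurePosixPath("/was_unaccessioned_data")


def plan_copies(audit_csv_text, allowed_collections, item_dir_name):
    """Return (filename, destination_path) pairs for missing WARCs.

    audit_csv_text: audit results as CSV text with hypothetical columns
        'collection_id' and 'filename'.
    allowed_collections: set of collection ids (strings) to patch.
    item_dir_name: directory name to create under the unaccessioned share.
    """
    dest_dir = UNACCESSIONED_ROOT / item_dir_name
    plan = []
    for row in csv.DictReader(io.StringIO(audit_csv_text)):
        # Skip anything outside the collections listed for this one-time patch.
        if row["collection_id"] not in allowed_collections:
            continue
        plan.append((row["filename"], str(dest_dir / row["filename"])))
    return plan


if __name__ == "__main__":
    audit = "collection_id,filename\n123,A.warc.gz\n999,B.warc.gz\n"
    for name, dest in plan_copies(audit, {"123"}, "example-item"):
        print(name, "->", dest)
```

Keeping the planning separate from the download makes it possible to review the list of files and destinations, and check available space on the share, before anything is fetched.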

peterchanws commented 2 years ago

Here is the link to the list of collections https://docs.google.com/spreadsheets/d/1dPZBmuh6_NvjCTTCjI8PHfKpRRLXxPi0YxRDj-APhzQ/edit#gid=0

justinlittman commented 2 years ago

I think it will be even easier to just kick off the one-time registration as part of the rake task (so @peterchanws doesn't need to do it). Is that acceptable?

Also, @peterchanws is there a preferred format/structure for the source ids or is that something that you would like to specify for each one-time registration?

edsu commented 2 years ago

@justinlittman for the registration to kick off, the rake task will need to be run on the machine where the /was_unaccessioned_data/ share is available. This seems like a reasonable assumption since it will also need access to the WRA db. So yep, if it makes sense to register it as part of the task, that is good. We may run into issues where there is not enough space available (in qa and stage).

peterchanws commented 2 years ago

Thanks, Justin. It would be great if we could add something like "patch-missing" as a suffix to differentiate them from the regular registrations.

justinlittman commented 2 years ago

Am I correct in assuming that this should respect embargo?

peterchanws commented 2 years ago

Yes.

justinlittman commented 2 years ago

I've re-audited the collections that failed the initial audit. https://docs.google.com/spreadsheets/d/1QRQMsuKp8PO19-oVRx9DbbO3FChUSZzSmr0wipQnNkw/edit#gid=0 has been updated with the results.

In addition, I have analyzed the reasons that WARCs might be missing. The following summarizes the results:

Preceded fetch start month (10 collections)

Collections in WRA are configured with a month/year to begin fetching. These AIT collections contain WARCs that precede the fetch start date.

Are there changes necessary to prevent this from happening?
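
The condition in this category can be written as a simple month comparison. This is a sketch only: the function name and the idea that the fetch start is held as a separate year and month are assumptions, not how WRA actually models it.

```python
from datetime import date


def precedes_fetch_start(warc_date, fetch_start_year, fetch_start_month):
    """True if a WARC's crawl date falls in a month before the configured
    fetch start month, meaning a month-based fetch would never pick it up."""
    return (warc_date.year, warc_date.month) < (fetch_start_year, fetch_start_month)


if __name__ == "__main__":
    # A WARC from December 2016 in a collection configured to start at January 2017.
    print(precedes_fetch_start(date(2016, 12, 20), 2017, 1))
```

A check like this could flag, at audit time, WARCs that are missing only because of the configured start month.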

Crawl spans month (3 collections)

The AIT crawl is configured to be performed at the end of the month; the WARC files are not stored until the next month. In these cases, the fetch happens after the crawl is performed but before the crawl is saved.

To avoid this, fetching should be performed late in the month instead of daily or there should be a minimum 1 month embargo.

Crawl created in one month but saved in next month (2 collections)

I don't fully understand the AIT functionality, but for some manual crawls it seems that performing the crawl is a separate step from saving the crawl. In these cases, the fetch happens after the crawl is performed but before the crawl is saved.

To avoid this, fetching should be performed late in the month instead of daily or there should be a minimum 1 month embargo.
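
One possible reading of the "minimum 1 month embargo" suggestion, as a sketch: a crawl month only becomes eligible for fetching once at least one full calendar month boundary has passed. The function names and the `embargo_months` parameter are hypothetical and do not come from WRA.

```python
from datetime import date


def months_between(earlier, later):
    """Whole calendar months from the month of `earlier` to the month of `later`."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def is_fetchable(crawl_date, today, embargo_months=1):
    """True once the crawl's month is at least `embargo_months` behind today's month."""
    return months_between(crawl_date, today) >= embargo_months


if __name__ == "__main__":
    # Crawl performed on January 31; a fetch on the same day is too early,
    # while a fetch in February is allowed.
    print(is_fetchable(date(2022, 1, 31), date(2022, 1, 31)))
    print(is_fetchable(date(2022, 1, 31), date(2022, 2, 1)))
```

Note that this only delays the fetch to the following month; a crawl saved late in that following month could still be missed if the fetch runs early in it, which is why fetching late in the month is listed as the alternative.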

Crawl item is in graveyard for unknown reasons (2 collections)

It is unclear what landed the item in the graveyard. These items do not include any WARCs.

arc.gz are unsupported by WRA (1 collection)

To address this, arc.gz support will need to be added in multiple code locations.
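
As an illustration of the kind of check that would need to change, here is a sketch of filename handling that accepts both formats. The function name and return values are assumptions; the places in WRA that would actually need updating are not shown in this thread.

```python
def archive_format(filename):
    """Classify a web archive file by its extension.

    Returns 'warc' for .warc.gz, 'arc' for .arc.gz, and None otherwise.
    """
    name = filename.lower()
    if name.endswith(".warc.gz"):
        return "warc"
    if name.endswith(".arc.gz"):
        return "arc"
    return None


if __name__ == "__main__":
    for name in ("EXAMPLE-1.warc.gz", "EXAMPLE-2.arc.gz", "notes.txt"):
        print(name, archive_format(name))
```

Centralizing the extension check in one helper would avoid repeating the `.warc.gz` assumption in each code location.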

Unexplained (2 collections)

I was unable to determine the reason that the WARCs were omitted.

Single AIT collection is split between 2 SDR collections (1 collection)

edsu commented 2 years ago

It sounds like we may want to adjust the schedule to run fetches nearer the end of the month (the 28th) to avoid some of the issues you've identified here @justinlittman? Shall we create a separate issue for that?

It would be great if the audit tool could report some of these diagnostics about why things may be missing. But I don't know if that's feasible.

peterchanws commented 2 years ago

"Crawl created in one month but saved in next month (2 collections)": We can conduct a "Test Crawl" for seeds and decide whether we want to save them within 60 days after the crawl ends.

mjgiarlo commented 2 years ago

@peterchanws says we can close this

sul-dlss / was-registrar-app