Closed: edsu closed this issue 2 years ago
Here is the link to the list of collections: https://docs.google.com/spreadsheets/d/1dPZBmuh6_NvjCTTCjI8PHfKpRRLXxPi0YxRDj-APhzQ/edit#gid=0
I think it will be even easier to just kick off the one-time registration as part of the rake task (so @peterchanws doesn't need to do it). Is that acceptable?
Also, @peterchanws, is there a preferred format/structure for the source ids, or is that something you would like to specify for each one-time registration?
@justinlittman for the registration to kick off, the rake task will need to be run on the machine where the /was_unaccessioned_data/ share is available. This seems like a reasonable assumption, since it will also need access to the WRA db. So yep, if it makes sense to register it as part of the task, that is good. We may run into issues where there is not enough space available (in qa and stage).
Thanks, Justin. It would be great if we could add something like "patch-missing" as a suffix to differentiate them from the regular registrations.
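Just to make the idea concrete, here is a rough sketch of kicking off the registrations from the task itself with a "patch-missing" suffix. This is not the actual was-registrar-app code (which is Ruby/Rails); the source id format and the `register_one_time` helper are hypothetical.

```python
# Illustrative sketch only: the real app is Ruby/Rails, and the source id
# format and register_one_time() helper below are hypothetical.
from dataclasses import dataclass


@dataclass
class Collection:
    ait_id: str   # Archive-It collection id from the spreadsheet
    druid: str    # SDR druid of the collection object


def patch_source_id(collection: Collection) -> str:
    """Build a source id with a suffix that sets patched items apart
    from regular registrations."""
    return f"ARCHIVEIT-{collection.ait_id}-patch-missing"  # prefix is a placeholder


def register_missing(collections: list[Collection]) -> None:
    """Kick off a one-time registration for each collection from the task itself."""
    for collection in collections:
        source_id = patch_source_id(collection)
        print(f"registering {source_id} for {collection.druid}")
        # register_one_time(collection, source_id)  # hypothetical registration call
```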
Am I correct in assuming that this should respect the embargo?
Yes.
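For the record, one way the embargo check could be expressed when selecting WARCs to fetch. Treating the embargo as a per-collection number of months, and the field names here, are assumptions rather than the app's actual model.

```python
# Sketch of an embargo filter; a per-collection embargo expressed in months
# and these field names are assumptions, not the app's actual model.
from datetime import date, timedelta


def under_embargo(crawl_date: date, embargo_months: int, today: date | None = None) -> bool:
    """True if a WARC's crawl date still falls inside the embargo window."""
    today = today or date.today()
    # Approximate a month as 30 days for the sketch.
    return crawl_date > today - timedelta(days=30 * embargo_months)


# Example: keep only WARCs that are out of a 12-month embargo.
warcs = [{"filename": "EXAMPLE-20230101.warc.gz", "crawl_date": date(2023, 1, 1)}]
fetchable = [w for w in warcs if not under_embargo(w["crawl_date"], embargo_months=12)]
```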
I've re-audited the collections that failed the initial audit. https://docs.google.com/spreadsheets/d/1QRQMsuKp8PO19-oVRx9DbbO3FChUSZzSmr0wipQnNkw/edit#gid=0 has been updated with the results.
In addition, I have analyzed the reasons that WARCs might be missing. The following summarizes the results:
- WARCs precede the fetch start date: Collections in WRA are configured with a month/year to begin fetching, and these AIT collections contain WARCs that precede the fetch start date. Are there changes necessary to prevent this from happening?
- Crawl performed in one month, saved the next: The AIT crawl is configured to be performed at the end of the month, but the WARC files are not stored until the next month. In these cases, the fetch happens after the crawl is performed but before the crawl is saved. To avoid this, fetching should be performed late in the month instead of daily, or there should be a minimum one-month embargo (see the sketch after this list).
- Manual crawls performed but not yet saved: I don't fully understand the AIT functionality, but for some manual crawls it seems that performing the crawl is a separate step from saving the crawl. In these cases, the fetch happens after the crawl is performed but before the crawl is saved. To avoid this, fetching should be performed late in the month instead of daily, or there should be a minimum one-month embargo.
- Graveyard items: It is unclear what landed these items in the graveyard. They do not include any WARCs.
- arc.gz files: To address these, arc.gz support will need to be added in multiple code locations (see the sketch after this list).
- Unknown: I was unable to determine the reason that the WARCs were omitted.
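To make the mitigations above concrete, here is a sketch of a minimum one-month age check and of accepting arc.gz alongside warc.gz. The helper names are illustrative, not the was-registrar-app implementation.

```python
# Sketch of the mitigations above; helper names are illustrative, not the
# was-registrar-app implementation.
from datetime import date, timedelta


def old_enough_to_fetch(crawl_date: date, today: date | None = None) -> bool:
    """Minimum one-month embargo: only fetch WARCs at least a month old, so a
    crawl performed at month end but saved the next month isn't missed."""
    today = today or date.today()
    return crawl_date <= today - timedelta(days=31)


def is_supported_archive(filename: str) -> bool:
    """Accept arc.gz alongside warc.gz; in the real app this check would need
    to change in multiple code locations."""
    return filename.endswith((".warc.gz", ".arc.gz"))
```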
It sounds like we may want to adjust the schedule to run fetches nearer the end of the month (the 28th) to avoid some of the issues you've identified here, @justinlittman. Shall we create a separate issue for that?
It would be great if the audit tool could report some of these diagnostics about why things may be missing. But I don't know if that's feasible.
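Something like the following could be a starting point for that kind of diagnostic in the audit report. The inputs and categories are assumptions drawn from the analysis above, not existing audit-tool behavior.

```python
# Hypothetical per-file diagnostic for the audit report; the inputs and
# categories are assumptions drawn from the analysis above.
from datetime import date, timedelta


def missing_warc_reason(filename: str, crawl_date: date, fetch_start: date,
                        audited_on: date) -> str:
    """Return a best-guess reason a file was never fetched."""
    if filename.endswith(".arc.gz"):
        return "arc.gz file (not currently supported)"
    if crawl_date < fetch_start:
        return "precedes the collection's fetch start date"
    if audited_on - crawl_date < timedelta(days=31):
        return "crawl may not have been saved before the fetch ran"
    return "unknown"
```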
"Crawl created in one month but saved in next month (2 collections)" We can conduct "Test Crawl" for seeds and decide if we wait to save them within 60 days after the crawl end.
@peterchanws says we can close this
Based on a conversation between @peterchanws, @mjgiarlo, and @edsu about the results of #438, we would like to fetch the missing WARC data and perform a one-time accessioning of the files as a set. Based on analysis, this task involves:
was-registrar-app.stanford.edu:/was_unaccessioned_data/{item-dir-name}
Then Peter will do the following:
If in the process of this it is easy to determine why the content wasn't downloaded the first time, that would be a big bonus.
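As a purely illustrative sketch of the staging step described above: the layout under /was_unaccessioned_data/{item-dir-name} and the helper name are assumptions, not the actual task.

```python
# Illustration of where fetched files would be staged; the layout and helper
# are assumptions, not the actual task.
from pathlib import Path

SHARE_ROOT = Path("/was_unaccessioned_data")


def stage_path(item_dir_name: str, filename: str) -> Path:
    """Destination for a fetched file: /was_unaccessioned_data/{item-dir-name}/{filename}.
    The item directory would be created before downloading into it."""
    return SHARE_ROOT / item_dir_name / filename
```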