sul-dlss / dor-services-app

A Rails application exposing Digital Object Registry functions as a RESTful HTTP API
https://sul-dlss.github.io/dor-services-app/

Reconsider WARC shelving #5046

Closed justinlittman closed 2 months ago

justinlittman commented 3 months ago

Currently, WARCs are shelved during the shelving step of the accessioning workflow, which is performed by DSA's Shelving Service. With the new Purl Fetcher publish / shelve API, WARCs are the only files that will still be shelved using the legacy approach.

To be considered:

lwrubel commented 2 months ago

While it could make sense to shelve WARCs within wasCrawlDisseminationWF (just as we extract WARCs to stacks from WACZ files in that WF), the workspace will already have been reset at the end of accessionWF, so the WARC/WACZ files would no longer be available there. wasCrawlDisseminationWF works only with files already on web-archiving-stacks.

As for the logic for shelving, we could simply shelve whatever is listed in the cocina structural with shelve=yes and is also present as a file in the workspace.
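To make that selection rule concrete, here is a rough sketch in Ruby. It is only an illustration under stated assumptions, not how DSA does or would implement it: the structural is modeled as a plain nested hash (file sets containing files, each with a `filename` and an `administrative` `shelve` flag treated as a boolean), and `shelvable_filenames` is a hypothetical helper name.

```ruby
require 'pathname'

# Hypothetical helper (not part of DSA): returns the filenames that are
# marked for shelving in the structural AND actually exist as files in the
# workspace directory.
#
# Assumption: `structural` is a plain hash shaped like
#   { contains: [ { structural: { contains: [ { filename:, administrative: { shelve: } } ] } } ] }
def shelvable_filenames(structural, workspace_dir)
  workspace = Pathname.new(workspace_dir)

  # Flatten all files out of their file sets.
  files = structural.fetch(:contains, []).flat_map do |file_set|
    file_set.dig(:structural, :contains) || []
  end

  # Keep files flagged for shelving that are present on disk.
  to_shelve = files.select do |file|
    file.dig(:administrative, :shelve) && workspace.join(file[:filename]).file?
  end

  to_shelve.map { |file| file[:filename] }
end
```

With this rule, a file flagged for shelving but missing from the workspace is silently skipped rather than treated as an error; whether that is the right behavior is one of the things to decide.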

Changed files? I think it's been very rare (if ever?) that we've changed a WARC file. The last time I'm aware of that we had to change a WARC file was to support the reconstruction of the SLAC website, and in that case we accessioned a new object with the changed WARCs. So I'm not sure that diffing digests is necessary to determine whether or not to shelve something again.

Removing from stacks? We don't currently have a process for "unshelving" web archives. Everything that gets shelved gets indexed in the CDXJ indexes, and there is no process to delete from the indexes or from the web archiving stacks (aside from the step where a WACZ is deleted once its WARCs are extracted).

Would be interested in thoughts from @edsu and @andrewjbtw.