Offer a "light" mode - Githubissues

mjordan commented 4 years ago

Over time, large repositories can accumulate millions of fixity check events. For example, 100,000 resources each checked twice a year would result in 200,000 events in the first year. Over 5 years, assuming no additional new resources, the number of events would rise to 1,000,000.

Repository managers who are only interested in failed events, and in retaining the initial and most recent event for a given resource, may want to reduce storage requirements and database size by only retaining those events and discarding all intermediate successful fixity check events. Doing this, and assuming no failed events or newly added resources that fall within Riprap's catchment criteria, would maintain the number of stored events over 5 years at 200,000.
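The arithmetic above can be checked with a short sketch. The function names here are hypothetical and are not part of Riprap; the light-mode figure assumes, as stated above, no failed events and no newly added resources.

```python
def events_full(resources: int, checks_per_year: int, years: int) -> int:
    """Events stored when every fixity check event is retained."""
    return resources * checks_per_year * years


def events_light(resources: int, checks_per_year: int, years: int) -> int:
    """Events stored when only the initial and most recent events are kept.

    Assumes no failed events and no newly added resources.
    """
    total_checks = checks_per_year * years
    # A resource checked only once has one event; otherwise it has two.
    return resources * min(total_checks, 2)


if __name__ == "__main__":
    for year in range(1, 6):
        full = events_full(100_000, 2, year)
        light = events_light(100_000, 2, year)
        print(f"year {year}: full={full:,} light={light:,}")
```

Raising `checks_per_year` increases the full total but leaves the light total unchanged.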

Failed events should never be discarded, since their purpose is to detect and track hardware and software failures, tampering, accidental deletion, etc.

To accommodate this type of fixity event retention, Riprap should offer an optional "light" mode. One implementation could be: when a resource is checked, all of its existing successful events other than the initial (earliest) event are deleted before the current event is persisted to the database. Each subsequent fixity check follows the same pattern: all existing successful events other than the first are deleted before the current event is persisted. If the current event is a failed event, the same pattern applies: all existing successful events other than the first are deleted before the failed event is persisted.
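A minimal in-memory sketch of that retention rule follows. It is not Riprap's code: the `Event` class and `persist_light` function are hypothetical, and the sketch assumes that "initial" means the earliest event for the resource regardless of outcome. Failed events are never removed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    resource_id: str
    timestamp: int  # any sortable value works for this sketch
    success: bool


def persist_light(events: List[Event], current: Event) -> List[Event]:
    """Return the stored events after persisting `current` in light mode."""
    same = [e for e in events if e.resource_id == current.resource_id]
    # Assumption: the initial event is the earliest one for this resource.
    initial = min(same, key=lambda e: e.timestamp, default=None)
    kept = [
        e
        for e in events
        if e.resource_id != current.resource_id  # other resources untouched
        or not e.success                         # failed events always kept
        or e is initial                          # initial event always kept
    ]
    kept.append(current)
    return kept


if __name__ == "__main__":
    stored = [Event("r1", 1, True), Event("r1", 2, True), Event("r1", 3, False)]
    for e in persist_light(stored, Event("r1", 4, True)):
        print(e)
```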

Tagging @ajstanley, @seth-shaw-unlv, @elizoller, @whikloj for sanity checks on this idea. Would welcome feedback from everyone however.

seth-shaw-unlv commented 4 years ago

I think this is a good idea, although I would purge the older successful events after persisting the most recent one, on the off chance that something fails between the event deletion and the new event persisting.

mjordan commented 4 years ago

Yes, I thought of that too. Or maybe wrap the whole operation in a transaction?
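A rough sketch of the transactional approach, using Python's standard `sqlite3` module for illustration only. The `events` table, its columns, and the `persist_and_prune` function are assumptions made for this example and do not reflect Riprap's actual schema or code. The new event is inserted first and older successful events are pruned afterwards, all inside one transaction, so a failure at any step leaves the stored events unchanged.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    resource_id TEXT NOT NULL,
    timestamp INTEGER NOT NULL,
    success INTEGER NOT NULL
)
"""


def persist_and_prune(conn, resource_id, timestamp, success):
    """Insert the current event, then prune, inside a single transaction."""
    # `with conn` commits if the block succeeds and rolls back on an exception.
    with conn:
        cur = conn.execute(
            "INSERT INTO events (resource_id, timestamp, success) VALUES (?, ?, ?)",
            (resource_id, timestamp, int(success)),
        )
        new_id = cur.lastrowid
        # Earliest event for this resource (at least the new row exists).
        first_id = conn.execute(
            "SELECT id FROM events WHERE resource_id = ? "
            "ORDER BY timestamp, id LIMIT 1",
            (resource_id,),
        ).fetchone()[0]
        # Delete successful events except the initial and the current one.
        conn.execute(
            "DELETE FROM events WHERE resource_id = ? AND success = 1 "
            "AND id NOT IN (?, ?)",
            (resource_id, first_id, new_id),
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    for ts, ok in [(1, True), (2, True), (3, False), (4, True), (5, True)]:
        persist_and_prune(conn, "r1", ts, ok)
    print(conn.execute("SELECT timestamp, success FROM events ORDER BY id").fetchall())
```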

ajstanley commented 4 years ago

That's pretty much what we decided at UPEI -

elizoller commented 4 years ago

@staceyerdman what are your thoughts?

mjordan commented 4 years ago

Another advantage of using this option is that you could run your fixity checks more often without expending more storage space.

staceyerdman commented 4 years ago

> @staceyerdman what are your thoughts?

While I can certainly understand the desire to keep the size of the database down, I am concerned about losing the true audit trail of fixity checking events. Most digital preservation audit frameworks like TRAC, or assessment best practices (like the NDSA Levels) suggest keeping logs of all fixity events. The logs are part of the overall provenance of the object, and I would not be in favor of losing that. Sacrificing the frequency of fixity checks is an okay trade-off as far as I am concerned, as research into fixity checking procedures and data corruption/loss has not really shown that super frequent checking is necessary. However, this is just my two cents. 😊 I think developing this as an option is a good idea, but I would prefer it not be the default.

mjordan commented 4 years ago

@staceyerdman thanks very much. I was not suggesting this to be the default but rather a configuration option that site admins would need to turn on. I agree that a robust audit trail would include every fixity check event. The tradeoffs of enabling this option would be fully documented.

staceyerdman commented 4 years ago

@mjordan right on! I think it's great to build this kind of flexibility into the tool. I was mostly replying to @elizoller and thinking about our own local use case.

mjordan commented 4 years ago

@staceyerdman you might be interested in the Islandora PREMIS module I started over the holidays. It expresses fixity events into PREMIS RDF. Needs lots of work, but it's a start: https://github.com/mjordan/islandora_premis

mjordan commented 4 years ago

Implemented with 638642db940bdd5b30fbf5aefa1ab65d5da04d52 and documented in the README as "Thin mode". Thin mode is only activated by putting thin: true in your config file, so existing configurations are not impacted.

mjordan / riprap

Offer a "light" mode #63