Great write-up! Thanks, @jmartin-sul. MoabValidationHandler and PreservedObjectHandler are my big ones. Relevant: https://github.com/sul-dlss/preservation_catalog/pull/1277
one of the things I'm wondering about: I think the problem with refactoring MVH and POH is that they are so ... big. Would it be more tractable to either do something smaller, or start with the smallest subset of stuff in the new class(es) and gradually move more and more over?
i think so, or at least for POH (i don't think MVH is a ton of code; it just has interactions with consumers that are sometimes surprising, which makes refactoring and using it a pain).
but yeah, for POH, i anticipate something where we start chipping away at obvious refactoring opportunities in individual methods, as opposed to a grand plan for re-arranging it all at once. that feels both easier and less regression-prone.
but also, considering the upcoming ManyCats work, i think literally just renaming `PreservedObjectHandler` to `CompleteMoabHandler` would be a nice start, because i think the latter name would be more accurate. an annoying but mechanical bunch of find/replace work.
and i think annoying but mechanical renamings have a good track record of making this codebase more intelligible (e.g. the class naming discussions we had toward the end of the original work cycle -- i'm glad we did that work when we did).
@jmartin-sul Given what @justinlittman found in google-books yesterday re: `Dir.chdir` not being thread-safe, we will need to change some code in prescat before we can make the jump from Resque to Sidekiq (assuming we'll be running Sidekiq in its default multi-threaded mode; it defaults to six threads per process).
`DruidVersionZip#create_zip!` uses `Dir.chdir`, and this method is invoked in a background job. The fix is a tiny patch (HT @justinlittman: [sul-dlss/google-books#529](https://github.com/sul-dlss/google-books/pull/529)). From:
```ruby
Dir.chdir(work_dir.to_s) do
  combined, status = Open3.capture2e(zip_command)
  raise "zipmaker failure #{combined}" unless status.success?
end
```
to:
```ruby
combined, status = Open3.capture2e(zip_command, chdir: work_dir.to_s)
raise "zipmaker failure #{combined}" unless status.success?
```
I scanned sul-dlss for `Dir.chdir` and found not much at all. Lots of hits, to be sure, but they're mostly in binstubs, gemspecs, tests, and scripts. So other than this part of prescat, I do not foresee any other thread-safety-related surprises stemming from `Dir.chdir` as we make the move to multi-threaded Sidekiq across the board. cc: @sul-dlss/infrastructure-team
i like that patch. thanks for the research, @mjgiarlo! filed a specific ticket for this: #1519
@jmartin-sul can this EPIC be closed? It's over 2 years old.
@jmartin-sul I'm closing this EPIC that is over 2 years old.
thanks! seems reasonable. also very happy to see how much of this we ended up getting done!
Here's a meta ticket to start collecting suggestions for making Preservation Catalog easier to develop, more robust, more usable, etc. We can spawn individual actionable tickets from this for our upcoming maintenance work cycle.
Some broad categories that I think might be useful, with some starter ideas for things I know I'd like to improve:
Improvements to make development more pleasant
- I suspect that there is now more bifurcation than is necessary among classes that push to and audit our cloud archive storage. This made a bit more sense when we had to have a separate Resque Pool instance for each cloud endpoint, but now that we're using one consolidated Resque Pool instance, there may be an opportunity to collapse the very similar IBM and AWS classes into consolidated S3 classes.
- The `MoabValidationHandler` module. Everyone seems to find this difficult to work with, myself included sometimes, and I was the one who originally refactored code into this module. I don't want to copy this logic back out into the various consumers, but feedback seems to be that a more traditional inheritance relationship, or some other form of composition, would be more intelligible than the module mixin (see the first sketch after this list).
- `PreservedObjectHandler` (since renamed to `CompleteMoabHandler`). The most shameless of the shameless green code in this codebase, IMO. It'd be great to further decompose this into shorter and more intelligible methods, with less deeply nested conditional logic. **done!** (or at least much better decomposed by refactorings in Q4 2022) I also suspect `CompleteMoabHandler` would be a more appropriate name, considering what it's used for. **rename done!** It's initialized based on a druid and a storage root, which sounds very much like a specific instance on disk of a moab. While it does necessarily tie back to the parent preserved object, and while these object/record types are essentially synonymous in the current implementation, the `CompleteMoab` and the `PreservedObject` will no longer have a practical 1:1 relationship once we start allowing multiple copies of a given moab on different storage roots, which is a thing we plan to do in the upcoming work cycle (the DB schema already allows for many moabs per preserved object, but the app code doesn't yet support this -- see #1159, #1190, #1139).
- Move from Resque to Sidekiq? There seems to be a growing department/team preference for the latter. One reason we didn't use Sidekiq with preservation_robots was that some of its code used (uses?) class variables to cache, which is not threadsafe, and so not safe to use with Sidekiq. I don't think pres catalog has this problem, but since the same people were working on both at the same time, and since we knew Resque a bit better, we just went with Resque for both projects. **done!** see #1984
- Get rid of `ActiveRecordUtils.process_in_batches` (and the `c2m_sql_limit` setting, of which it's the only consumer). I think this method has outlived its usefulness. It's used in just one place, `Audit::Checksum.validate_status_root`. If we want to iterate over a large result set in batches while preserving order, we have now included the `postgresql_cursor` gem for that (see usage in `MoabStorageRootReporter`). If we don't care so much about order or transactionality, just use ActiveRecord's `#find_each` method. (Both approaches are sketched after this list.)
- Speaking of which, switch `Audit::Checksum.validate_status_root` to queue jobs for asynchronous processing of the moabs on a storage root. There are potentially hundreds of thousands of moabs on a given storage root, and I don't think processing them synchronously makes sense when we have a perfectly cromulent queuing system to manage that work. `.validate_status_root` as currently written is a holdover from when the scheduled validation jobs did their work synchronously instead of via queues (from before queues had been introduced to the app at all). We kept it in part because there was a ton of flux at the time queues were introduced, and it provided us with a comforting fallback that was already tested, in case the switch to queues for all things long-running didn't work out. We expose this functionality in the README, for manual consumption, but woe to the user who kicks off synchronous validation of an entire production storage root, especially if they aren't using a screen or tmux session. Every time I manually kick off validation of a whole storage root, I write a loop in Rails console to do the work via queues instead (see the last sketch after this list).
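To make the mixin-vs-composition point concrete, here's a minimal sketch; the class and method names are illustrative assumptions, not the app's actual API. Instead of `include MoabValidationHandler`, a consumer would hold an explicit validator object:

```ruby
# Hypothetical composition-based alternative to the MoabValidationHandler
# mixin (names are illustrative, not prescat's real API).
class MoabValidator
  def initialize(druid:, storage_root:)
    @druid = druid
    @storage_root = storage_root
  end

  # Returns an array of validation error messages (empty when the moab is valid).
  def validation_errors
    [] # ... run structural and checksum validations here ...
  end
end

# The consumer composes a validator rather than mixing its methods in,
# so the collaboration is explicit at the call site.
class SomeAuditService
  def initialize(druid:, storage_root:)
    @validator = MoabValidator.new(druid: druid, storage_root: storage_root)
  end

  def valid?
    @validator.validation_errors.empty?
  end
end
```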
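Here's a sketch of the two batch-iteration options from the `process_in_batches` bullet (the model name is assumed for illustration). ActiveRecord's `find_each` walks a large result set in batches but imposes primary-key order; `postgresql_cursor`'s `each_instance` streams rows through a server-side cursor and preserves the relation's own ordering:

```ruby
# Batch iteration with plain ActiveRecord: low memory use, but any
# custom ordering on the relation is ignored (batches go by primary key).
CompleteMoab.find_each(batch_size: 1_000) do |complete_moab|
  # ... validate or report on each moab ...
end

# Ordered iteration via the postgresql_cursor gem (the approach used in
# MoabStorageRootReporter): a server-side cursor preserves the ORDER BY.
CompleteMoab.order(:version).each_instance(block_size: 1_000) do |complete_moab|
  # ... process in a stable order ...
end
```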
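And the sort of Rails-console loop described in the last bullet, enqueueing one validation job per moab rather than validating a whole storage root synchronously (the job and association names are assumptions for illustration):

```ruby
# Enqueue asynchronous checksum validation for every moab on one storage root.
storage_root = MoabStorageRoot.find_by!(name: 'example_storage_root')
storage_root.complete_moabs.find_each do |complete_moab|
  ChecksumValidationJob.perform_later(complete_moab)
end
```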
Functional improvements

- A rake task for clearing zip records for improperly replicated moab versions, and attempting to re-push the moab to the cloud. Should probably include a scare "Are you sure?" confirmation. Should probably also do the courtesy of checking the S3 bucket for presence of zip parts and then balking if anything is there, since pres cat (by design) can't just overwrite things that are already pushed. But sometimes cloud uploads fail (e.g. due to network flakiness), and all that's required is DB record cleanup for something that hasn't gotten pushed at all and just needs to go through the replication queue again. My description here is pretty fuzzy, and I'm happy to flesh it out in a separate ticket with links to illustrative old issues if we decide to work on this bullet point. **done!** (if slightly differently than suggested here; see #1750 and #1733)

alright, that's enough from me for now.
Please do:
Thanks all!