Great write-up! Thanks, @jmartin-sul. MoabValidationHandler and PreservedObjectHandler are my big ones. Relevant: https://github.com/sul-dlss/preservation_catalog/pull/1277
one of the things I'm wondering about: I think the problem with refactoring MVH and POH is that they are so ... big. Would it be more tractable to either do something smaller, or start with the smallest subset of stuff in the new class(es) and gradually move more and more over?
i think so, or at least for POH (i don't think MVH is a ton of code; it just has interactions with consumers that are sometimes surprising, which makes refactoring and using it a pain).
but yeah, for POH, i anticipate something where we start chipping away at obvious refactoring opportunities in individual methods, as opposed to a grand plan for re-arranging it all at once. that feels both easier and less regression-prone.
but also, considering the upcoming ManyCats work, i think literally just renaming `PreservedObjectHandler` to `CompleteMoabHandler` would be a nice start, because i think the latter name would be more accurate. an annoying but mechanical bunch of find/replace work.
and i think annoying but mechanical renamings have a good track record of making this codebase more intelligible (e.g. the class naming discussions we had toward the end of the original work cycle -- i'm glad we did that work when we did).
@jmartin-sul Given what @justinlittman found in google-books yesterday re: `Dir.chdir` not being thread-safe, we will need to change some code in prescat before we can make the jump from Resque to Sidekiq (assuming we'll be running Sidekiq in its default multi-threaded mode; it defaults to six threads per process).
`DruidVersionZip#create_zip!` uses `Dir.chdir`, and this method is invoked in a background job. The fix is a tiny patch (HT @justinlittman: [sul-dlss/google-books#529](https://github.com/sul-dlss/google-books/pull/529)). From:
```ruby
Dir.chdir(work_dir.to_s) do
  combined, status = Open3.capture2e(zip_command)
  raise "zipmaker failure #{combined}" unless status.success?
end
```
to:
```ruby
combined, status = Open3.capture2e(zip_command, chdir: work_dir.to_s)
raise "zipmaker failure #{combined}" unless status.success?
```
I scanned sul-dlss for `Dir.chdir` and found not much at all. Lots of hits, to be sure, but they're mostly in binstubs, gemspecs, tests, and scripts. So other than this part of prescat, I do not foresee any other thread-safety-related surprises stemming from `Dir.chdir` as we make the move to multi-threaded Sidekiq across the board. cc: @sul-dlss/infrastructure-team
i like that patch. thanks for the research, @mjgiarlo! filed a specific ticket for this: #1519
@jmartin-sul can this EPIC be closed? It's over 2 years old.
@jmartin-sul I'm closing this EPIC that is over 2 years old.
thanks! seems reasonable. also very happy to see how much of this we ended up getting done!
Here's a meta ticket to start collecting suggestions for making Preservation Catalog easier to develop, more robust, more usable, etc. We can spawn individual actionable tickets from this for our upcoming maintenance work cycle.
Some broad categories that I think might be useful, with some starter ideas for things I know I'd like to improve:
Improvements to make development more pleasant
- I suspect that there is now more bifurcation than is necessary among classes that push to and audit our cloud archive storage. This made a bit more sense when we had to have a separate Resque Pool instance for each cloud endpoint, but now that we're using one consolidated Resque Pool instance, there may be an opportunity to collapse the very similar IBM and AWS classes into consolidated S3 classes.
- The `MoabValidationHandler` module. Everyone seems to find this difficult to work with, myself included sometimes, and I was the one who originally refactored code into this module. I don't want to copy this logic back out into the various consumers, but feedback seems to be that a more traditional inheritance relationship, or some other form of composition, would be more intelligible than the module mixin (see the first sketch after this list).
- `PreservedObjectHandler` (since renamed to `CompleteMoabHandler`). The most shameless of the shameless green code in this codebase, IMO. It'd be great to further decompose this into shorter and more intelligible methods, with less deeply nested conditional logic. **done!** (or at least much better decomposed by refactorings in Q4 2022) I also suspect `CompleteMoabHandler` would be a more appropriate name, considering what it's used for. **rename done!** It's initialized based on a druid and a storage root, which sounds very much like a specific instance on disk of a moab. While it does necessarily tie back to the parent preserved object, and while these object/record types are essentially synonymous in the current implementation, the `CompleteMoab` and the `PreservedObject` will no longer have a practical 1:1 relationship once we start allowing multiple copies of a given moab on different storage roots, which is a thing we plan to do in the upcoming work cycle (the DB schema already allows for many moabs per preserved object, but the app code doesn't yet support this -- see #1159, #1190, #1139).
- Move from Resque to Sidekiq? There seems to be a growing department/team preference for the latter. One reason we didn't use Sidekiq with preservation_robots was that some of its code used (uses?) class variables to cache, which is not threadsafe, and so not safe to use with Sidekiq. I don't think pres catalog has this problem, but since the same people were working on both at the same time, and since we knew Resque a bit better, we just went with Resque for both projects. **done!** see #1984
- Get rid of `ActiveRecordUtils.process_in_batches` (and the `c2m_sql_limit` setting, of which it's the only consumer). I think this method has outlived its usefulness. It's used in just one place, `Audit::Checksum.validate_status_root`. If we want to iterate over a large result set in batches while preserving order, we have now included the `postgresql_cursor` gem for that (see usage in `MoabStorageRootReporter`). If we don't care so much about order or transactionality, just use ActiveRecord's `#find_each` method. (Both approaches are sketched after this list.)
- Speaking of which, switch `Audit::Checksum.validate_status_root` to queue jobs for asynchronous processing of the moabs on a storage root. There are potentially hundreds of thousands of moabs on a given storage root, and I don't think processing them synchronously makes sense when we have a perfectly cromulent queuing system to manage that work. `.validate_status_root` as currently written is a holdover from when the scheduled validation jobs did their work synchronously instead of via queues (from before queues had been introduced to the app at all). We kept it in part because there was a ton of flux at the time queues were introduced, and it provided us with a comforting fallback that was already tested, in case the switch to queues for all things long-running didn't work out. We expose this functionality in the README, for manual consumption, but woe to the user who kicks off synchronous validation of an entire production storage root, especially if they aren't using a screen or tmux session. Every time I manually kick off validation of a whole storage root, I write a loop in Rails console to do the work via queues instead (see the last sketch after this list).
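To make the mixin-vs-composition point concrete, here's a minimal sketch; the class and method names are illustrative assumptions, not the app's actual API. Instead of `include MoabValidationHandler`, a consumer would hold an explicit validator object:

```ruby
# Hypothetical composition-based alternative to the MoabValidationHandler
# mixin (names are illustrative, not prescat's real API).
class MoabValidator
  def initialize(druid:, storage_root:)
    @druid = druid
    @storage_root = storage_root
  end

  # Returns an array of validation error messages (empty when the moab is valid).
  def validation_errors
    [] # ... run structural and checksum validations here ...
  end
end

# The consumer composes a validator rather than mixing its methods in,
# so the collaboration is explicit at the call site.
class SomeAuditService
  def initialize(druid:, storage_root:)
    @validator = MoabValidator.new(druid: druid, storage_root: storage_root)
  end

  def valid?
    @validator.validation_errors.empty?
  end
end
```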
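Here's a sketch of the two batch-iteration options from the `process_in_batches` bullet (the model name is assumed for illustration). ActiveRecord's `find_each` walks a large result set in batches but imposes primary-key order; `postgresql_cursor`'s `each_instance` streams rows through a server-side cursor and preserves the relation's own ordering:

```ruby
# Batch iteration with plain ActiveRecord: low memory use, but any
# custom ordering on the relation is ignored (batches go by primary key).
CompleteMoab.find_each(batch_size: 1_000) do |complete_moab|
  # ... validate or report on each moab ...
end

# Ordered iteration via the postgresql_cursor gem (the approach used in
# MoabStorageRootReporter): a server-side cursor preserves the ORDER BY.
CompleteMoab.order(:version).each_instance(block_size: 1_000) do |complete_moab|
  # ... process in a stable order ...
end
```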
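And the sort of Rails-console loop described in the last bullet, enqueueing one validation job per moab rather than validating a whole storage root synchronously (the job and association names are assumptions for illustration):

```ruby
# Enqueue asynchronous checksum validation for every moab on one storage root.
storage_root = MoabStorageRoot.find_by!(name: 'example_storage_root')
storage_root.complete_moabs.find_each do |complete_moab|
  ChecksumValidationJob.perform_later(complete_moab)
end
```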
Functional improvements

- A rake task for clearing zip records for improperly replicated moab versions, and attempting to re-push the moab to the cloud. Should probably include a scare "Are you sure?" confirmation. Should probably also do the courtesy of checking the S3 bucket for presence of zip parts and then balking if anything is there, since pres cat (by design) can't just overwrite things that are already pushed. But sometimes cloud uploads fail (e.g. due to network flakiness), and all that's required is DB record cleanup for something that hasn't gotten pushed at all and just needs to go through the replication queue again. My description here is pretty fuzzy, and I'm happy to flesh it out in a separate ticket with links to illustrative old issues if we decide to work on this bullet point. **done!** (if slightly differently than suggested here; see #1750 and #1733)

alright, that's enough from me for now.
Please do:
Thanks all!