sul-dlss / dor-services-app

A Rails application exposing Digital Object Registry functions as a RESTful HTTP API
https://sul-dlss.github.io/dor-services-app/

allow release workflow to run on items with multiple Project tags #3004

Closed: andrewjbtw closed this issue 2 years ago

andrewjbtw commented 3 years ago

The releaseWF is running into errors when an item has more than one Project tag. I attempted to run it on an item that had two Project tags (note, I've since removed one of the tags on this specific druid) and saw this error in the log:

2021-08-12 22:29:58 UTC Beginning ReleaseObjectJob for druid:hy473sz0152
2021-08-12 22:29:58 UTC Adding release tag for Searchworks
2021-08-12 22:29:59 UTC Release tag failed POST Dor::Services::Client::UnexpectedResponse Internal Server Error: 500 ({"status":500,"error":"Internal Server Error"})
2021-08-12 22:29:59 UTC Finished ReleaseObjectJob for BulkAction 15855

It looks like this is the corresponding HB error: https://app.honeybadger.io/projects/50568/faults/80794127

User impact

The digital serials remediation workflow has long involved adding the tag "Project : Digital Serials Remediation" to all items that have been modified to work with the digital serials model. If an item already has a Project tag (and many do, since many serials are deposited through H2/Hydrus), the digital serials tag is added alongside the existing one. Hundreds of items are already double-tagged like this and have been successfully released to SW.

Since the entire goal of the digital serials workflow is to make the SearchWorks display of digitized serials more user-friendly, if the releaseWF does not work for some digital serials, then that is a near-blocker for that workflow. I say near-blocker because one option we could recommend is to remove the "Project : H2" tag from now on when adding the digital serials project tag. However, we know that we eventually need to support multiple Project tags, not just for this workflow.

Additional information

The "Project : Digital Serials Remediation" tag exists so that metadata staff can easily identify items whose MODS have been modified for the digital serials model. This is necessary because our descriptive metadata refresh function overwrites the current MODS with MARC data, so digital serials need special handling during a metadata refresh. If we had another way of handling the digital serials model, one that did not leave metadata vulnerable to being overwritten this way, the tag might not be necessary. But that still wouldn't address other cases with two Project tags.

jcoyne commented 3 years ago

Two weeks ago @justinlittman and I added a check that raises an error when an object has more than one project tag (https://github.com/sul-dlss/dor-services-app/pull/2982). Prior to this, we had encountered a situation where an object with more than one project tag would send the system into an infinite loop, because the existing code expected only one project tag to be present. This infinite loop caused the system to crash.

I don't believe that we can work our way out of this assumption in a "first responder" level of change. This should be scheduled for a maintenance work cycle, as it will involve changes to the cocina data model and rewriting the tag code, which has some complicated multi-threaded logic at present. We also use the single project tag as a routing key for RabbitMQ so that H2 only needs to look at messages about objects originating in H2.

The expeditious way out of the current situation is to update the affected "Project : Digital Serials Remediation" objects to remove their "Project : H2" tag.
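The error condition described above (more than one Project tag on a single object) can be sketched roughly as follows. This is only an illustration under stated assumptions: the error class `MultipleProjectTagsError` and the method `single_project_name` are hypothetical names, and the actual check added in the linked pull request may be implemented quite differently.

```ruby
# Hypothetical names throughout; not the code from dor-services-app.
class MultipleProjectTagsError < StandardError; end

# Returns the single project name found in the tag list, or nil if there is
# none. Raises if more than one two-level "Project : name" tag is present.
def single_project_name(tags)
  projects = tags.select do |tag|
    tag.start_with?('Project : ') && tag.split(' : ').length == 2
  end

  if projects.length > 1
    raise MultipleProjectTagsError,
          "expected at most one Project tag, found #{projects.length}"
  end

  # split with a limit of 2 keeps everything after the first " : " together
  projects.first&.split(' : ', 2)&.last
end
```

Failing fast like this trades an infinite loop for a visible error, which is presumably why items carrying both "Project : H2" and "Project : Digital Serials Remediation" now produce a 500 response.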

andrewjbtw commented 3 years ago

Isn't the infinite loop the result of H2 going into production? We'd never seen it before until someone tried to version an object with two Project tags via H2. In which case, this is a regression brought in by a current workcycle that, unfortunately, is just about to end.

We can't solve this problem by removing the "Project : H2" tag from "Project : Digital Serials Remediation" items because "Project : H2" isn't the only Project tag found alongside the serials project tag. There are around 30 different Project tags associated with the serials project. It's a long-established workflow. https://argo.stanford.edu/catalog?f%5Bexploded_tag_ssim%5D%5B%5D=Project+%3A+Digital+Serials+Remediation

jcoyne commented 3 years ago

> Isn't the infinite loop the result of H2 going into production? We'd never seen it before until someone tried to version an object with two Project tags via H2. In which case, this is a regression brought in by a current workcycle that, unfortunately, is just about to end.

The infinite loop started when Amy applied the "Project : H2" tag in bulk to the migrated objects.

> We can't solve this problem by removing the "Project : H2" tag from "Project : Digital Serials Remediation" items because "Project : H2" isn't the only Project tag found alongside the serials project tag. There are around 30 different Project tags associated with the serials project. It's a long-established workflow. https://argo.stanford.edu/catalog?f%5Bexploded_tag_ssim%5D%5B%5D=Project+%3A+Digital+Serials+Remediation

That is helpful to know. Would it be possible to rename those tags to something besides "Project : ..."? Would it be possible to add a namespace (e.g. "Project : Digital Serials : ...")? Unfortunately, if we can't resolve this with a data manipulation, I don't foresee a way to resolve it during the current work cycle, given the short time remaining and people being on vacation or attending to other concerns (Mike on access team, LD4P planning, etc.).

andrewjbtw commented 3 years ago

Potentially, we could rename the serials tag itself. I can't think of another solution because the other tags are often the "primary" ones and the serials tag is secondary.

andrewjbtw commented 3 years ago

In terms of the infinite loop, I don't think the problem was the bulk tagging alone:

So I think the infinite loop comes from:

jmartin-sul commented 3 years ago

met with @andrewjbtw, @jcoyne, and @justinlittman this morning about this issue (corrections welcome if i got any of the below notes wrong).

cocina model currently assumes exactly one "project" (partOfProject field), persisted as a "Project : project name" tag in the DSA DB. but we believe that the only things that act on the project tag programmatically are

see:

prior to the beginning of our transition to cocina-models, the "project" tag was just another tag, and there was (almost) no special handling of the tag, other than a "Project" field in the argo registration form (data from which would be persisted as a "Project : name" style tag).

note also that only project tags of the style "Project : name" are treated specially. tags with more than that level of hierarchy are not seen as special by SDR (e.g. "Project : grouping : subgrouping" is treated like any other tag, not as a "project" tag).
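a rough sketch of the distinction described above, assuming tags are plain strings whose levels are separated by " : ". the helper names `project_tag?` and `project_names` are made up for illustration and are not taken from SDR code; the matching rules (exact "Project" prefix, case sensitivity) are assumptions as well.

```ruby
# Hypothetical helpers; not the actual SDR tag-handling code.

# True only for two-level tags of the form "Project : name".
def project_tag?(tag)
  parts = tag.split(' : ').map(&:strip)
  parts.length == 2 && parts.first == 'Project' && !parts.last.empty?
end

# Collects the name portion of every tag that counts as a project tag.
def project_names(tags)
  tags.select { |tag| project_tag?(tag) }
      .map { |tag| tag.split(' : ', 2).last.strip }
end

tags = [
  'Project : H2',
  'Project : Digital Serials Remediation',
  'Project : grouping : subgrouping', # deeper hierarchy: an ordinary tag
  'Some Other : tag'                  # made-up non-project tag
]

puts project_names(tags).inspect
```

with these assumptions, an item tagged like the example list has two project names ("H2" and "Digital Serials Remediation"), which is exactly the case a single-valued partOfProject field cannot represent.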

tentative proposal:

jmartin-sul commented 3 years ago

we also think this is likely too large a chunk of work to just have a first responder knock out.

andrewjbtw commented 2 years ago

We cover this now with the restored support for multiple Project tags.