Open andrewjbtw opened 12 months ago
this may be because validation is turned off right now.
@andrewjbtw Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?
We turned on type validation and it's still possible to upload titles without a type. I just verified it by following the steps in the ticket.
I'm guessing that the reason type validation doesn't address this situation is because there's no type at all rather than an invalid type. I tried an invalid type and the single-item upload was able to catch that.
re:
Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?
I'm hesitant to make assumptions about the type because cases where someone is providing a title type are likely to be more complex structured titles. So it could really be "main title" or "subtitle" or another option. It also doesn't look like plain title
is a valid type.
This is addressed with https://github.com/sul-dlss/cocina-models/pull/675. I'll be bumping cocina-models and releasing, and then uploading a descriptive metadata spreadsheet to confirm validation happens as expected. I'll keep the ticket open until then.
I don't think the addition to cocina validation is catching this as expected in a spreadsheet upload. If I upload a spreadsheet with an empty title1.structuredValue1.type
, the row seems to be making it past the CocinaValidator
in https://github.com/sul-dlss/argo/blob/b58cfc42636a21d4ee52bfbd09bd68d3fe57630b/app/controllers/descriptives_controller.rb#L26
It appears to moves on to display_success
and then errors in immediate reindexing: https://app.honeybadger.io/projects/49898/faults/103891642#notice-context
There's some background info in this Slack thread about the DescriptionValuesValidator
not validating because of the hash keys type. When updating the validator and running against prod dor-services-app, there were 244 druids that were invalid:
validate-cocina.csv
@arcadiafalcone, would you take a look at these druids?
@lwrubel Could you rerun this validation report and include the FOLIO HRIDs? The first few I looked at were converted from MARC, which means this is at least partly a MARC/XSLT problem. Including the collection druid would also be helpful in grouping similar items together for remediation. Thanks!
@arcadiafalcone here's an updated CSV with the FOLIO HRIDs. Hope this helps! validate-cocina-with-hrids.csv
I see you also requested the collection druid, and I hadn't grabbed that. I'll still do that and send an update.
Thanks for your patience. Here's the report, with collection druids as well:
Thanks @lwrubel. Initial analysis is that 20 of these are Cocina-native, the rest were converted from MARC.
Six of the SDR ones are from H2 and have the same issue because the app is generating the same date range twice - once as a structuredValue and once as a single value using EDTF date range notation. We should have either one or the other of these. (cc @amyehodge) wr164nz3618 gx970nw8882 hj910dt9988 vs407gj3803 yn038px7249 yy114sp3047
I'll work on fixing the remaining 14 SDR items, and then go spelunking in MARC.
For the SDR items, I was able to fix 7, and am contacting the content owners about fixing the other 7.
Thanks for the heads up @arcadiafalcone. Let me know if you need anything specific from me. Do we need to file a ticket to make any changes in H2?
@amyehodge Might be worth creating a ticket to make 100% sure that we are not still generating invalid dates like this. FWIW, I'm not sure that we are still creating values like these invalid ones when generating the cocina descriptive in H2. I couldn't find that in the DateGenerator
code and couldn't generate a similar date value in QA: https://argo-qa.stanford.edu/view/yz466fs4308. But it's possible I missed something.
@arcadiafalcone All of these are items that were migrated over from hydrus. One of them is in the Hopkins Marine Station collection and is “owned” by Don Kohrs, who works in the library — this one I last modified to remediate the date. The other 5 are in the Mapping the Republic of Letters collection that is managed by Nicole Coleman. All 5 of these were also migrated from hydrus and also all have the last update in H2 that includes a change to the dates.
Ideally, we'd want the changes to happen in H2 and not just via Argo, so that H2 has the correct metadata.
@lwrubel @amyehodge From my memory of the process it makes sense that this would be an artifact of migration and not an issue for new records. Amy, is there a way to redeposit these items from H2 to SDR? I suspect the issue is in generating the Cocina rather than in what is stored internally in H2, so that may be enough to fix the items.
@arcadiafalcone All of these can be redeposited in H2, but they all were redeposited with changes to their dates via H2 between Sep 15 and Sep 28, 2022, so I'm not clear that it will help? But I could try with one of them to see perhaps.
@arcadiafalcone I've redeposited wr164nz3618 without making any changes. Please check to see if that fixed it. Thanks.
@amyehodge It looks like that worked!
Oh, sweet. Okay. I'll do the others as well.
@arcadiafalcone These have all been redeposited. Please let me know if these all went through okay. When I'm sure we're done messing with them, I'll email the content owners to let them know.
@amyehodge Looks like they all updated successfully!
Users have been notified of the changes for these 6 items.
@lwrubel @andrewjbtw 138 of these items are Japanese military maps with records coming from MARC. I registered one object in stage with the HRID and the cocina looks OK for that item. What I suspect is that there was a bug in the MODS-cocina transformation at the time these records were created that has since been fixed.
Unfortunately these are digital serials records so I can't just refresh them all from source without redoing all the digital serials labeling. Is there a way to take the existing MODS (which is correct and has the labels) and use it to regenerate the cocina? It would also be helpful to convert the existing MODS created with an old XSLT to cocina for a sample record? This would allow me to confirm that it's not the old MARC2MODS XSLT that is the issue (which I highly doubt because this is error is not possible to do in MODS, but any weirdness is possible with XSLT).
Sample item: wr904cm7589 (and hj735vc4271 on argo-test)
Note to myself because this is complicated: Production: MARC converted to MODS 3.4 with old XSLT, labels added, MODS converted to cocina probably at migration and not updated since Test: MARC converted to MODS 3.7 with current XSLT, MODS converted to cocina with current code
Describe the bug Descriptive metadata spreadsheet import is allowing imports of spreadsheets where there is a value in
title1.structuredValue1.value
but no type intitle1.structuredValue1.type
. This is causing indexing and publishing failures.Example spreadsheet to reproduce the issue on stage: descriptive-bx947ty9936.csv https://argo-stage.stanford.edu/view/druid:bx947ty9936
User Impact Users should be notified of metadata validation errors at the time of import. Without that, hidden errors cause problems later in accessioning including preventing items from being made accessible on the purl.
Expected behavior Validation should prevent invalid titles from being created.