sul-dlss / argo

The administrative discovery interface for Stanford's Digital Object Registry
Other
21 stars 5 forks source link

Descriptive spreadsheet import should not accept title1.structuredValue1.value without title*.structuredValue*.type #4279

Open andrewjbtw opened 12 months ago

andrewjbtw commented 12 months ago

Describe the bug Descriptive metadata spreadsheet import is allowing imports of spreadsheets where there is a value in title1.structuredValue1.value but no type in title1.structuredValue1.type. This is causing indexing and publishing failures.

Example spreadsheet to reproduce the issue on stage: descriptive-bx947ty9936.csv https://argo-stage.stanford.edu/view/druid:bx947ty9936

User Impact Users should be notified of metadata validation errors at the time of import. Without that, hidden errors cause problems later in accessioning including preventing items from being made accessible on the purl.

Expected behavior Validation should prevent invalid titles from being created.

ndushay commented 11 months ago

this may be because validation is turned off right now.

justinlittman commented 10 months ago

@andrewjbtw Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?

andrewjbtw commented 9 months ago

We turned on type validation and it's still possible to upload titles without a type. I just verified it by following the steps in the ticket.

andrewjbtw commented 9 months ago

I'm guessing that the reason type validation doesn't address this situation is because there's no type at all rather than an invalid type. I tried an invalid type and the single-item upload was able to catch that.

re:

Would it be acceptable to just make title generation more resilient to missing types, e.g., by assuming that a missing title type is "title"?

I'm hesitant to make assumptions about the type because cases where someone is providing a title type are likely to be more complex structured titles. So it could really be "main title" or "subtitle" or another option. It also doesn't look like plain title is a valid type.

lwrubel commented 9 months ago

This is addressed with https://github.com/sul-dlss/cocina-models/pull/675. I'll be bumping cocina-models and releasing, and then uploading a descriptive metadata spreadsheet to confirm validation happens as expected. I'll keep the ticket open until then.

lwrubel commented 9 months ago

I don't think the addition to cocina validation is catching this as expected in a spreadsheet upload. If I upload a spreadsheet with an empty title1.structuredValue1.type, the row seems to be making it past the CocinaValidator in https://github.com/sul-dlss/argo/blob/b58cfc42636a21d4ee52bfbd09bd68d3fe57630b/app/controllers/descriptives_controller.rb#L26

It appears to moves on to display_success and then errors in immediate reindexing: https://app.honeybadger.io/projects/49898/faults/103891642#notice-context

lwrubel commented 9 months ago

There's some background info in this Slack thread about the DescriptionValuesValidator not validating because of the hash keys type. When updating the validator and running against prod dor-services-app, there were 244 druids that were invalid: validate-cocina.csv

@arcadiafalcone, would you take a look at these druids?

arcadiafalcone commented 9 months ago

@lwrubel Could you rerun this validation report and include the FOLIO HRIDs? The first few I looked at were converted from MARC, which means this is at least partly a MARC/XSLT problem. Including the collection druid would also be helpful in grouping similar items together for remediation. Thanks!

lwrubel commented 9 months ago

@arcadiafalcone here's an updated CSV with the FOLIO HRIDs. Hope this helps! validate-cocina-with-hrids.csv

lwrubel commented 9 months ago

I see you also requested the collection druid, and I hadn't grabbed that. I'll still do that and send an update.

lwrubel commented 9 months ago

Thanks for your patience. Here's the report, with collection druids as well:

validate-cocina-hrids-colls-argo-4279.csv

arcadiafalcone commented 9 months ago

Thanks @lwrubel. Initial analysis is that 20 of these are Cocina-native, the rest were converted from MARC.

Six of the SDR ones are from H2 and have the same issue because the app is generating the same date range twice - once as a structuredValue and once as a single value using EDTF date range notation. We should have either one or the other of these. (cc @amyehodge) wr164nz3618 gx970nw8882 hj910dt9988 vs407gj3803 yn038px7249 yy114sp3047

I'll work on fixing the remaining 14 SDR items, and then go spelunking in MARC.

arcadiafalcone commented 9 months ago

For the SDR items, I was able to fix 7, and am contacting the content owners about fixing the other 7.

amyehodge commented 9 months ago

Thanks for the heads up @arcadiafalcone. Let me know if you need anything specific from me. Do we need to file a ticket to make any changes in H2?

lwrubel commented 9 months ago

@amyehodge Might be worth creating a ticket to make 100% sure that we are not still generating invalid dates like this. FWIW, I'm not sure that we are still creating values like these invalid ones when generating the cocina descriptive in H2. I couldn't find that in the DateGenerator code and couldn't generate a similar date value in QA: https://argo-qa.stanford.edu/view/yz466fs4308. But it's possible I missed something.

amyehodge commented 9 months ago

@arcadiafalcone All of these are items that were migrated over from hydrus. One of them is in the Hopkins Marine Station collection and is “owned” by Don Kohrs, who works in the library — this one I last modified to remediate the date. The other 5 are in the Mapping the Republic of Letters collection that is managed by Nicole Coleman. All 5 of these were also migrated from hydrus and also all have the last update in H2 that includes a change to the dates.

Ideally, we'd want the changes to happen in H2 and not just via Argo, so that H2 has the correct metadata.

arcadiafalcone commented 9 months ago

@lwrubel @amyehodge From my memory of the process it makes sense that this would be an artifact of migration and not an issue for new records. Amy, is there a way to redeposit these items from H2 to SDR? I suspect the issue is in generating the Cocina rather than in what is stored internally in H2, so that may be enough to fix the items.

amyehodge commented 9 months ago

@arcadiafalcone All of these can be redeposited in H2, but they all were redeposited with changes to their dates via H2 between Sep 15 and Sep 28, 2022, so I'm not clear that it will help? But I could try with one of them to see perhaps.

amyehodge commented 9 months ago

@arcadiafalcone I've redeposited wr164nz3618 without making any changes. Please check to see if that fixed it. Thanks.

arcadiafalcone commented 9 months ago

@amyehodge It looks like that worked!

amyehodge commented 9 months ago

Oh, sweet. Okay. I'll do the others as well.

amyehodge commented 9 months ago

@arcadiafalcone These have all been redeposited. Please let me know if these all went through okay. When I'm sure we're done messing with them, I'll email the content owners to let them know.

arcadiafalcone commented 9 months ago

@amyehodge Looks like they all updated successfully!

amyehodge commented 9 months ago

Users have been notified of the changes for these 6 items.

arcadiafalcone commented 8 months ago

@lwrubel @andrewjbtw 138 of these items are Japanese military maps with records coming from MARC. I registered one object in stage with the HRID and the cocina looks OK for that item. What I suspect is that there was a bug in the MODS-cocina transformation at the time these records were created that has since been fixed.

Unfortunately these are digital serials records so I can't just refresh them all from source without redoing all the digital serials labeling. Is there a way to take the existing MODS (which is correct and has the labels) and use it to regenerate the cocina? It would also be helpful to convert the existing MODS created with an old XSLT to cocina for a sample record? This would allow me to confirm that it's not the old MARC2MODS XSLT that is the issue (which I highly doubt because this is error is not possible to do in MODS, but any weirdness is possible with XSLT).

Sample item: wr904cm7589 (and hj735vc4271 on argo-test)

Note to myself because this is complicated: Production: MARC converted to MODS 3.4 with old XSLT, labels added, MODS converted to cocina probably at migration and not updated since Test: MARC converted to MODS 3.7 with current XSLT, MODS converted to cocina with current code