Closed by peetucket 3 months ago
From Andrew in Slack:
I created some more sample data, this time a couple of real-world objects taken from prod that are likely to get the OCR re-run treatment to fix issues/omissions in the original data.
Looking at the file arrangement and structure, these might give us some trouble. For example, this item has an item-level XML file, but it’s a TEI file, not an OCR file. We would not want to overwrite it:
https://argo-qa.stanford.edu/view/druid:nd996cz2164
Since that file doesn’t have sdrGenerated, maybe it will be ok?
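One way to sanity-check this before re-running OCR is to list which files on the object actually carry the flag. This is only a rough sketch: it assumes the Cocina JSON nests files under structural.contains[].structural.contains[], and that the flag is a boolean property on each file named sdrGenerated as written above (the real property name may differ). The filenames in the sample are made up.

```python
def flagged_files(cocina, flag="sdrGenerated"):
    """Return filenames whose file entry carries a truthy flag.

    `flag` is an assumed property name taken from the discussion above;
    adjust it to whatever the Cocina model actually uses.
    """
    names = []
    for fileset in cocina.get("structural", {}).get("contains", []):
        for file_entry in fileset.get("structural", {}).get("contains", []):
            if file_entry.get(flag):
                names.append(file_entry.get("filename"))
    return names


# Hypothetical object: an item-level TEI file without the flag and a
# page-level OCR file with it.
sample = {
    "structural": {
        "contains": [
            {"structural": {"contains": [
                {"filename": "item_tei.xml"},
                {"filename": "page_0001.xml", "sdrGenerated": True},
            ]}},
        ]
    }
}

print(flagged_files(sample))
```

If the TEI file does not show up in that list, then a process that only replaces flagged files should leave it alone.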
The other difference between these items and prod is they have the new Cocina fields since I put them through preassembly QA.
Ran this one through ocrWF by using the "text extraction" modal in Argo: https://argo-qa.stanford.edu/view/druid:nd996cz2164
Looks like it added a full PDF, a full txt, and a page-level XML? Need to see what changed, perhaps by looking at preservation (I failed to document what it looked like before I hit the button).
Ran this one through ocrWF by using the "text extraction" modal in Argo: https://argo-qa.stanford.edu/view/druid:bx975xv5837
Looks like it added a full PDF and a full txt. Page-level XML existed before; not sure if it was replaced. Need to see what changed, perhaps by looking at preservation?
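For the "what changed" question, a simple approach is to compare a filename-to-checksum listing from the version before the OCR run with one from the version after. The sketch below assumes you have already pulled those two listings from wherever is convenient (preservation, for example); how they are fetched is not shown, and the filenames and checksums are invented for illustration.

```python
def diff_listings(before, after):
    """Compare two {filename: checksum} mappings.

    Returns which files were added, removed, or replaced (same name,
    different checksum) between the two versions.
    """
    before_names = set(before)
    after_names = set(after)
    return {
        "added": sorted(after_names - before_names),
        "removed": sorted(before_names - after_names),
        "replaced": sorted(
            name for name in before_names & after_names
            if before[name] != after[name]
        ),
    }


# Hypothetical listings for one object before and after ocrWF.
before = {"page_0001.jp2": "aaa", "page_0001.xml": "bbb"}
after = {
    "page_0001.jp2": "aaa",
    "page_0001.xml": "ccc",   # same name, new checksum
    "full.pdf": "ddd",
    "full.txt": "eee",
}

print(diff_listings(before, after))
```

A page-level XML that appears under "replaced" was overwritten by the run; one that appears in neither list was left untouched.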
https://argo-qa.stanford.edu/catalog?f%5Btag_ssim%5D%5B%5D=remediation+%3A+examples
Use the OCR text extraction button in Argo, or put the files somewhere temporary on /dor/staging/ and then use the integration tests on your laptop by temporarily overriding these values:
Test expectations will fail because different files are produced, but they can be commented out or ignored, and the tests used as an easy way to automate an accessioning run.