sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects

Test ocrWF with various kinds of content #1276

Closed peetucket closed 3 months ago

peetucket commented 3 months ago

https://argo-qa.stanford.edu/catalog?f%5Btag_ssim%5D%5B%5D=remediation+%3A+examples

Either use the OCR text extraction button in Argo, or put the files somewhere temporary under /dor/staging/ and then run the integration tests from your laptop, temporarily overriding these values:

 ocr_bundle_directory: '/dor/staging/integration-tests/ocr-test'
 ocr_document_bundle_directory: '/dor/staging/integration-tests/ocr-test-document'

Test expectations will fail because different files are produced, but they can be commented out or ignored, and the tests used as an easy way to automate an accessioning run.

peetucket commented 3 months ago

From Andrew in slack:

I created some more sample data, this time a couple of real-world objects taken from prod that are likely to get the OCR re-run treatment to fix issues/omissions in the original data.

Looking at the file arrangement and structure, these might give us some trouble. For example, this item has an item-level XML file, but it’s a TEI file, not an OCR file. We would not want to overwrite it:
https://argo-qa.stanford.edu/view/druid:nd996cz2164

Since that file doesn’t have sdrGenerated, maybe it will be ok?

The other difference between these items and prod is they have the new Cocina fields since I put them through preassembly QA.

peetucket commented 3 months ago

Ran this one through ocrWF by using the "text extraction" modal in Argo: https://argo-qa.stanford.edu/view/druid:nd996cz2164

Looks like it added a full PDF, a full txt, and a page-level XML? Need to see what changed, perhaps by looking at preservation (I failed to document what it looked like before I hit the button).

peetucket commented 3 months ago

Ran this one through ocrWF by using the "text extraction" modal in Argo: https://argo-qa.stanford.edu/view/druid:bx975xv5837

Looks like it added a full PDF and a full txt. Page-level XML existed before; not sure if it was replaced. Need to see what changed, perhaps by looking at preservation?
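
One way to answer the "what changed" question for future runs might be to record a file listing before hitting the button and compare it with a listing taken afterwards. The sketch below is only an illustration: it assumes each listing is available as a filename-to-checksum mapping, and the function name `diff_file_listings`, the filenames, and the checksums are all hypothetical. How the listings are actually pulled from Argo or preservation is not shown here.

```python
def diff_file_listings(before, after):
    """Compare two {filename: checksum} mappings taken before and after a run."""
    added = sorted(name for name in after if name not in before)
    removed = sorted(name for name in before if name not in after)
    # Same filename, different checksum: the file was replaced by the run.
    replaced = sorted(
        name for name in before
        if name in after and before[name] != after[name]
    )
    unchanged = sorted(
        name for name in before
        if name in after and before[name] == after[name]
    )
    return {
        "added": added,
        "removed": removed,
        "replaced": replaced,
        "unchanged": unchanged,
    }


# Hypothetical listings; filenames and checksums are made up for illustration.
before = {"page_0001.jp2": "aaa", "page_0001.xml": "bbb"}
after = {
    "page_0001.jp2": "aaa",
    "page_0001.xml": "ccc",
    "item.pdf": "ddd",
    "item.txt": "eee",
}

report = diff_file_listings(before, after)
for category, names in report.items():
    print(f"{category}: {', '.join(names) or '-'}")
```

A checksum comparison like this would distinguish "page-level XML existed before and was replaced" from "existed before and was left alone", which is the open question for both druids above.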