sul-dlss / common-accessioning

Suite of robots that handle the tasks of accessioning digital objects
Other
2 stars 1 forks source link

Ensure we replace files correctly #1282

Closed peetucket closed 3 weeks ago

peetucket commented 4 weeks ago

Why was this change made? 🤔

(hopefully) fixes #1281 -- when we generate new OCR files, we need to be sure they are replaced in cocina

How was this change tested? 🤨

Updated specs

System testing:

  1. Created an OCR item with integration test.
  2. Used Preassembly with a file manifest downloaded from Argo to update one of the TIFF files.
  3. Ran Text Extraction from Argo.
  4. Confirmed the OCR XML was updated.
peetucket commented 3 weeks ago

I like the changes you made to the logic too, it makes it clearer what is happening with updates vs adds

edsu commented 3 weeks ago

The unit tests for file update pass for me, but I've not seen the file update work properly in QA yet.

Initially this faillure was because the file_manifest.csv I was using to update the Item I had created did not have a column for sdrGeneratedText, so Preassembly was setting it to false which caused the initial OCR XML not to get overwritten.

Even with a fix deployed to Argo to include that sdr_generated_text column in the file_manifest.csv the value for sdrGeneratedText appears to be still be getting set to false by Preassembly. Perhaps there is something missing from Preassembly to set read and use it?

I've left some notes on how to replicate the problem here: https://github.com/sul-dlss/argo/pull/4475#pullrequestreview-2111518346

peetucket commented 3 weeks ago

The unit tests for file update pass for me, but I've not seen the file update work properly in QA yet.

Initially this faillure was because the file_manifest.csv I was using to update the Item I had created did not have a column for sdrGeneratedText, so Preassembly was setting it to false which caused the initial OCR XML not to get overwritten.

Even with a fix deployed to Argo to include that sdr_generated_text column in the file_manifest.csv the value for sdrGeneratedText appears to be still be getting set to false by Preassembly. Perhaps there is something missing from Preassembly to set read and use it?

I've left some notes on how to replicate the problem here: sul-dlss/argo#4475 (review)

The unit tests for file update pass for me, but I've not seen the file update work properly in QA yet.

Initially this faillure was because the file_manifest.csv I was using to update the Item I had created did not have a column for sdrGeneratedText, so Preassembly was setting it to false which caused the initial OCR XML not to get overwritten.

Even with a fix deployed to Argo to include that sdr_generated_text column in the file_manifest.csv the value for sdrGeneratedText appears to be still be getting set to false by Preassembly. Perhaps there is something missing from Preassembly to set read and use it?

I've left some notes on how to replicate the problem here: sul-dlss/argo#4475 (review)

I think this may fix this problem if I am understanding it correctly: https://github.com/sul-dlss/pre-assembly/pull/1478

Suggest deploying that branch of pre-assembly and trying again. I am not sure we ever thought about using file manifests in pre-assembly with OCR type content since file manifest are almost always used for media content.

edsu commented 3 weeks ago

I think this may fix this problem if I am understanding it correctly: sul-dlss/pre-assembly#1478

Suggest deploying that branch of pre-assembly and trying again. I am not sure we ever thought about using file manifests in pre-assembly with OCR type content since file manifest are almost always used for media content.

Yes, that fixed it!