printprobability / qa-workflow

Quality Assurance testing for the Print & Probability book processing and ingestion pipeline
MIT License

Merge Stats Refactor and Error Processing Fixes #21

Open jarmoza opened 9 months ago

jarmoza commented 9 months ago

Mid last week, I came across some errors in the stats output and error processing code on the qa_workflow side.

Specifically, some code needed to be adapted to account for the difference between how autocrop QA handles output from the autocropper and how line extraction QA handles output from the watershed line extraction. For watershed line extraction QA, the workflow is: 'clear', 'run', 'collate' (specifically, 'collate_errors'), and then 'output_stats'. The last two steps are reversed in autocropping QA. Revised error handling and a new stats merging function were necessary to produce the outputs we actually want from line extraction QA. Part of this work also includes readying line extraction QA (qa_line_extraction.py) for the upcoming new line extraction method.
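
To make the ordering difference concrete, here is a minimal sketch of the two step sequences. It is not the actual qa_workflow code: `run_workflow`, the step lists, and the stub handlers are hypothetical names used only for illustration. Only the step names and their order come from the description above.

```python
from typing import Callable, Dict, List

# Step order for watershed line extraction QA, as described above.
LINE_EXTRACTION_STEPS: List[str] = ["clear", "run", "collate", "output_stats"]

# Autocropping QA reverses the last two steps.
AUTOCROP_STEPS: List[str] = ["clear", "run", "output_stats", "collate"]


def run_workflow(steps: List[str], handlers: Dict[str, Callable[[], None]]) -> List[str]:
    """Run each named step in order and return the steps that were executed.

    Hypothetical helper: the real workflow may dispatch its steps differently.
    """
    executed: List[str] = []
    for step in steps:
        handlers[step]()  # raises KeyError if a step has no handler
        executed.append(step)
    return executed


# Stub handlers that only record that they were called (assumption: the
# real handlers would clear outputs, run the QA, collate errors, etc.).
call_log: List[str] = []
stub_handlers: Dict[str, Callable[[], None]] = {
    name: (lambda n=name: call_log.append(n))
    for name in ("clear", "run", "collate", "output_stats")
}

line_extraction_order = run_workflow(LINE_EXTRACTION_STEPS, stub_handlers)
autocrop_order = run_workflow(AUTOCROP_STEPS, stub_handlers)
```

Keeping the order as data rather than hard-coding it in each QA script is one way to let both QA types share the same step implementations while differing only in sequence.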

Below is a checklist of the work that is necessary/has been done for this so far.

jarmoza commented 9 months ago

Just to update here: total_errors appears to be inaccurate for the book level stats – and likely unique_errors follows suit.

Needs investigation.
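
One way to start that investigation is to recompute the book-level numbers directly from page-level errors and compare them with what the stats output reports. The sketch below is an assumption-heavy illustration, not the project's merge function: `merge_page_stats` is a hypothetical name, errors are assumed to be plain strings, and `unique_errors` is assumed to mean the number of distinct error types across the whole book. Only the field names `total_errors` and `unique_errors` come from this thread.

```python
from collections import Counter
from typing import Dict, Iterable, List


def merge_page_stats(pages: Iterable[List[str]]) -> Dict[str, int]:
    """Merge per-page error lists into book-level counts.

    Hypothetical helper; assumes each page is a list of error labels.
    """
    counts: Counter = Counter()
    for page_errors in pages:
        counts.update(page_errors)
    return {
        # Every error occurrence on every page counts toward the total.
        "total_errors": sum(counts.values()),
        # Distinct labels are counted once per book, not once per page,
        # so this cannot be obtained by summing page-level unique counts.
        "unique_errors": len(counts),
    }


# Made-up example data for illustration only.
example_pages = [["missing_line", "empty_output"], ["missing_line"], []]
book_stats = merge_page_stats(example_pages)
```

If the reported book-level `unique_errors` equals the sum of the page-level unique counts rather than the size of the union, that would be one possible source of the discrepancy.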

jarmoza commented 8 months ago

Blocked until new line extraction method, eynollah, is integrated into QA line extraction code.

jarmoza commented 7 months ago

> Just to update here: total_errors appears to be inaccurate for the book level stats – and likely unique_errors follows suit.
>
> Needs investigation.

We reviewed the reported errors in a meeting soon after this comment. As I recall, most of them could be disregarded as warnings or will-not-fix errors coming from the watershed code.