Merge Stats Refactor and Error Processing Fixes

jarmoza commented 9 months ago

Mid last week, I came across some errors in the stats output and error processing code on the qa_workflow side.

Specifically, the code had been written around how autocrop QA handles output from the autocropper, and it needed to be adapted to how line extraction QA handles output from watershed line extraction. For watershed line extraction QA the workflow is: 'clear', 'run', 'collate' (specifically, 'collate_errors'), and then 'output_stats'. The latter two steps are reversed in autocropping QA. Revised error handling and a new stats merging function were necessary for the outputs we actually want from line extraction QA. Part of this work also includes readying line extraction QA (qa_line_extraction.py) for the upcoming new line extraction method.
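To make the ordering difference concrete, here is a minimal sketch of the two step sequences. Only the step names come from the description above; the list constants, the `run_workflow` helper, and the idea of dispatching through a dict of handlers are hypothetical and are not taken from the qa_workflow code.

```python
# Step names are from the description above; everything else is illustrative.
LINE_EXTRACTION_QA_STEPS = ["clear", "run", "collate_errors", "output_stats"]

# Autocropping QA runs the last two steps in the opposite order.
AUTOCROP_QA_STEPS = ["clear", "run", "output_stats", "collate"]


def run_workflow(steps, handlers):
    """Call each step's handler in order and return the names that ran."""
    completed = []
    for name in steps:
        handlers[name]()  # hypothetical: one zero-argument callable per step
        completed.append(name)
    return completed
```

With this shape, line extraction QA always has the collated errors available by the time 'output_stats' runs, which is what the stats merging below relies on.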

Below is a checklist of the work that is necessary/has been done for this so far.

[X] Refactor helper functions for line extraction QA's output_stats for computing metrics and tallying stats across books and pages.
[X] Make sure this functionality is separated out as code for watershed line extraction in preparation for the new line extraction method
[X] Make sure error files are being properly written out by run_dhsegment.py and watershed_line_extraction.py
- [X] Change their write location to a new, common folder like 'qa_results' inside the input book directory
  NOTES:
  dhsegment error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
  watershed error source: tbraddyll_R4267_duke_8_essaytoheraldry1684
  dhsegment error output: le_dhsegmenterrors.txt in /qa_results
  watershed error output: le_watershederrors.txt in /qa_results
[X] Make sure error files per book are being merged from two files into one file
- [X] Make sure error files are all read from the new 'qa_results' folder inside the input book directory
[X] Write new function __correlate_errors_watershed_merge_all() in qa_line_extraction.py to put all errors into one csv file
- [X] This should be called before output_stats() - just as correlate_errors() is already called before output_stats()
[X] Reuse the necessary code from the collate_results functions in the merge stats function, then remove collate_results and its helpers
[ ] Make sure errors are properly tallied for book and run level results files
  NOTES:
  anon_R11260_wellcome_4_generalhistoryair1692 has no stats file - it errors out during watershed
  tbraddyll has significant traceback errors
[X] Take outputs - book level results, run level summary results, and errors - and put them into one Google Sheets file
[X] To temporarily coordinate runs of the QA script between 'collate_errors' and 'output_stats', a new config yaml variable, ERROR_RUN_UUID, was added for use in the __tally_stats_for_book_watershed function. It lets the script find the appropriate error file in the qa_results directory output by watershed line extraction
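
As a rough illustration of the error merging items in the checklist, the sketch below reads the two per-book error files from 'qa_results' and writes every book's errors to one csv file. The two file names come from the notes above. Everything else is an assumption: the function names, the one-error-per-line file format, and the csv columns are hypothetical, and this is not the actual code in qa_line_extraction.py.

```python
import csv
from pathlib import Path

# File names are from the notes above; one error per line is an assumption.
ERROR_FILES = {
    "dhsegment": "le_dhsegmenterrors.txt",
    "watershed": "le_watershederrors.txt",
}


def read_book_errors(book_dir):
    """Return (book, source, message) rows from a book's qa_results folder."""
    book_dir = Path(book_dir)
    rows = []
    for source, filename in ERROR_FILES.items():
        path = book_dir / "qa_results" / filename
        if not path.is_file():
            continue  # a book may have errors from only one stage, or none
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip():
                rows.append((book_dir.name, source, line.strip()))
    return rows


def merge_all_errors(book_dirs, out_csv):
    """Write every book's errors to one csv file and return the row count."""
    rows = [row for book_dir in book_dirs for row in read_book_errors(book_dir)]
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["book", "source", "message"])  # hypothetical columns
        writer.writerows(rows)
    return len(rows)
```

Keeping the source stage as its own column means the merged file can still be filtered by dhsegment or watershed after the two per-book files are combined.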

jarmoza commented 9 months ago

Just to update here: total_errors appears to be inaccurate for the book level stats – and likely unique_errors follows suit.

Needs investigation.
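
One way to sanity-check both numbers independently of the QA script is sketched below. It assumes each error is available as a message string and that "unique" means distinct message text; both are assumptions, since the actual definitions used for total_errors and unique_errors are not shown in this thread. The `tally_errors` name is hypothetical.

```python
from collections import Counter


def tally_errors(messages):
    """Count all non-blank error messages and how many distinct ones exist."""
    counts = Counter(m.strip() for m in messages if m.strip())
    return {
        "total_errors": sum(counts.values()),  # every occurrence
        "unique_errors": len(counts),          # distinct message text (assumed)
    }
```

Comparing these values against the book level stats for a single book should show whether the discrepancy comes from the tallying or from the error files themselves.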

jarmoza commented 8 months ago

Blocked until the new line extraction method, eynollah, is integrated into the QA line extraction code.

jarmoza commented 7 months ago

Following up on my earlier comment:

"Just to update here: total_errors appears to be inaccurate for the book level stats – and likely unique_errors follows suit. Needs investigation."

After reviewing the reported errors in a meeting soon after that comment, I recall that most of them could be disregarded as warnings/will-not-fix errors coming from the watershed code.

printprobability / qa-workflow #21