printprobability / qa-workflow

Quality Assurance testing for the Print & Probability book processing and ingestion pipeline
MIT License

Limit cropping to 50 pages at a time #19

Open jarmoza opened 11 months ago

jarmoza commented 11 months ago

The current PnP pipeline only crops 50 pages at a time. The autocrop QA script should be adapted to match this limit. This should address the out-of-memory errors that are still being experienced.

jarmoza commented 11 months ago

It's possible that this should be extended to line extraction QA as well.

jarmoza commented 11 months ago

Some books that previously hit out-of-memory errors:

slurmstepd: error: Detected 1 oom-kill event(s):
    slurm-output-cole_R223278_DNLM_2_sureguide1665_6bff7bd3-cd29-4cf6-ba83-a710b75e7872.out
    slurm-output-mclark_R31063_uklw_2_worksambroseparey1691_6bff7bd3-cd29-4cf6-ba83-a710b75e7872.out
    slurm-output-jgrismond_R20542_NjPT_4_viewofgovernment1662_6bff7bd3-cd29-4cf6-ba83-a710b75e7872.out
    slurm-output-mwhite_R8527_uk_2_grotiusthreebooks1682_6bff7bd3-cd29-4cf6-ba83-a710b75e7872.out
    slurm-output-anon_R2930_iur_8_twotreatisesofgov1690_6bff7bd3-cd29-4cf6-ba83-a710b75e7872.out

jarmoza commented 11 months ago

By limiting the number of pages via config, we will likely eliminate the out-of-memory errors seen in previous QA implementations.

The suggestion is a PAGES_PER_THREAD variable in each QA module's config YAML, with a default (suggested) limit of 50 pages.
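
A rough sketch of how a QA module might honor that limit, for discussion only. Only the PAGES_PER_THREAD key and the default of 50 come from this issue. The helper names (`batch_pages`, `run_qa_in_batches`), the `qa_fn` callback, and passing the already-parsed config as a plain dict are hypothetical and do not reflect the existing qa-workflow code.

```python
from typing import Callable, Iterator, List, Mapping, Sequence

# Default taken from the limit suggested in this issue.
DEFAULT_PAGES_PER_THREAD = 50


def batch_pages(
    pages: Sequence[str],
    pages_per_thread: int = DEFAULT_PAGES_PER_THREAD,
) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `pages_per_thread` pages."""
    if pages_per_thread < 1:
        raise ValueError("pages_per_thread must be a positive integer")
    for start in range(0, len(pages), pages_per_thread):
        yield list(pages[start:start + pages_per_thread])


def run_qa_in_batches(
    pages: Sequence[str],
    config: Mapping[str, object],
    qa_fn: Callable[[List[str]], list],
) -> list:
    """Run a (hypothetical) QA function one batch at a time.

    `config` stands in for the module's parsed config YAML; only the
    PAGES_PER_THREAD key is read here. Because each batch is processed
    and released before the next one starts, at most one batch of pages
    needs to be held in memory at once.
    """
    limit = int(config.get("PAGES_PER_THREAD", DEFAULT_PAGES_PER_THREAD))
    results: list = []
    for batch in batch_pages(pages, limit):
        results.extend(qa_fn(batch))
    return results
```

With this shape, a 120-page book would be processed as batches of 50, 50, and 20 pages, and the same helper could be reused for line extraction QA if the limit is extended there.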

jarmoza commented 10 months ago

Blocked until the new line extraction method, eynollah, is integrated into the QA line extraction code.