printprobability / qa-workflow

Quality Assurance testing for the Print & Probability book processing and ingestion pipeline
MIT License

New command for gathering statistics on data set #16

Closed: jarmoza closed this issue 11 months ago

jarmoza commented 11 months ago

Max suggested that gathering some stats on the image files in our QA data set would help us curate the data set and understand where the bottlenecks in the QA pipeline are.

The following metrics should be gathered over a given book directory by a new QA command, data_stats. It can be implemented in the base class QA_Module in qa_utilities.py:

  1. File size per page
  2. Number of pages
  3. Width and height in pixels per image
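
A minimal sketch of how the three metrics above could be collected. It is written as a standalone function because the interface of QA_Module is not shown here; the function signature, the returned dictionary layout, the IMAGE_EXTENSIONS set, and the use of Pillow to read pixel dimensions are all assumptions for illustration, not the project's actual implementation. It also assumes a flat book directory with one image file per page.

```python
from pathlib import Path

# Assumed set of page-image extensions; the issue does not name the formats.
IMAGE_EXTENSIONS = {".tif", ".tiff", ".png", ".jpg", ".jpeg"}


def pillow_size(path):
    """Return (width, height) in pixels using Pillow (third-party, assumed installed)."""
    from PIL import Image  # lazy import: only needed when this default reader is used

    with Image.open(path) as img:
        return img.size


def data_stats(book_dir, read_size=pillow_size):
    """Gather per-page file size and pixel dimensions, plus the page count.

    Hypothetical helper: assumes every image file directly inside
    book_dir is one page of the book.
    """
    pages = []
    for path in sorted(Path(book_dir).iterdir()):
        # Skip subdirectories and anything that is not a known image type.
        if not path.is_file() or path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        width, height = read_size(path)
        pages.append(
            {
                "file": path.name,
                "size_bytes": path.stat().st_size,  # metric 1: file size per page
                "width_px": width,                  # metric 3: width in pixels
                "height_px": height,                # metric 3: height in pixels
            }
        )
    # Metric 2: number of pages is the number of image files found.
    return {"num_pages": len(pages), "pages": pages}
```

Calling data_stats("path/to/book") would return the page count along with one record per page. The read_size parameter is there so the dimension reader can be swapped out, for example if the pipeline already loads images with another library.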