Closed by robertknight 8 months ago
https://github.com/robertknight/ocrs/pull/32 (mostly) solved the accuracy problem, at the cost of much longer inference times (up to ~2x) for the longest lines. It might be possible to address this in future by padding rather than scaling long lines, and applying a similar transformation at training time.
Text recognition preprocessing currently resizes all input lines to be 64px high, and scales the width proportionally, but constrained to a maximum of 800px. The 800px max-width was a limit used during training to limit the max memory usage of batches.
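For illustration, here is a minimal sketch of that resize step, assuming the `image` crate; it is not the actual ocrs code, and the names (`preprocess_line`, `TARGET_HEIGHT`, `MAX_WIDTH`) are placeholders:

```rust
// Hypothetical sketch (not the ocrs implementation) of the preprocessing
// described above, using the `image` crate.
use image::{imageops, imageops::FilterType, GrayImage};

const TARGET_HEIGHT: u32 = 64; // all lines are resized to 64px high
const MAX_WIDTH: u32 = 800; // training-time cap used to bound batch memory

fn preprocess_line(line: &GrayImage) -> GrayImage {
    let (w, h) = line.dimensions();
    // Scale the width in proportion to the new 64px height...
    let scaled_w = ((w as f32) * (TARGET_HEIGHT as f32) / (h as f32)).round() as u32;
    // ...but clamp it to MAX_WIDTH. Long lines get squashed horizontally here,
    // which is what leads to missed or misidentified characters.
    let out_w = scaled_w.min(MAX_WIDTH).max(1);
    imageops::resize(line, out_w, TARGET_HEIGHT, FilterType::Triangle)
}
```

Raising or removing `MAX_WIDTH` in a sketch like this corresponds to the wider inference-time limit tested below.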
Using the new `--text-line-images` option to save preprocessed text line images to `lines/`, it becomes apparent that the max width limit can end up squashing text too much, causing some characters or spaces in long lines to be missed, or letters to be misidentified. Although the recognition model was trained with a max input width of 800px, it generalizes to longer sequence lengths, so we can actually use wider images at inference time. From a quick test, it looks like doing so fixes the accuracy errors with the image below.
Input image:
This is a screenshot from a featured Wikipedia article:
Example preprocessed line:
Recognition output:
With default 800px limit:
The line in the source image is ~1310px × ~30px; scaled to a 64px height, the aspect-preserving width would be ~2800px, so the 800px cap squashes it horizontally by roughly 3.5×.
With 1600px limit (set here):