robertknight / ocrs

Rust library and CLI tool for OCR (extracting text from images)
Apache License 2.0

Improve recognition accuracy for long text lines #31

Closed. robertknight closed this 8 months ago

robertknight commented 8 months ago

Text recognition preprocessing currently resizes every input line to a height of 64px and scales the width proportionally, capped at a maximum of 800px. The 800px cap was used during training to bound the memory usage of batches.
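A rough sketch of the width calculation described above. This is not taken from the ocrs source: the function name, the parameter names and the rounding behaviour are all assumptions made for illustration.

```rust
/// Hypothetical helper (not ocrs' actual code): width of a text line after
/// resizing it to `target_height`, keeping the aspect ratio but never
/// exceeding `max_width`.
fn scaled_line_width(src_width: u32, src_height: u32, target_height: u32, max_width: u32) -> u32 {
    assert!(src_height > 0, "line height must be non-zero");
    assert!(max_width > 0, "max width must be non-zero");

    // Scale factor that brings the line to the target height.
    let scale = target_height as f64 / src_height as f64;

    // Width that would preserve the aspect ratio (rounding is an assumption).
    let proportional = (src_width as f64 * scale).round() as u32;

    // Apply the cap; anything wider gets squashed horizontally.
    proportional.clamp(1, max_width)
}

fn main() {
    // A short line stays under the cap and keeps its aspect ratio.
    println!("{}", scaled_line_width(300, 32, 64, 800)); // prints 600

    // A long line hits the cap, so its width is no longer proportional.
    println!("{}", scaled_line_width(1310, 30, 64, 800)); // prints 800
}
```

Any line whose proportional width exceeds the cap is compressed horizontally, and the longer the line, the stronger the compression.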

Saving the preprocessed text line images to lines/ with the new --text-line-images option makes it apparent that the max width limit can squash text too much, causing some characters or spaces in long lines to be missed and some letters to be misidentified.

Although the recognition model was trained with a max input width of 800px, it generalizes to longer sequence lengths, so we can use wider images at inference time. A quick test suggests that doing so fixes the accuracy errors with the image below.

Input image:

This is a screenshot from a featured Wikipedia article:

benty

Example preprocessed line:

line-0

Recognition output:

With default 800px limit:

The Benty Grange hanging bowlis a fragmentary Anglo-Saxon artitact trom the seventh century AD. Al
thaf remains are parts of twn escutcheons: bronze frames hat are usually circular and elaborately

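The size of the line in the source image is ~1310px x ~30px. Scaled proportionally to a height of 64px it would be roughly 2800px wide, so the 800px limit squashes it horizontally by a factor of about 3.5.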
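To put numbers on the squashing for this line, here is a small calculation comparing the 800px limit with a 1600px limit. The function name is hypothetical and the line dimensions are the approximate ones quoted above, so treat the results as estimates.

```rust
/// Hypothetical helper: how much a line is compressed horizontally once its
/// proportional width (at `target_height`) is capped at `max_width`.
/// Returns 1.0 when the line fits without squashing.
fn horizontal_squash(src_width: f64, src_height: f64, target_height: f64, max_width: f64) -> f64 {
    // Width the line would have if the aspect ratio were preserved.
    let proportional_width = src_width * target_height / src_height;
    (proportional_width / max_width).max(1.0)
}

fn main() {
    // Approximate size of the line in the source image.
    let (w, h) = (1310.0, 30.0);

    for max_width in [800.0, 1600.0] {
        let factor = horizontal_squash(w, h, 64.0, max_width);
        println!("max width {max_width}px: squashed by ~{factor:.2}x");
    }
}
```

Under these assumptions the line is compressed by roughly 3.5x at 800px and roughly 1.75x at 1600px, so the wider limit halves the distortion rather than removing it.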

With 1600px limit (set here):

The Benty Grange hanging bowl is a fragmentary Anglo-Saxon artifact from the seventh century AD, All
that remains are parts of two escutcheons: bronze frames that are usually circular and elaborately

robertknight commented 8 months ago

https://github.com/robertknight/ocrs/pull/32 (mostly) solved the accuracy problem, at the cost of much longer inference times (up to ~2x) for the longest lines. It might be possible to address this in future by padding rather than scaling long lines, and applying a similar transformation at training time.