robertknight / ocrs

Rust library and CLI tool for OCR (extracting text from images)
Apache License 2.0

Character bounding boxes #24

Closed dcao closed 7 months ago

dcao commented 7 months ago

First off, thanks for making this project public; it's super cool work!

I was curious if ocrs would be able to return the bounding boxes for individual characters on a page (or if there was some way to tune the sensitivity of word bounding box detection generally). It seems like a lot of this code is in detector.rs, and I was curious how to tune the parameters to potentially achieve character-by-character bounding box detection, or if these parameters are tied to how the underlying model is trained itself.

robertknight commented 7 months ago

Ocrs can provide character-level bounding boxes, but only after doing text recognition. The initial text detection phase produces only word-level bounding boxes, but the text recognition phase produces character-level bounding boxes as a side-effect of recognizing the text.

To get character bounding boxes:

  1. Call OcrEngine::recognize_text to get a Vec<Option<TextLine>> (see also the hello_ocr example)
  2. On each TextLine, call the TextItem::chars trait method to get TextChars, each of which has a rect field.

An alternative method is to use TextLine::words and make the simplifying assumption that each character in a word has the same width (i.e. average_char_width = word_width / n_chars_in_word). From this you can approximate a character's horizontal position as char_index * average_char_width. This often produces results that are "good enough" for tasks like text selection.
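The equal-width approximation can be sketched as a small standalone function. `Rect` and `approx_char_boxes` here are hypothetical illustrations, not part of the ocrs API:

```rust
/// A simple axis-aligned box; a stand-in for the word rect you would
/// get from TextLine::words (not the actual ocrs Rect type).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rect {
    left: f32,
    top: f32,
    right: f32,
    bottom: f32,
}

/// Split `word` into `n_chars` equal-width boxes, assuming each
/// character occupies the same horizontal extent.
fn approx_char_boxes(word: Rect, n_chars: usize) -> Vec<Rect> {
    let avg_width = (word.right - word.left) / n_chars as f32;
    (0..n_chars)
        .map(|i| Rect {
            left: word.left + i as f32 * avg_width,
            top: word.top,
            right: word.left + (i + 1) as f32 * avg_width,
            bottom: word.bottom,
        })
        .collect()
}

fn main() {
    // A 30px-wide word box containing 3 characters, e.g. "ocr".
    let word = Rect { left: 10.0, top: 0.0, right: 40.0, bottom: 12.0 };
    for b in approx_char_boxes(word, 3) {
        println!("{:?}", b);
    }
}
```

The vertical extent is simply inherited from the word box, which matches how this approximation is typically used for selection highlights.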

Some notes on how recognition produces character boxes as a side effect:

The text recognition phase takes images of text lines as input and outputs a matrix of shape (alphabet size, image width / 4) that encodes the character sequence. For each column, the class with the highest score is taken ("greedy decoding"), which produces a sequence of repeated characters plus separators ("blanks"). If the text is "ocr", the output sequence might be ['o', 'o', 'o', <blank>, 'c', 'c', <blank>, 'r']; removing repetitions and blanks yields "ocr". The start/end columns of each character can then be mapped back to the input image to get that character's start/end X offsets. Training only scores the accuracy of the recognized text ("CTC loss"), but the model naturally learns to produce reasonably good character offsets. More details on this approach to sequence modeling can be found in this article.
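The collapse step described above can be illustrated with a self-contained sketch that records the start/end column of each emitted character (this is an illustration of greedy CTC decoding, not the actual ocrs decoder; the column spans would then be scaled by the model's 4x downsampling factor to get image X offsets):

```rust
/// Stand-in for the CTC blank label.
const BLANK: char = '\0';

/// Collapse a per-column label sequence into (char, start_col, end_col)
/// triples, where end_col is exclusive. Consecutive repeats of the same
/// label extend the current character; blanks separate characters, so a
/// repeated label after a blank starts a new character (e.g. "ll").
fn ctc_greedy_collapse(labels: &[char]) -> Vec<(char, usize, usize)> {
    let mut out: Vec<(char, usize, usize)> = Vec::new();
    let mut prev = BLANK;
    for (col, &ch) in labels.iter().enumerate() {
        if ch != BLANK && ch == prev {
            // Repeated label: extend the current character's span.
            out.last_mut().unwrap().2 = col + 1;
        } else if ch != BLANK {
            out.push((ch, col, col + 1));
        }
        prev = ch;
    }
    out
}

fn main() {
    // The "ocr" example from the text.
    let cols = ['o', 'o', 'o', BLANK, 'c', 'c', BLANK, 'r'];
    let decoded = ctc_greedy_collapse(&cols);
    let text: String = decoded.iter().map(|&(c, _, _)| c).collect();
    println!("{text} {decoded:?}");
}
```

Here the decoded text is "ocr" with column spans (0, 3), (4, 6), and (7, 8), which map back to X ranges in the line image after multiplying by 4.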

On this question:

I was curious if ocrs would be able to return the bounding boxes for individual characters on a page (or if there was some way to tune the sensitivity of word bounding box detection generally)

The text detection model outputs a binary map with, for each pixel, the probability that it is part of a text word. Internally there is a text/non-text threshold, but it is not currently exposed via the API. I might expose it as a setting in the future, but improving the overall accuracy of the model will reduce the need to tune it.
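The internal thresholding step amounts to binarizing the probability map. A minimal sketch, assuming a flat `f32` probability buffer and an illustrative 0.5 threshold (not the value ocrs actually uses):

```rust
/// Turn a per-pixel text probability map into a binary text/non-text
/// mask by comparing each probability against a threshold.
fn binarize(prob_map: &[f32], threshold: f32) -> Vec<bool> {
    prob_map.iter().map(|&p| p > threshold).collect()
}

fn main() {
    let probs = [0.1, 0.7, 0.95, 0.4];
    println!("{:?}", binarize(&probs, 0.5));
}
```

Raising the threshold shrinks detected word regions (fewer pixels classified as text); lowering it grows them, which is why exposing it would let users trade off merged words against dropped characters.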

dcao commented 7 months ago

Thanks a ton for the detailed response, really appreciate it! This is more than sufficient for my needs, so I'll go ahead and close the issue.