dcao closed this issue 7 months ago
Ocrs can provide character-level bounding boxes, but only after doing text recognition. The initial text detection phase produces only word-level bounding boxes, but the text recognition phase produces character-level bounding boxes as a side-effect of recognizing the text.
To get character bounding boxes:
1. Call OcrEngine::recognize_text to get a Vec<Option<TextLine>> (see also the hello_ocr example).
2. For each TextLine, call the TextItem::chars trait method to get TextChar values, which have a rect property.
An alternative method is to use TextLine::words and make the simplifying assumption that each character in the word has the same width (i.e. average_char_width = word_width / n_chars_in_word). From this you can approximate the position of a character using char_index * average_char_width. This often produces results that are "good enough" for things like text selection.
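The equal-width approximation can be sketched in a few lines of Rust. This is a standalone illustration rather than ocrs code: BBox and approx_char_boxes are hypothetical names, and the word box is assumed to be axis-aligned with characters laid out left to right.

```rust
/// Axis-aligned box in image pixel coordinates.
/// Hypothetical type for this sketch, not part of the ocrs API.
#[derive(Debug, Clone, Copy, PartialEq)]
struct BBox {
    left: f32,
    top: f32,
    width: f32,
    height: f32,
}

/// Split a word box into equal-width character boxes, one per character.
/// Assumes every character in the word has the same width.
fn approx_char_boxes(word: BBox, text: &str) -> Vec<(char, BBox)> {
    let n_chars_in_word = text.chars().count();
    if n_chars_in_word == 0 {
        return Vec::new();
    }
    let average_char_width = word.width / n_chars_in_word as f32;
    text.chars()
        .enumerate()
        .map(|(char_index, ch)| {
            let bbox = BBox {
                // Offset of this character from the left edge of the word.
                left: word.left + char_index as f32 * average_char_width,
                top: word.top,
                width: average_char_width,
                height: word.height,
            };
            (ch, bbox)
        })
        .collect()
}
```

For a word box starting at x = 100 with width 60 and the text "ocr", this yields three boxes of width 20 starting at x = 100, 120 and 140.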
Some notes on how recognition produces character boxes as a side effect:
The text recognition phase takes images of text lines as input and outputs an (alphabet size, 1/4 image width) matrix that encodes the sequence of characters. For each column the highest score is taken to get the character class ("greedy decoding"), and that produces a sequence with repeated characters plus separators ("blanks"). So if the text is "ocr", the output sequence might be ['o', 'o', 'o', <blank>, 'c', 'c', <blank>, 'r']. Repetitions and blanks are removed to get "ocr".
The start/end columns for each character can be mapped back to the input image to get the start/end X offsets of the character. The text recognition model training only scores the accuracy of the text ("CTC loss"), but the model naturally learns to produce reasonably good character offsets. More details on this approach to sequence modeling can be found in this article.
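The decoding steps described here can be sketched as follows. This is a minimal illustration, not the ocrs implementation: the names DecodedChar, greedy_decode and x_range are hypothetical, the scores are assumed to be stored as one Vec per output column, class 0 is assumed to be the blank, and the downsampling factor is taken from the "1/4 image width" figure above.

```rust
/// Class index assumed to represent the blank separator in this sketch.
const BLANK: usize = 0;

/// One decoded character and the range of output columns it covers.
#[derive(Debug, PartialEq)]
struct DecodedChar {
    ch: char,
    start_col: usize, // first column, inclusive
    end_col: usize,   // last column, exclusive
}

/// Index of the highest score in one column (returns 0 for an empty column).
fn argmax(column: &[f32]) -> usize {
    let mut best = 0;
    for (i, &score) in column.iter().enumerate() {
        if score > column[best] {
            best = i;
        }
    }
    best
}

/// Greedy decoding: take the best class per column, merge repeats, drop blanks.
/// `alphabet[i]` is the character for class `i + 1`, since class 0 is the blank.
fn greedy_decode(columns: &[Vec<f32>], alphabet: &[char]) -> Vec<DecodedChar> {
    let mut decoded: Vec<DecodedChar> = Vec::new();
    let mut prev_class = BLANK;
    for (col, scores) in columns.iter().enumerate() {
        let class = argmax(scores);
        if class != BLANK {
            if class == prev_class {
                // Same character as the previous column: extend its span.
                if let Some(last) = decoded.last_mut() {
                    last.end_col = col + 1;
                }
            } else {
                // New character, or the same one again after a blank.
                decoded.push(DecodedChar {
                    ch: alphabet[class - 1],
                    start_col: col,
                    end_col: col + 1,
                });
            }
        }
        prev_class = class;
    }
    decoded
}

/// Map a column span back to X offsets in the input line image.
/// `downsample` is the ratio of image width to output columns (4 in the text above).
fn x_range(ch: &DecodedChar, downsample: usize) -> (usize, usize) {
    (ch.start_col * downsample, ch.end_col * downsample)
}
```

With the "ocr" sequence above and a factor of 4, 'o' covers columns 0 to 3 and so maps to X offsets 0 to 12 in the line image. The blank between two runs is what lets a genuinely doubled letter be decoded as two characters rather than merged into one.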
On this question:
I was curious if ocrs would be able to return the bounding boxes for individual characters on a page (or if there was some way to tune the sensitivity of word bounding box detection generally)
The text detection model outputs a binary map with a probability for each pixel being part of a text word. Internally there is a text/non-text threshold, but this is not currently exposed as an API. I might expose this in future as a setting, but improving the overall accuracy of the model will reduce the need to tune it.
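To illustrate what a text/non-text threshold does, here is a small standalone sketch. It is not taken from ocrs, whose internal threshold is not exposed: the function names and the threshold values used below are hypothetical, and the probability map is assumed to be a row-major grid of values in [0, 1].

```rust
/// Turn a per-pixel text probability map into a binary text/non-text mask.
/// A pixel counts as text when its probability is above `threshold`.
fn threshold_mask(probs: &[Vec<f32>], threshold: f32) -> Vec<Vec<bool>> {
    probs
        .iter()
        .map(|row| row.iter().map(|&p| p > threshold).collect())
        .collect()
}

/// Count the pixels classified as text in a mask.
fn count_text_pixels(mask: &[Vec<bool>]) -> usize {
    mask.iter().flatten().filter(|&&is_text| is_text).count()
}
```

Lowering the threshold marks more pixels as text, which tends to merge nearby regions; raising it does the opposite. That is the kind of sensitivity tuning asked about in the question.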
Thanks a ton for the detailed response, really appreciated it! This is more than sufficient for my needs, so I'll go ahead and close the issue.
First off, thanks for making this project public; it's super cool work!
I was curious if ocrs would be able to return the bounding boxes for individual characters on a page (or if there was some way to tune the sensitivity of word bounding box detection generally). It seems like a lot of this code is in detector.rs, and I was curious how to tune the parameters to potentially achieve character-by-character bounding box detection, or if these parameters are tied to how the underlying model is trained itself.