Closed robertknight closed 1 year ago
When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called
I think we can fix the issue by enabling some parts of the code in this block instead of disabling the whole block of code when the legacy engine is disabled.
Basic Information
tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy
Operating System
macOS 13 Ventura
Compiler
clang 14.0
Current Behavior
When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.
Steps to reproduce:
In the generated output.hocr file, ocr_line entries look like this:
Expected Behavior
If orientation information isn't available I'd expect the image to always be treated as if it were page-up. So entries should look like this:
Suggested Fix
Internally, it looks like the issue is that:
- ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
- The null vector then reaches PageIterator::Orientation, which does not handle this case correctly, as it converts this null vector to ORIENTATION_PAGE_DOWN (https://github.com/tesseract-ocr/tesseract/blob/a3b9acfa4a2f28a9956e830c7354875ebb7213b4/src/ccmain/pageiterator.cpp#L585)
- As a result, the hOCR output reports textangle 180 and omits baseline info

Some fixes I tested locally were to change the initialization of ColumnFinder::text_rotation_ to be the same as the norotation value in ColumnFinder::CorrectOrientation, or to change the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR, but I'm not sure which approach is preferred.
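
To make the failure mode and the two candidate fixes concrete, here is a minimal standalone sketch. It is not Tesseract code: Vec2, Orientation, Rotate, Classify, ClassifyGuarded, kNoRotation and CheckSketch are hypothetical stand-ins, and the sketch assumes that a rotation is applied as a complex multiplication and that the identity rotation is the vector (1, 0).

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration only; not Tesseract's real types.
enum class Orientation { PageUp, PageRight, PageDown, PageLeft };

struct Vec2 {
  float x;
  float y;
};

// Assumed identity rotation, i.e. what a "no rotation" value would look like.
constexpr Vec2 kNoRotation{1.0f, 0.0f};

// Rotate v by rotation vector r, treating both as complex numbers.
// r = (1, 0) leaves v unchanged; r = (0, 0) collapses v to (0, 0).
Vec2 Rotate(Vec2 v, Vec2 r) {
  return {v.x * r.x - v.y * r.y, v.y * r.x + v.x * r.y};
}

// Unguarded classification: rotate the "up" vector and inspect the result.
// A null rotation yields (0, 0), which fails the y > 0 test and therefore
// falls through to PageDown.
Orientation Classify(Vec2 rotation) {
  const Vec2 up = Rotate(Vec2{0.0f, 1.0f}, rotation);
  if (up.x == 0.0f) {
    return up.y > 0.0f ? Orientation::PageUp : Orientation::PageDown;
  }
  return up.x > 0.0f ? Orientation::PageRight : Orientation::PageLeft;
}

// Candidate fix B: treat a null rotation vector as "page up".
Orientation ClassifyGuarded(Vec2 rotation) {
  if (rotation.x == 0.0f && rotation.y == 0.0f) {
    return Orientation::PageUp;
  }
  return Classify(rotation);
}

// Self-check covering the reported behavior and both candidate fixes.
void CheckSketch() {
  // Reported behavior: a rotation that was never set reads as upside-down.
  assert(Classify(Vec2{0.0f, 0.0f}) == Orientation::PageDown);
  // Candidate fix A: initialize the rotation to the identity instead of null.
  assert(Classify(kNoRotation) == Orientation::PageUp);
  // Candidate fix B: guard against the null vector at classification time.
  assert(ClassifyGuarded(Vec2{0.0f, 0.0f}) == Orientation::PageUp);
}
```

Calling CheckSketch() from a main function exercises all three cases. In this sketch, fix A removes the null value at its source, while fix B also covers any other code path that might leave the rotation unset; whether either trade-off carries over to the real code is for the maintainers to judge.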