tesseract-ocr / tesseract

Tesseract Open Source OCR Engine (main repository)
https://tesseract-ocr.github.io/
Apache License 2.0

HOCR output always sets `textangle 180` and omits baseline info if Tesseract is compiled with `--disable-legacy` #4010

Closed by robertknight 1 year ago

robertknight commented 1 year ago

Basic Information

tesseract 5.3.0-19-ga3b9ac, compiled with --disable-legacy

Operating System

macOS 13 Ventura

Compiler

clang 14.0

Current Behavior

When Tesseract is compiled with --disable-legacy, hOCR output reports each line as being upside-down (textangle 180) and omits baseline information.

Steps to reproduce:

./configure --disable-legacy
make
./tesseract some-image.jpg output hocr

In the generated output.hocr file, ocr_line entries look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; textangle 180; x_size 34; x_descenders 8; x_ascenders 9">

Expected Behavior

If orientation information isn't available, I'd expect the image to always be treated as page-up, so entries should look like this:

<span class='ocr_line' id='line_1_142' title="bbox 1334 3054 2119 3088; baseline 0 -8; x_size 34; x_descenders 8; x_ascenders 9">
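To check a generated file for this symptom, the `title` attribute can be split into its semicolon-separated properties. The following is a minimal sketch; `ParseHocrTitle` and `LooksAffected` are hypothetical helpers written for illustration and are not part of Tesseract.

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper (not part of Tesseract): splits an hOCR `title`
// attribute such as "bbox 1 2 3 4; textangle 180" into a map from
// property name to the remaining text of that property.
std::map<std::string, std::string> ParseHocrTitle(const std::string &title) {
  std::map<std::string, std::string> props;
  std::istringstream stream(title);
  std::string item;
  while (std::getline(stream, item, ';')) {
    // Skip the spaces left over after each ';' separator.
    const auto start = item.find_first_not_of(' ');
    if (start == std::string::npos) {
      continue;
    }
    item = item.substr(start);
    // The property name ends at the first space; the value is the rest.
    const auto space = item.find(' ');
    if (space == std::string::npos) {
      props[item] = "";
    } else {
      props[item.substr(0, space)] = item.substr(space + 1);
    }
  }
  return props;
}

// True when a line shows the symptom described in this issue:
// it reports textangle 180 and has no baseline property.
bool LooksAffected(const std::string &title) {
  const auto props = ParseHocrTitle(title);
  const auto angle = props.find("textangle");
  return angle != props.end() && angle->second == "180" &&
         props.count("baseline") == 0;
}
```

Applied to the two `ocr_line` titles quoted above, `LooksAffected` should flag the first and accept the second.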

Suggested Fix

Internally, it looks like the issue is that:

  1. ColumnFinder::text_rotation_ is initialized to a null vector. When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called, and so this vector remains null.
  2. This null vector gets propagated to PageIterator::Orientation, which does not handle this case correctly, as it converts this null vector to ORIENTATION_PAGE_DOWN - https://github.com/tesseract-ocr/tesseract/blob/a3b9acfa4a2f28a9956e830c7354875ebb7213b4/src/ccmain/pageiterator.cpp#L585
  3. The hOCR renderer then maps this orientation value to textangle 180 and omits baseline info.
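The chain described in the steps above can be reproduced in isolation. This is a simplified sketch, not the Tesseract source: `Vec2`, `Rotate`, `OrientationFromRotation` and `LineProperty` are illustrative stand-ins, the rotation is assumed to be stored as a (cos, sin) pair, and the `360 - orientation * 90` formula is an assumption chosen because it yields 180 for the page-down case.

```cpp
#include <string>

// Illustrative stand-ins for Tesseract's types; the names are assumptions.
enum class Orientation { kPageUp = 0, kPageRight = 1, kPageDown = 2, kPageLeft = 3 };

struct Vec2 {
  float x;
  float y;
};

// Rotates `v` by the rotation encoded in `rot` as a (cos, sin) pair.
// A null `rot` of (0, 0) collapses every input to (0, 0).
Vec2 Rotate(Vec2 v, Vec2 rot) {
  return {v.x * rot.x - v.y * rot.y, v.y * rot.x + v.x * rot.y};
}

// Step 2: classify the rotated "up" vector with no special case for a
// null vector. For (0, 0), x == 0 and y is not > 0, so the result is
// page-down.
Orientation OrientationFromRotation(Vec2 text_rotation) {
  const Vec2 up = Rotate({0.0F, 1.0F}, text_rotation);
  if (up.x == 0.0F) {
    return up.y > 0.0F ? Orientation::kPageUp : Orientation::kPageDown;
  }
  return up.x > 0.0F ? Orientation::kPageRight : Orientation::kPageLeft;
}

// Step 3: any orientation other than page-up emits only a textangle;
// page-up emits the baseline instead.
std::string LineProperty(Orientation orientation, const std::string &baseline) {
  if (orientation != Orientation::kPageUp) {
    return "textangle " +
           std::to_string(360 - static_cast<int>(orientation) * 90);
  }
  return "baseline " + baseline;
}
```

In this sketch, a null rotation `{0.0F, 0.0F}` ends up as a textangle property with no baseline, while the identity rotation `{1.0F, 0.0F}` keeps the baseline.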

I tested two fixes locally. One changes the initialization of ColumnFinder::text_rotation_ to match the norotation value used in ColumnFinder::CorrectOrientation. The other changes the logic in PageIterator::Orientation to handle null rotation vectors by mapping them to ORIENTATION_PAGE_UP. I'm happy to submit a PR, but I'm not sure which approach is preferred.
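Both options can be sketched outside of Tesseract as follows. The types and names here (`Vec2`, `kNoRotation`, `ColumnStateSketch`, `OrientationFromUpVector`) are hypothetical, and the "no rotation" value is assumed to be the (cos 0, sin 0) pair (1, 0); the actual patch would need to follow the real member and function definitions.

```cpp
// Illustrative stand-ins; none of these names come from Tesseract.
enum class Orientation { kPageUp, kPageRight, kPageDown, kPageLeft };

struct Vec2 {
  float x;
  float y;
};

// Option 1: start from an explicit "no rotation" value, written as a
// (cos, sin) pair, instead of a null vector.
constexpr Vec2 kNoRotation{1.0F, 0.0F};

struct ColumnStateSketch {
  Vec2 text_rotation = kNoRotation;  // stays valid if never overwritten
};

// Option 2: treat a null "up" vector as "orientation unknown" and fall
// back to page-up before classifying by direction.
Orientation OrientationFromUpVector(Vec2 up) {
  if (up.x == 0.0F && up.y == 0.0F) {
    return Orientation::kPageUp;
  }
  if (up.x == 0.0F) {
    return up.y > 0.0F ? Orientation::kPageUp : Orientation::kPageDown;
  }
  return up.x > 0.0F ? Orientation::kPageRight : Orientation::kPageLeft;
}
```

Option 1 fixes the value at its source, while option 2 also protects any other caller that might still pass a null vector.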

amitdo commented 1 year ago

#3997 seems related to this issue.

amitdo commented 1 year ago

> When the legacy engine is disabled, the ColumnFinder::CorrectOrientation function does not get called

https://github.com/tesseract-ocr/tesseract/blob/67841aa89ff8828e5ab1eeae5a9140483da20c23/src/ccmain/pagesegmain.cpp#L335-L406

I think we can fix the issue by keeping the necessary parts of this block enabled when the legacy engine is disabled, instead of disabling the whole block.