ocropus / hocr-tools

Tools for manipulating and evaluating the hOCR format for representing multi-lingual OCR results by embedding them into HTML.

Rotated text lines in hOCR output #148

Open stweil opened 5 years ago

stweil commented 5 years ago

This image contains a full page of vertical text lines. The hOCR output created by Tesseract 4.0 has no direct indicator of which text lines are horizontal and which are vertical.

It might be interesting to have a filter program which detects the line orientation from the hOCR data by interpreting the coordinates of the bounding boxes.
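A minimal sketch of such a filter, using only the Python standard library. It assumes lines carry the class `ocr_line` and words the class `ocrx_word`, each with a `bbox` in the `title` attribute; the names `LineCollector`, `guess_orientation` and `line_orientations` are made up for this illustration. The heuristic is simple: if the word centres of a line spread more along y than along x, the line is treated as vertical, with the aspect ratio of the line box as a fallback. It cannot tell clockwise from counter-clockwise rotation, nor rotated lines from a genuinely top-to-bottom script.

```python
import re
from html.parser import HTMLParser

# Matches the four integer coordinates of an hOCR bbox property.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class LineCollector(HTMLParser):
    """Collect the bbox of every ocr_line and of the ocrx_word elements after it."""

    def __init__(self):
        super().__init__()
        self.lines = []  # list of (line_bbox, [word_bbox, ...])

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        match = BBOX_RE.search(attrs.get("title") or "")
        if not match:
            return
        bbox = tuple(int(v) for v in match.groups())
        if "ocr_line" in classes:
            self.lines.append((bbox, []))
        elif "ocrx_word" in classes and self.lines:
            # Assumption: words follow the line element they belong to.
            self.lines[-1][1].append(bbox)


def guess_orientation(line_bbox, word_bboxes):
    """Heuristic guess: 'horizontal' or 'vertical'."""
    if len(word_bboxes) >= 2:
        xs = [(x0 + x1) / 2 for x0, _, x1, _ in word_bboxes]
        ys = [(y0 + y1) / 2 for _, y0, _, y1 in word_bboxes]
        spread_x = max(xs) - min(xs)
        spread_y = max(ys) - min(ys)
        return "vertical" if spread_y > spread_x else "horizontal"
    # Fallback for lines with fewer than two words: shape of the line box.
    x0, y0, x1, y1 = line_bbox
    return "vertical" if (y1 - y0) > (x1 - x0) else "horizontal"


def line_orientations(hocr_text):
    """Return one orientation guess per ocr_line, in document order."""
    parser = LineCollector()
    parser.feed(hocr_text)
    return [guess_orientation(bbox, words) for bbox, words in parser.lines]
```

Calling `line_orientations(open("page.hocr").read())` (the file name is a placeholder) would give one guess per line. A single short word in a narrow column can still be misclassified by the fallback, so a page-level majority vote might be more robust.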

A similar algorithm would be needed for rendering of the OCR results, for example in PDF output created by hocr-pdf or by Tesseract or in hocrjs.

See also issue #54.

zuphilip commented 5 years ago

> This image contains a full page of vertical text lines.

Let us be more precise here: The lines are rotated by 90 degrees clockwise.

> The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.

Well, that should be improved first. I think that this rotation should be indicated by the textangle property, see http://kba.cloud/hocr-spec/1.2/#textangle, but @kba might know better than I do.

In the Japanese text, the lines are not rotated, but the text direction is top-to-bottom.

stweil commented 5 years ago

That spec says "angle in degrees by which textual content has been rotate[d] relative to the rest of the page". I think this is neither very precise nor helpful, because for the two pages in question, both pages would have the default value (0°), as each line has the same rotation as "the rest of the page".

zuphilip commented 5 years ago

Tesseract 3.05 used to add the textangle property, see e.g. https://raw.githubusercontent.com/zuphilip/ocr-fileformat-samples/3590006039022801e3847f67feb085b3872585be/samples/hocr/1.1/452114306.hocr . What happened to that?

I agree that the specs are not that clear about the details, see also https://github.com/kba/hocr-spec/issues/101.
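For illustration, a tool that wants to honour textangle could read it from the `title` attribute roughly like this. This is only a sketch: `textangle_from_title` is a made-up helper name, the sample titles in the comments are hypothetical, and a missing property is treated as 0 degrees.

```python
import re

# Matches "textangle <number>" inside an hOCR title attribute.
TEXTANGLE_RE = re.compile(r"textangle\s+(-?\d+(?:\.\d+)?)")


def textangle_from_title(title):
    """Return the textangle in degrees, or 0.0 if the property is absent."""
    match = TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else 0.0


# Hypothetical title strings, not copied from real Tesseract output:
#   textangle_from_title("bbox 36 92 96 1116; textangle 90")
#   textangle_from_title("bbox 36 92 1116 130; baseline 0 -5")
```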

stweil commented 5 years ago

That's an important hint. You are right, the old hOCR for the same image includes the textangle property. I'll open an issue for Tesseract.

stweil commented 1 year ago

hocr-extract-images currently ignores the textangle property, so line images with rotated text don't get rotated into a horizontal line (which is required for training).
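To make the missing step concrete, here is a toy sketch of undoing a right-angle rotation. It is not how hocr-extract-images works; it uses nested lists of pixel values instead of an imaging library so that it stays self-contained. It assumes textangle is a counter-clockwise angle and a multiple of 90 degrees; the function names are invented for this example.

```python
def rotate_clockwise(rows):
    """Rotate a row-major pixel grid by 90 degrees clockwise."""
    return [list(column) for column in zip(*rows[::-1])]


def deskew(rows, textangle):
    """Undo a counter-clockwise rotation of `textangle` degrees.

    Assumption: textangle is a multiple of 90; other angles would need
    real resampling, which this sketch does not attempt.
    """
    quarter_turns, remainder = divmod(textangle % 360, 90)
    if remainder:
        raise ValueError("this sketch only handles multiples of 90 degrees")
    # Rotating clockwise by the same angle cancels the rotation.
    for _ in range(int(quarter_turns)):
        rows = rotate_clockwise(rows)
    return rows
```

A real implementation would rotate the cropped line image with an imaging library and would also have to settle the sign convention, which the discussion above shows is not obvious from the spec.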