Open stweil opened 5 years ago
This image contains a full page of vertical text lines.
Let us be more precise here: the lines are rotated by 90 degrees clockwise.
The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.
Well, but that should be improved first. I think that this rotation should be indicated by the textangle property, see http://kba.cloud/hocr-spec/1.2/#textangle, but @kba might know better than I do.
In the Japanese text the lines are not rotated but the text direction is from top-to-bottom.
That spec says "angle in degrees by which textual content has been rotate[d] relative to the rest of the page". I think this is neither very precise nor helpful, because both pages in question would have the default value (0°), as each line has the same rotation as "the rest of the page".
Tesseract 3.05 used to add the textangle property (see e.g. https://raw.githubusercontent.com/zuphilip/ocr-fileformat-samples/3590006039022801e3847f67feb085b3872585be/samples/hocr/1.1/452114306.hocr). What happened to it?
I agree that the specs are not that clear about the details, see also https://github.com/kba/hocr-spec/issues/101.
That's an important hint. You are right, the old hOCR for the same image includes the textangle property. I'll open an issue for Tesseract.
hocr-extract-images currently ignores the textangle property, so line images with rotated text don't get rotated into a horizontal line (which is required for training).
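As a rough illustration of what honoring the property could look like, here is a minimal sketch that reads textangle from an hOCR title attribute. The function name get_textangle and the sample title strings are hypothetical and not taken from hocr-extract-images. Which direction the line image then has to be rotated is left open on purpose, because the spec is not clear about that.

```python
import re

# Matches a "textangle <number>" property inside an hOCR title attribute,
# e.g. "bbox 36 92 96 1462; textangle 90; x_size 40".
TEXTANGLE_RE = re.compile(r"(?:^|;)\s*textangle\s+(-?\d+(?:\.\d+)?)")


def get_textangle(title, default=0.0):
    """Return the textangle in degrees, or `default` if the property is absent."""
    match = TEXTANGLE_RE.search(title)
    return float(match.group(1)) if match else default


if __name__ == "__main__":
    print(get_textangle("bbox 36 92 96 1462; textangle 90; x_size 40"))
    print(get_textangle("bbox 36 92 1462 140"))
```

A tool that crops line images could call such a helper for each ocr_line element and rotate the cropped image only when the result is non-zero.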
This image contains a full page of vertical text lines. The hOCR output which was created by Tesseract 4.0 has no direct indicator which text lines are horizontal or vertical.
It might be interesting to have a filter program which detects the line orientation from the hOCR data by interpreting the coordinates of the bounding boxes.
A similar algorithm would be needed for rendering of the OCR results, for example in PDF output created by hocr-pdf or by Tesseract or in hocrjs.
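A minimal sketch of such a filter, assuming a simple heuristic that is not an existing hocr-tools feature: if the word boxes of a line are spread more along the y axis than along the x axis, the line is treated as vertical. With fewer than two word boxes it falls back to the aspect ratio of the line box. The names parse_bbox and guess_orientation are hypothetical. Note that this cannot tell rotated lines apart from text that runs top-to-bottom, as in the Japanese example.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property of an hOCR title attribute.
BBOX_RE = re.compile(r"\bbbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def guess_orientation(line_bbox, word_bboxes=()):
    """Guess 'vertical' or 'horizontal' for one text line.

    Heuristic (an assumption, not part of the hOCR spec): compare how far
    the word centers spread in x and in y; without at least two words,
    compare the width and height of the line box instead.
    """
    if len(word_bboxes) >= 2:
        xs = [(x0 + x1) / 2 for x0, _, x1, _ in word_bboxes]
        ys = [(y0 + y1) / 2 for _, y0, _, y1 in word_bboxes]
        dx = max(xs) - min(xs)
        dy = max(ys) - min(ys)
    else:
        x0, y0, x1, y1 = line_bbox
        dx = x1 - x0
        dy = y1 - y0
    return "vertical" if dy > dx else "horizontal"


if __name__ == "__main__":
    line = parse_bbox("bbox 100 50 140 900; x_size 40")
    print(guess_orientation(line))
```

Single-word lines and very short lines are the obvious weak spot of this heuristic, so a real filter would probably want to look at the majority orientation of the whole page as well.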
See also issue #54.