kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Explicitly specify behavior of baseline property in the presence of textangle property #112

Open p12tic opened 2 years ago

p12tic commented 2 years ago

This PR fixes an error in the specification related to the interaction between the baseline and textangle properties.

Currently the baseline property is underspecified: the polynomial is said to be in the "coordinate system of the line", which is not defined anywhere else in the document. This makes it unclear how baseline should be specified when textangle is non-zero.

One possible interpretation is that textangle should be ignored when specifying baseline. That would leave an error in the specification, because completely vertical text would have a baseline slope of positive or negative infinity, which cannot be represented by the current grammar.
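To illustrate the problem with ignoring textangle, here is a minimal Python sketch. It assumes a straight baseline and ignores the sign convention of the image coordinate system. The names `page_slope` and `format_baseline` are hypothetical helpers for this example only; they are not part of the specification or of any OCR engine.

```python
import math


def page_slope(textangle_deg):
    """Slope of a straight baseline measured in page coordinates,
    for a line of text rotated by textangle_deg degrees.

    Assumption: the slope is tan(angle); sign conventions are ignored.
    """
    # Vertical text (90, 270, -90, ...) has no finite slope.
    if textangle_deg % 180 == 90:
        return math.inf
    return math.tan(math.radians(textangle_deg))


def format_baseline(slope, offset):
    """Format a two-coefficient baseline property value.

    Raises ValueError when a coefficient is not a finite number,
    since it could not be written as an ordinary decimal.
    """
    if not (math.isfinite(slope) and math.isfinite(offset)):
        raise ValueError("baseline coefficients must be finite")
    return f"baseline {slope:g} {offset:g}"
```

For horizontal text, `format_baseline(page_slope(0), -5)` gives a well-formed value. For vertical text, `page_slope(90)` is infinite and `format_baseline` raises, which is the case the current grammar cannot express.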

Therefore, textangle should be taken into account. However, it is not clear how it should affect baseline, because the specification does not constrain textangle to any specific set of angles.

This PR fixes the issue by explicitly specifying which coordinate system is used for the baseline polynomial for all possible values of the textangle property.

Because of this issue, tesseract-ocr currently does not output a baseline for non-horizontal text at all. Fixing the specification will hopefully allow it to output baseline information in all cases.