wrznr closed 5 years ago
No, you cannot assume a strict hierarchy.
ocr_page is required.
ocr_line, while not required by the spec, probably should be. You can assume it is there.
ocr_carea should be used for print space / columns, but is not used consistently.
ocr_par isn't used consistently either.
If ocrx_word elements are used, they are within ocr_line. Not by definition but by experience.
I wish I could give you a more stringent answer, but the reality is a lot of documents produced over a long time by implementations based on an underdefined specification.
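In practice this means a consumer should look for the levels it needs wherever they occur, rather than walking a fixed page > carea > par > line > word chain. The following is only a rough sketch of that idea using Python's standard `html.parser`; the names `HocrLines`, `LEVELS`, `VOID` and `SAMPLE` are made up for illustration, the sample markup is hand-written, and the code assumes reasonably well-balanced tags.

```python
from html.parser import HTMLParser

# hOCR classes discussed above, outermost first.
LEVELS = ("ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word")
# Void HTML elements never get a closing tag, so they are not tracked.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class HocrLines(HTMLParser):
    """Collect the words of each ocr_line, whatever ancestors it has."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one entry per open element: its hOCR class or None
        self.lines = []  # one dict per ocr_line encountered

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        level = next((c for c in classes if c in LEVELS), None)
        self.stack.append(level)
        if level == "ocr_line":
            # Record which of the optional levels actually enclose this line.
            ancestors = [lvl for lvl in self.stack[:-1] if lvl]
            self.lines.append({"ancestors": ancestors, "words": []})
        elif level == "ocrx_word" and "ocr_line" in self.stack[:-1]:
            self.lines[-1]["words"].append("")

    def handle_endtag(self, tag):
        if tag not in VOID and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only text inside an ocrx_word that sits inside an ocr_line is kept.
        if "ocrx_word" in self.stack and "ocr_line" in self.stack:
            self.lines[-1]["words"][-1] += data


# Hand-written sample: the first line sits directly in ocr_page,
# the second one has the full carea/par nesting.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 100 100'>
  <span class='ocr_line'><span class='ocrx_word'>no</span> <span class='ocrx_word'>paragraph</span></span>
  <div class='ocr_carea'><p class='ocr_par'>
    <span class='ocr_line'><span class='ocrx_word'>full</span> <span class='ocrx_word'>hierarchy</span></span>
  </p></div>
</div>
"""

parser = HocrLines()
parser.feed(SAMPLE)
parser.close()
for line in parser.lines:
    print(line["ancestors"], line["words"])
```

Both lines are picked up, and the `ancestors` list shows which optional levels were present for each one, so a converter can decide per line whether it has to synthesize a missing block or paragraph.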
WRT https://github.com/filak/hOCR-to-ALTO/issues/10
Does the sequence of ocr_* elements represent a strict hierarchy? I.e., does every level of the hierarchy have to be present, or are some of them "omittable"?
abbyy2hocr.xsl implements the latter while alto2hocr.xsl implements the former (as included in https://github.com/UB-Mannheim/ocr-fileformat).