ocr_carea vs ocrx_block

kba commented 7 years ago

When should engines output the latter?

amitdo commented 7 years ago

Tesseract uses ocr_carea to represent what its API calls 'block'.

It seems that ocr_carea means 'column'. If that's the case, Tesseract should output ocrx_block instead of ocr_carea. The Tesseract API does not expose any information about columns, although it has that information internally.
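To make the proposed change concrete, here is a minimal sketch of relabelling the blocks in existing hOCR output. The fragment is made up for illustration (the ids, bounding boxes, and text are assumptions shaped like Tesseract's output, not real output), and `rename_class` is a hypothetical helper, not part of Tesseract or any hOCR tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical hOCR fragment; ids, bboxes, and text are invented for illustration.
HOCR = """<div class="ocr_page" id="page_1" title="bbox 0 0 1000 1400">
  <div class="ocr_carea" id="block_1_1" title="bbox 50 60 480 700">
    <p class="ocr_par" id="par_1_1" title="bbox 50 60 480 300">first paragraph</p>
  </div>
  <div class="ocr_carea" id="block_1_2" title="bbox 520 60 950 700">
    <p class="ocr_par" id="par_1_2" title="bbox 520 60 950 280">second paragraph</p>
  </div>
</div>"""


def rename_class(root, old, new):
    """Replace the class token `old` with `new` on every element; return the count."""
    count = 0
    for el in root.iter():
        classes = el.get("class", "").split()
        if old in classes:
            # Swap only the matching token so any other class names survive.
            el.set("class", " ".join(new if c == old else c for c in classes))
            count += 1
    return count


root = ET.fromstring(HOCR)
renamed = rename_class(root, "ocr_carea", "ocrx_block")
result = ET.tostring(root, encoding="unicode")
print(renamed)
print(result)
```

This only changes the label. It does not recover column information, which would still have to come from the engine itself.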

amitdo commented 7 years ago

I remembered that someone had complained about this. I just found this message: https://groups.google.com/forum/#!topic/tesseract-ocr/djenIdI5EbI

zuphilip commented 7 years ago

From the hOCR paper (formatting mine):

At the lowest level, the hOCR format represents OCR engine-specific, physical layout, like text blocks, images, and other page content. [...]

However, unlike typesetting markup, which is generally well-defined in terms of typesetting models, the kind of physical markup produced by OCR engines is implementation dependent. For example, a “text block” in an engine may be defined in terms of the existence of whitespace separators of minimal size, or the alignment of individual characters. Likewise, “blocks” are often also style-dependent; for example, a document rendered in a style with vertical inter-paragraph spacing may be represented with a single block for each paragraph, while in the same document rendered in a different style, an entire column of multiple paragraphs may be returned as a single block in the OCR system.

In contrast, in a typesetting model of page layout, these two styles would be represented in the same way as a flowable content area, which would also correspond to the underlying page layout in the source document in any of the standard typesetting systems.

I am not sure when an OCR engine can output anything that is not implementation dependent, i.e. how an engine (or a human, for that matter) can know what the underlying page layout is just by looking at the rendered result.

kba / hocr-spec

ocr_carea vs ocrx_block #28