kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Why is ocr_column obsolete? #76

Open kba opened 7 years ago

amitdo commented 7 years ago

To answer that question, we need to understand ocr_carea...

zuphilip commented 7 years ago

... which is also part of issue #28.

Maybe it was argued that columns cannot be style-independent and that there therefore cannot be an ocr_* element for them. For example, in a LaTeX (source) document I only see the text content, and the splitting into several columns is done in the rendering phase. (guesswork^^)

kba commented 7 years ago

Generally, I'd also favor being careful with document-level semantics, but nesting ocr_carea for everything is also problematic. At the very least, it should be easy to differentiate between the print space and the connected boxes (such as columns and paragraphs) below that level. ALTO has PrintSpace, ComposedBlock, and TextBlock for these purposes.
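
To illustrate the nesting problem, here is a minimal sketch and not anything taken from the spec. It assumes a made-up hOCR fragment in which a print space and two columns are all marked up as nested ocr_carea elements, with invented bounding boxes. The helper class `CareaDepth` is hypothetical. The only thing that tells the levels apart is the nesting depth, which a consumer has to compute itself.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment (bounding boxes are invented): one print space
# containing two columns, all expressed with the same ocr_carea class.
HOCR = """
<div class="ocr_page" title="bbox 0 0 1000 1400">
  <div class="ocr_carea" title="bbox 50 50 950 1350">
    <div class="ocr_carea" title="bbox 50 50 480 1350">
      <p class="ocr_par" title="bbox 50 50 480 300">left column text</p>
    </div>
    <div class="ocr_carea" title="bbox 520 50 950 1350">
      <p class="ocr_par" title="bbox 520 50 950 300">right column text</p>
    </div>
  </div>
</div>
"""


class CareaDepth(HTMLParser):
    """Record how deeply each ocr_carea is nested inside other ocr_careas."""

    def __init__(self):
        super().__init__()
        self.stack = []  # class attribute of every currently open element
        self.found = []  # (carea nesting depth, title) for each ocr_carea

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        cls = attr_map.get("class") or ""
        if cls == "ocr_carea":
            depth = self.stack.count("ocr_carea")
            self.found.append((depth, attr_map.get("title") or ""))
        self.stack.append(cls)

    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()


parser = CareaDepth()
parser.feed(HOCR)
carea_depths = [depth for depth, _ in parser.found]

for depth, title in parser.found:
    # Depth 0 plays the role of a print space here, depth 1 that of a column,
    # but that reading is a convention of this example, not part of the markup.
    print(depth, title)
```

In this sketch the class name alone says nothing about whether a box is a print space or a column, which is the kind of distinction that dedicated element types such as ALTO's would make explicit.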

zuphilip commented 7 years ago

I agree that we may change these things somehow in the future. I could envision that on a page there are different (text) content areas, which might be interrupted by a picture or some other special content. Moreover, there can be special (text) content areas such as a footer, a header, or marginalia. A (text) content area can then contain several columns, each of which may be divided into several text blocks. However, I don't know how close this is to what the spec currently says.
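
As a rough sketch of the hierarchy described in this comment (page, content areas, columns, text blocks), the following data model may help. All class names, the `role` values, and the helper `texts_in_document_order` are hypothetical and are not defined by hOCR or ALTO.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TextBlock:
    text: str


@dataclass
class Column:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class ContentArea:
    # Hypothetical roles: "body", "header", "footer", "marginalia".
    role: str = "body"
    columns: List[Column] = field(default_factory=list)


@dataclass
class Page:
    areas: List[ContentArea] = field(default_factory=list)


def texts_in_document_order(page: Page) -> List[str]:
    """Flatten a page by walking areas, then columns, then text blocks."""
    return [
        block.text
        for area in page.areas
        for column in area.columns
        for block in column.blocks
    ]


page = Page(areas=[
    ContentArea("header", [Column([TextBlock("Running title")])]),
    ContentArea("body", [
        Column([TextBlock("A1"), TextBlock("A2")]),
        Column([TextBlock("B1")]),
    ]),
    # A picture between the two body areas would interrupt the text here.
    ContentArea("body", [Column([TextBlock("C1")])]),
])

print(texts_in_document_order(page))
```

Whether a column level like this belongs in the format at all is exactly the open question of this issue, so the model only shows what such a structure could look like.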

kba commented 7 years ago

> I agree that we may change these things somehow in the future. I could envision that on a page there are different (text) content areas, which might be interrupted by a picture or some other special content.

That's the intention of cflow/ocr_linear I think.

zuphilip commented 7 years ago

I imagine ocr_linear working the same way as <article> in HTML: everything inside (maybe excluding some floats) has a reading order, but the ocr_linear elements themselves may not have a canonical reading order relative to each other. I don't understand the cflow property.