Open kba opened 8 years ago
... which is also part of the issue #28.
Maybe it was argued that columns cannot be style-independent and that therefore there cannot be an ocr_* property for them, i.e. in a LaTeX (source) document I just see the text content, and the splitting into several columns is done in the rendering phase. (guesswork^^)
Generally, I'd also favor being careful with document-level semantics, but nesting ocr_carea for everything is also problematic. At least, it should be easy to differentiate between the print space and connected boxes (like columns, paragraphs) below that level. ALTO has PrintSpace, ComposedBlock, TextBlock for these purposes.
I agree that we may change these things somehow in the future. I could envision that on a page there are different (text) content areas, which might be interrupted by a picture or some other special content. Moreover, there can be special (text) content areas such as the footer, the header or marginalia. A (text) content area can then contain several columns, which may be divided into several text blocks. However, I don't know how "near" this is to what the specs are currently saying.
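To make the nesting I have in mind a bit more concrete, here is a rough Python sketch. All class and field names (Page, ContentArea, Column, TextBlock, role) are made up for illustration; they are not hOCR or ALTO names, and this is only one possible way to model it.

```python
from dataclasses import dataclass, field
from typing import List

# All names below are hypothetical and only illustrate the nesting
# page -> content area -> column -> text block; none of them is taken
# from the hOCR or ALTO specifications.


@dataclass
class TextBlock:
    text: str


@dataclass
class Column:
    blocks: List[TextBlock] = field(default_factory=list)


@dataclass
class ContentArea:
    role: str  # assumed labels, e.g. "body", "header", "footer", "marginalia"
    columns: List[Column] = field(default_factory=list)


@dataclass
class Page:
    areas: List[ContentArea] = field(default_factory=list)


def blocks_in_order(page: Page) -> List[str]:
    """Flatten a page by walking areas, then columns, then blocks, in list order."""
    return [
        block.text
        for area in page.areas
        for column in area.columns
        for block in column.blocks
    ]


page = Page(areas=[
    ContentArea("header", [Column([TextBlock("Running title")])]),
    ContentArea("body", [
        Column([TextBlock("Left column, block 1"), TextBlock("Left column, block 2")]),
        Column([TextBlock("Right column, block 1")]),
    ]),
])
```

Calling blocks_in_order(page) walks the header first and then the two body columns from left to right. Whether the order of the areas themselves should carry any meaning is exactly the kind of thing the spec would have to decide.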
I agree that we may change these things somehow in the future. I could envision that on a page there are different (text) content areas, which might be interrupted by a picture or some other special content.
That's the intention of cflow/ocr_linear, I think.
I imagine ocr_linear in the same way as <article> in HTML, where everything inside (maybe excluding some floats) has a reading order, but the ocr_linear elements themselves may not have a canonical reading order among each other. The property cflow I don't understand.
To answer that question, we need to understand ocr_carea ...