kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Is "ocr_carea" obligatory for representing text blocks? #107

Closed wrznr closed 5 years ago

wrznr commented 5 years ago

WRT https://github.com/filak/hOCR-to-ALTO/issues/10

Does the sequence of ocr_* elements represent a strict hierarchy?

<body>
  <div class="ocr_page">
    <div class="ocr_carea">
      <p class="ocr_par">
        <span class="ocr_line">
          <span class="ocrx_word">
            Yield
          </span>
        </span>
      </p>
    </div>
  </div>
</body>

I.e., does every level of the hierarchy have to be present, or are some of them "omittable"?

abbyy2hocr.xsl implements the latter while alto2hocr.xsl implements the former (both as included in https://github.com/UB-Mannheim/ocr-fileformat).

kba commented 5 years ago

No, you cannot assume a strict hierarchy.

ocr_page is required.

ocr_line, while not required by the spec, probably should be. You can assume it is there.

ocr_carea should be used for print space / columns, but is not used consistently.

ocr_par is not used consistently either.

If ocrx_word elements are used, they are within ocr_line. Not by definition, but in practice.

I wish I could give you a more stringent answer, but the reality is a lot of documents produced over a long time by implementations based on an underdefined specification.
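
For converters that have to cope with this, one option is to never rely on a fixed parent/child chain and instead look only at which hOCR classes are open at a given point. The following is a minimal sketch of that idea, not part of the spec or of any of the stylesheets mentioned above. It assumes well-formed XHTML-style input (every non-void element is closed); the names `HocrLineCollector` and `extract_lines` are hypothetical and chosen for illustration only.

```python
from html.parser import HTMLParser

# Elements that never get an end tag; they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class HocrLineCollector(HTMLParser):
    """Collect the text of each ocr_line, regardless of what wraps it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # hOCR class (or None) of every currently open element
        self.lines = []   # one list of text chunks per ocr_line seen so far
        self.pages = 0    # number of ocr_page elements seen

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Keep the first class that looks like an hOCR element class, if any.
        hocr = next((c for c in classes if c.startswith(("ocr_", "ocrx_"))), None)
        self.stack.append(hocr)
        if hocr == "ocr_page":
            self.pages += 1
        elif hocr == "ocr_line":
            self.lines.append([])

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Text counts if any ancestor is an ocr_line; ocr_carea, ocr_par
        # and ocrx_word may or may not sit in between.
        if text and "ocr_line" in self.stack:
            self.lines[-1].append(text)


def extract_lines(hocr_markup):
    """Return the text of every ocr_line; only ocr_page is treated as required."""
    parser = HocrLineCollector()
    parser.feed(hocr_markup)
    parser.close()
    if parser.pages == 0:
        raise ValueError("no ocr_page element found")
    return [" ".join(chunks) for chunks in parser.lines]
```

With this approach, the fully nested example from the question and a variant that places ocr_line directly inside ocr_page yield the same list of lines, because ocr_carea, ocr_par and ocrx_word are treated as optional wrappers rather than required levels.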