kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Under which conditions may bounding boxes overlap? #23

Open kba opened 8 years ago

kba commented 8 years ago

In many cases that should not happen (words, lines). For floats, it's inevitable.

hocr-check actually checks for this, but it is not spelled out anywhere AFAIK.

zuphilip commented 8 years ago

The bounding boxes of lines can overlap, and I think this happens easily if the picture is a little skewed and/or letters descend far below the baseline, etc. See for example

kba commented 8 years ago

True, in hocr-check, the assumption is that the overlap should not be more than 20% of the area for lines, paragraphs and careas.
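For illustration, here is a minimal sketch of how such an overlap fraction could be computed. It assumes boxes are `(x0, y0, x1, y1)` tuples as in the hOCR `bbox` property, and it assumes the fraction is taken relative to the smaller of the two boxes; whether hocr-check normalizes the same way is not confirmed here. The function names are hypothetical.

```python
def area(box):
    """Area of an (x0, y0, x1, y1) box; empty or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection(a, b):
    """Intersection rectangle of two boxes (may be inverted if they are disjoint)."""
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def overlap_fraction(a, b):
    """Shared area divided by the area of the smaller box (assumed normalization)."""
    smaller = min(area(a), area(b))
    if smaller == 0:
        return 0.0
    return area(intersection(a, b)) / smaller


# Two hypothetical line boxes whose vertical extents overlap by 5 pixels.
line_a = (0, 0, 100, 20)
line_b = (0, 15, 100, 35)
fraction = overlap_fraction(line_a, line_b)
exceeds_threshold = fraction > 0.2  # 0.2 is the 20% figure mentioned above
```

Normalizing by the smaller box makes a small box that sits mostly inside a large one stand out; dividing by the union would be an equally defensible choice.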

zuphilip commented 8 years ago

I think it is interesting to calculate this fraction of overlap, but I don't think it fits into the check, because any number (e.g. 20%) is arbitrary and not really against the specs. Isn't the same routine part of hocr-eval-geom and/or hocr-eval?

zuphilip commented 8 years ago

There, significant_overlap is 10%.

kba commented 8 years ago

Whatever the numbers actually are, it would be helpful to state that for typesetting elements, element containment should imply bounding box containment in most cases; that other elements (like ocr_chapter) cannot sensibly have a bounding box; and that floating elements can overlap containers but should not (or should they?) overlap other contained elements. For example, an advertisement in a newspaper can cross two ocr_carea elements but should not cross an ocr_line in those careas. Stuff like that.
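To make "element containment should imply bounding box containment" concrete, a check could look roughly like the sketch below. It assumes the `bbox x0 y0 x1 y1` property is read from an element's `title` attribute; the helper names and the `tolerance` parameter are hypothetical and not part of hocr-check or the spec.

```python
import re

# Matches the hOCR "bbox x0 y0 x1 y1" property inside a title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title string, or None if no bbox is present."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(value) for value in match.groups())


def bbox_contains(outer, inner, tolerance=0):
    """True if inner lies within outer, allowing `tolerance` pixels of slack per side."""
    return (
        inner[0] >= outer[0] - tolerance
        and inner[1] >= outer[1] - tolerance
        and inner[2] <= outer[2] + tolerance
        and inner[3] <= outer[3] + tolerance
    )


# Hypothetical ocr_carea and two ocr_line children.
carea = parse_bbox("bbox 10 10 500 300")
line_inside = parse_bbox("bbox 12 20 480 45; baseline 0 -5")
line_sticking_out = parse_bbox("bbox 5 50 480 75")

ok = bbox_contains(carea, line_inside)
strict = bbox_contains(carea, line_sticking_out)
lenient = bbox_contains(carea, line_sticking_out, tolerance=5)
```

A small tolerance covers the "in most cases" part: skewed scans can push a child box a few pixels outside its container without the segmentation being wrong.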

> I think it is interesting to calculate this fraction of overlap, but I don't think it fits into the check,

Why not? It need not be a fatal error, but it's a helpful measure for evaluating page segmentation.