Open kba opened 8 years ago
The bounding boxes of lines can overlap, and I think this happens easily if the picture is a little skewed and/or letters descend far below the baseline, etc. See for example
True, in hocr-check, the assumption is that the overlap should not be more than 20% of the area for lines, paragraphs and careas.
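To make the "fraction of overlap" concrete, here is a minimal sketch of how such a measure could be computed for a set of line boxes. It is not the code from hocr-check: the function names are made up for illustration, boxes are assumed to be `(x0, y0, x1, y1)` tuples, and the choice to divide by the area of the smaller box is an assumption (dividing by the larger box or by the union would be equally defensible).

```python
from itertools import combinations


def area(box):
    """Area of an (x0, y0, x1, y1) box; degenerate boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def intersection_area(a, b):
    """Area shared by two boxes (0 if they do not intersect)."""
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    return area((x0, y0, x1, y1))


def overlap_fraction(a, b):
    """Shared area relative to the smaller box (an assumed convention)."""
    smaller = min(area(a), area(b))
    if smaller == 0:
        return 0.0
    return intersection_area(a, b) / smaller


def significant_overlaps(boxes, threshold=0.2):
    """Return (i, j, fraction) for every pair overlapping more than threshold."""
    result = []
    for (i, a), (j, b) in combinations(enumerate(boxes), 2):
        fraction = overlap_fraction(a, b)
        if fraction > threshold:
            result.append((i, j, fraction))
    return result


# Hypothetical line boxes: the first two share a 5-pixel-high strip.
lines = [(0, 0, 100, 20), (0, 15, 100, 35), (0, 40, 100, 60)]
report = significant_overlaps(lines)
```

With the 20% threshold from above, only the first pair of lines would be reported; a checker could print this as a warning rather than fail.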
I think it is interesting to calculate this fraction of overlap, but I don't think it fits into the check, because any number (e.g. 20%) is arbitrary and not really against the specs. Isn't the same routine part of hocr-eval-geom and/or hocr-eval? There, significant_overlap is 10%.
Whatever the numbers actually are, it would be helpful to indicate that, for typesetting elements, element containment should imply bounding box containment in most cases; that other elements (like ocr_chapter) cannot sensibly have a bounding box; and that floating elements can overlap containers but should not (or should they?) overlap other contained elements. For example, an advertisement in a newspaper can cross two ocr_carea elements but should not cross the ocr_line elements in those careas, stuff like that.
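The rule "element containment implies bounding box containment" could be sketched roughly as follows. This is only an illustration, not existing hocr-tools code: the helper names and the `tolerance` parameter are hypothetical. It relies only on the hOCR convention that the bounding box is stored in the `title` attribute as `bbox x0 y0 x1 y1`. Elements without a bbox (for example logical ones such as ocr_chapter) are skipped, and floats would have to be filtered out by the caller.

```python
import re

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title attribute, or None if absent."""
    match = BBOX_RE.search(title)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def contains(outer, inner, tolerance=0):
    """True if inner lies inside outer, allowing `tolerance` pixels of slack."""
    return (
        inner[0] >= outer[0] - tolerance
        and inner[1] >= outer[1] - tolerance
        and inner[2] <= outer[2] + tolerance
        and inner[3] <= outer[3] + tolerance
    )


def containment_violations(parent_title, child_titles, tolerance=0):
    """Indexes of children whose bbox sticks out of the parent's bbox.

    Parents or children without a bbox are ignored rather than flagged.
    """
    parent = parse_bbox(parent_title)
    if parent is None:
        return []
    violations = []
    for index, title in enumerate(child_titles):
        child = parse_bbox(title)
        if child is not None and not contains(parent, child, tolerance):
            violations.append(index)
    return violations


# Hypothetical ocr_carea with three children; the second one sticks out
# by 5 pixels on the left, the third has no bbox at all.
carea = "bbox 10 10 200 100"
children = ["bbox 10 10 200 30; baseline 0 -3", "bbox 5 40 200 60", ""]
strict = containment_violations(carea, children)
lenient = containment_violations(carea, children, tolerance=5)
```

A small tolerance is one way to keep slightly skewed scans from producing noise, which is why it is a parameter here rather than a fixed number.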
> I think it is interesting to calculate this fraction of overlap, but I don't think it fits into the check,
Why not? It need not be a fatal error, but it's a helpful measure for evaluating page segmentation.
In many cases that should not happen (words, lines). For floats, it's inevitable.
hocr-check actually checks for this, but it is not spelled out anywhere AFAIK.