kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

How to treat bounding boxes that contradict reading order? #25

Open kba opened 8 years ago

kba commented 8 years ago

Because of the way it segments lines, ocropus sometimes inserts an ocr_line element at the wrong position in the flow of elements, i.e. in the middle of another paragraph. The bounding box makes it clear that the line does not belong at that position.

Can we define some rules for how bounding boxes and reading order depend on each other, so that such obvious(?) mistakes can be caught while complex layouts are still allowed?
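To make the question concrete, here is a rough sketch of one possible rule, not a proposal for the spec. It assumes the `bbox x0 y0 x1 y1` property from the `title` attribute with the origin at the top-left, and it flags a line whose box lies entirely above or below the vertical span of its two neighbours in document order. The names `parse_bbox` and `suspicious_lines` are made up for this example, and the rule deliberately skips cases where the neighbours themselves jump upwards (e.g. a column break), so it will miss many real errors.

```python
import re

# Matches the hOCR bbox property inside a title attribute,
# e.g. title="bbox 100 100 500 130; baseline 0 -3".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    match = BBOX_RE.search(title)
    return tuple(int(g) for g in match.groups()) if match else None


def suspicious_lines(bboxes):
    """Return indices of boxes lying vertically outside the span of both neighbours.

    `bboxes` is a list of (x0, y0, x1, y1) tuples in document order.
    This is a hypothetical heuristic, not something defined by hOCR.
    """
    flagged = []
    for i in range(1, len(bboxes) - 1):
        prev, cur, nxt = bboxes[i - 1], bboxes[i], bboxes[i + 1]
        if nxt[1] < prev[1]:
            # The neighbours themselves move upwards (possibly a new column),
            # so this simple rule cannot judge the line between them.
            continue
        entirely_above = cur[3] < prev[1]
        entirely_below = cur[1] > nxt[3]
        if entirely_above or entirely_below:
            flagged.append(i)
    return flagged


titles = [
    "bbox 100 100 500 130",
    "bbox 100 900 500 930; baseline 0 -3",  # far below its neighbours
    "bbox 100 140 500 170",
]
boxes = [parse_bbox(t) for t in titles]
print(suspicious_lines(boxes))  # flags index 1
```

A check like this could only ever produce warnings: a legitimate complex layout (footnotes, marginalia, floats) can produce the same pattern, which is exactly the difficulty raised above.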

Related to #23