kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

cuts and x_bboxes #79

Open kba opened 7 years ago

kba commented 7 years ago

Why have separate mechanisms for both relative (cuts) and absolute (x_bboxes) positioning of code points within a word/cinfo?

Why not a bboxes attribute without the engine-specific prefix?

Related to #69

kba commented 7 years ago

https://github.com/kba/hocr-spec/issues/17#issuecomment-256117486

The "cuts" attribute is for representing cuts. It exists as a compact, pixel-accurate representation of a character segmentation. Cuts are not bounding boxes, and, in fact, are not all that useful unless you have the original page image available.

kba commented 7 years ago

https://github.com/kba/hocr-spec/issues/17#issuecomment-256131662

Cuts are for pixel-accurate segmentation in the presence of kerning, something bounding boxes can't represent.

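def decode_cuts(s, x=0, ymax=None):
    """Decode a cuts string into a list of paths, each a list of (x, y) points."""
    cuts = []
    for path in s.split():
        # Each cut is a comma-separated list of integer deltas.
        turns = [int(p) for p in path.split(",")]
        # The first delta is the x offset from the start of the previous cut.
        x += turns[0]
        pos = [x, 0]
        cut = [tuple(pos)]
        # The remaining deltas alternate between y moves and x moves.
        for i, d in enumerate(turns[1:]):
            pos[(i+1)%2] += d
            cut.append(tuple(pos))
        # Optionally extend the cut straight to the far edge at ymax.
        if ymax is not None:
            pos[1] = ymax
            cut.append(tuple(pos))
        cuts.append(cut)
    return cuts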

To convert these to tight bounding boxes, you need the original binary image (it's another 10-20 lines to do that conversion).
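A rough sketch of what such a conversion might look like, not the original implementation. It assumes the binary image is a list of rows of 0/1 values cropped to the line's bbox, that row 0 is the row where each cut starts, and that cuts never double back vertically. The names `cut_columns` and `tight_bboxes` are made up for illustration.

```python
def cut_columns(cuts, height):
    """Decode a cuts string into one list per cut: the cut's x at each row."""
    columns = []
    x = 0
    for path in cuts.split():
        turns = [int(p) for p in path.split(",")]
        x += turns[0]                  # start x, relative to the previous cut's start
        cur_x = x
        col = []
        for i, d in enumerate(turns[1:]):
            if i % 2 == 0:             # vertical move: the next d rows sit at cur_x
                if d < 0:
                    raise ValueError("this sketch assumes monotonic cuts")
                col.extend([cur_x] * d)
            else:                      # horizontal move: shift the cut sideways
                cur_x += d
        # Continue straight to the far edge, then clip to the image height.
        col.extend([cur_x] * (height - len(col)))
        columns.append(col[:height])
    return columns


def tight_bboxes(image, cuts):
    """Return one (x0, y0, x1, y1) box per segment, or None if it has no ink."""
    height, width = len(image), len(image[0])
    # Segment boundaries: left edge, every cut, right edge.
    bounds = [[0] * height] + cut_columns(cuts, height) + [[width] * height]
    boxes = []
    for left, right in zip(bounds, bounds[1:]):
        xs, ys = [], []
        for y in range(height):
            for x in range(max(left[y], 0), min(right[y], width)):
                if image[y][x]:
                    xs.append(x)
                    ys.append(y)
        boxes.append((min(xs), min(ys), max(xs) + 1, max(ys) + 1) if xs else None)
    return boxes


# Two glyphs where the second overhangs the first in rows 2-3 (kerning-like).
image = [
    [1, 1, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1],
]
print(tight_bboxes(image, "3,2,-1"))
```

In this toy image the stepped cut separates the two glyphs exactly, while the resulting boxes overlap in the column x = 2, which is the situation plain bounding boxes cannot express.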

kba commented 7 years ago

@mttagessen in https://github.com/kba/hocr-spec/issues/17#issuecomment-256213506

My point about x_cuts, x_confs, x_* still stands even if you cut it down to a single engine and re-encode existing output. Without access to the particular model, it is still impossible to align confidences/bboxes with code points, even when you can make sure that nobody "tampered" with the file by renormalizing it to another Unicode normalization form. The fundamental reason is that there is no mapping between Unicode code points and recognition units. Formats like AbbyyXML actually allow this alignment by being designed bottom-up (glyph-first) instead of top-down like hOCR. I use "glyph" to mean the lowest level of label an engine may produce.
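A small illustration of the alignment problem, using only Python's standard `unicodedata` module. The confidence values and the helper `is_aligned` are hypothetical; the point is only that the number of code points changes under normalization, so a per-code-point list no longer lines up.

```python
import unicodedata

# Hypothetical engine output: one confidence per code point of the NFC text.
text_nfc = "caf\u00e9"                  # 'é' as a single code point (U+00E9)
confs = [0.99, 0.98, 0.97, 0.60]

# Renormalizing to NFD splits 'é' into 'e' + U+0301, so the counts diverge.
text_nfd = unicodedata.normalize("NFD", text_nfc)


def is_aligned(text, confidences):
    """True when there is exactly one confidence per code point."""
    return len(text) == len(confidences)


print(is_aligned(text_nfc, confs))      # True
print(is_aligned(text_nfd, confs))      # False

# One recognition unit can also map to several code points: the ligature
# glyph U+FB01 becomes the two code points "fi" under NFKC.
ligature = "\ufb01"
print(len(ligature), len(unicodedata.normalize("NFKC", ligature)))  # 1 2
```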

While per-character bounding boxes are indeed rather useless (and techniques like CTC layers may or may not produce them randomly), quite a few people seem keen on confidences for postprocessing.

kba commented 7 years ago

Kerning:

[image]