kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

Terminology: Glyphs, characters, codepoints #87

Open kba opened 7 years ago

kba commented 7 years ago

https://github.com/kba/hocr-spec/issues/17#issuecomment-256117486

None of these are "per-glyph" because "glyph" isn't a uniquely defined concept independent of font. As far as hOCR is concerned, you need to output information per codepoint. There is no single correct way of doing that: it depends on the script, the encoding, and the OCR engine.

For bounding boxes (or cuts) on accented Western scripts, my recommendation would be: (1) treat the whole accented letter as a single glyph, (2) use normalized Unicode with composed characters (i.e. NFC), (3) if a single glyph corresponds to multiple codepoints, output a bounding box for the first codepoint and output empty bounding boxes for the remaining codepoints.
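To make the recommendation concrete, here is a rough Python sketch of how an engine might map glyph-level boxes onto codepoints. The function name `boxes_per_codepoint`, the `(text, bbox)` input shape, and the use of `None` to stand for an "empty" bounding box are all assumptions for illustration. None of them are defined by hOCR, and how an empty box should actually be serialized is exactly the kind of thing the spec would need to pin down.

```python
import unicodedata

def boxes_per_codepoint(glyphs):
    """Map glyph-level boxes to one entry per codepoint.

    glyphs: list of (text, bbox) pairs, one pair per visual glyph
            (hypothetical input shape, not an hOCR structure).
    Returns a list of (codepoint, bbox_or_None) pairs.
    """
    out = []
    for text, bbox in glyphs:
        # Step (2): normalize to composed form (NFC).
        composed = unicodedata.normalize("NFC", text)
        for i, ch in enumerate(composed):
            # Step (3): the first codepoint carries the glyph's box;
            # any remaining codepoints get an "empty" box (None here).
            out.append((ch, bbox if i == 0 else None))
    return out

glyphs = [
    ("e\u0301", (10, 20, 30, 60)),  # e + combining acute: composes to U+00E9
    ("q\u0303", (32, 20, 52, 70)),  # q + combining tilde: no precomposed form
]
result = boxes_per_codepoint(glyphs)
```

With this input, the accented "e" collapses to a single codepoint with one box, while "q" plus the combining tilde stays as two codepoints: the "q" gets the box and the combining mark gets the empty one.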

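We should define these terms (glyph, character, codepoint) in the spec and replace "character" with "codepoint" where that is what is meant (s/character/codepoint/).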