Terminology: Glyphs, characters, codepoints

https://github.com/kba/hocr-spec/issues/17#issuecomment-256117486

None of these are "per-glyph" because "glyph" isn't a uniquely defined concept independent of font. As far as hOCR is concerned, you need to output information per codepoint. There is no single correct way of doing that: it depends on the script, the encoding, and the OCR engine.

For bounding boxes (or cuts) on accented Western scripts, my recommendation would be: (1) treat the whole accented letter as a single glyph, (2) use Unicode normalized to its composed form (NFC), (3) if a single glyph still corresponds to multiple codepoints, output a bounding box for the first codepoint and empty bounding boxes for the remaining codepoints.
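A rough sketch of that recommendation in Python, to make the three steps concrete. The function name `per_codepoint_boxes`, the `(x0, y0, x1, y1)` tuple layout, and the `EMPTY_BOX` placeholder are all assumptions for illustration; none of them come from the hOCR spec. It also assumes a glyph is a base codepoint plus any combining marks that follow it, which is only a reasonable approximation for accented Western scripts.

```python
import unicodedata

# Assumption: how an "empty" bounding box is represented is not defined
# here; (0, 0, 0, 0) is just a placeholder for this sketch.
EMPTY_BOX = (0, 0, 0, 0)

def per_codepoint_boxes(text, glyph_boxes):
    """Pair every codepoint of `text` with a bounding box.

    `glyph_boxes` holds one (x0, y0, x1, y1) box per glyph, in reading
    order, where a glyph is a base codepoint plus its combining marks.
    """
    # Step 2: prefer composed characters, so "e" + U+0301 becomes U+00E9.
    text = unicodedata.normalize("NFC", text)
    result = []
    glyph_index = -1
    for ch in text:
        if unicodedata.combining(ch) and glyph_index >= 0:
            # Step 3: a leftover combining mark belongs to the previous
            # glyph, so it gets an empty box.
            result.append((ch, EMPTY_BOX))
        else:
            # Steps 1 and 3: the first codepoint of a glyph carries the box.
            glyph_index += 1
            result.append((ch, glyph_boxes[glyph_index]))
    return result
```

For example, `"e\u0301"` composes to a single codepoint and keeps its box, while `"q\u0303"` has no composed form, so the `q` carries the box and the combining tilde gets the empty one.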

We should define these terms in the spec and replace "character" with "codepoint" throughout (s/character/codepoint/).

Terminology: Glyphs, characters, codepoints #87