None of these are "per-glyph" because "glyph" isn't a uniquely defined
concept independent of font. As far as hOCR is concerned, you need to
output information per codepoint. There is no single correct way of doing
that: it depends on the script, the encoding, and the OCR engine.
For bounding boxes (or cuts) on accented Western scripts, my recommendation
would be: (1) view the whole accented letter as a single glyph, (2) use
normalized Unicode with composed characters (e.g. NFC), (3) if a single glyph
corresponds to multiple codepoints, output a bounding box for the first
codepoint and output empty bounding boxes for the remaining codepoints.
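
To make the three-step recommendation concrete, here is a rough Python sketch of how an engine might do it. The function name `boxes_per_codepoint`, the input layout, and the use of `(0, 0, 0, 0)` as the "empty" box are my own assumptions for illustration; hOCR does not prescribe any of them.

```python
import unicodedata

# Assumption: an "empty" bounding box is written as an all-zero rectangle.
# The hOCR spec does not define this value; pick whatever your output uses.
EMPTY_BOX = (0, 0, 0, 0)


def boxes_per_codepoint(glyphs):
    """Expand per-glyph boxes into per-codepoint boxes.

    `glyphs` is a list of (text, (x0, y0, x1, y1)) pairs, one per
    recognized glyph, where the whole accented letter counts as one glyph.
    """
    result = []
    for text, box in glyphs:
        # Step 2: normalize to composed form, so that e.g. "e" followed by
        # U+0301 becomes the single codepoint U+00E9 where one exists.
        composed = unicodedata.normalize("NFC", text)
        for index, codepoint in enumerate(composed):
            # Step 3: the first codepoint carries the glyph's box,
            # any remaining codepoints get an empty box.
            result.append((codepoint, box if index == 0 else EMPTY_BOX))
    return result


if __name__ == "__main__":
    recognized = [
        ("e\u0301", (10, 5, 20, 30)),  # composes to a single codepoint
        ("q\u0303", (22, 5, 32, 36)),  # no precomposed form, stays as two
    ]
    for codepoint, box in boxes_per_codepoint(recognized):
        print(f"U+{ord(codepoint):04X}", box)
```

The second input glyph shows the case from (3): "q" with a combining tilde has no precomposed form, so it remains two codepoints after normalization and the combining mark gets the empty box.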
We should define it and s/character/codepoint in the spec.
https://github.com/kba/hocr-spec/issues/17#issuecomment-256117486