How to handle hyphens?

The spec has a hardbreak property and references soft hyphens, but it gives no explicit recommendation on how hyphens should be handled.

For example, ALTO has a <HYP CONTENT="-"/> element for hyphens.

I see two options:

A: Encode the hyphen as a minus sign (-) that is part of the word it hyphenates:

<span class="ocr_line">
  <span class="ocrx_word">what-</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

B: Encode the hyphen as a soft hyphen (&shy;) or as an inline span:

<span class="ocr_line">
  <span class="ocrx_word">what&shy;</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

Personally, I prefer option A because that is more in line with the pragmatic nature of hOCR and makes the hOCR output more uniform for post-processing tools.
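To make the post-processing point concrete, here is a minimal sketch of how a tool might rejoin a word that was split across two lines. It assumes the word texts have already been extracted from the ocrx_word spans into one list of strings per ocr_line; the function name join_hyphenated is made up for illustration and is not part of any hOCR tooling.

```python
SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to


def join_hyphenated(lines: list[list[str]]) -> list[str]:
    """Merge a line-final hyphenated word with the first word of the next line.

    `lines` holds one list of word strings per text line (hypothetical input
    shape, e.g. collected from the ocrx_word spans of each ocr_line).
    """
    words = []
    carry = ""  # first half of a word that was broken at a line end
    for line in lines:
        for i, word in enumerate(line):
            if carry:
                word = carry + word
                carry = ""
            is_last = i == len(line) - 1
            # Option A ends the word with "-", option B with a soft hyphen.
            if is_last and word.endswith(("-", SOFT_HYPHEN)):
                carry = word[:-1]
            else:
                words.append(word)
    if carry:
        # A dangling first half at the very end is kept without its hyphen.
        words.append(carry)
    return words


print(join_hyphenated([["what-"], ["ever"]]))       # ['whatever']
print(join_hyphenated([["what\u00ad"], ["ever"]]))  # ['whatever']
```

Both encodings can be handled by the same check. Note that with option A this sketch would also turn a line-final "well-" followed by "known" into "wellknown", because a plain minus sign does not say whether the hyphen belongs to the word or only marks the line break.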

On the other hand, when converting to hOCR from ALTO, the information that a minus sign is actually a hyphen will be lost.
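A rough sketch of that conversion step, to show where the information goes missing. It assumes a simplified ALTO TextLine without the XML namespace that real ALTO files declare, and the function name alto_line_to_hocr and its soft_hyphen flag are hypothetical, not taken from any existing converter.

```python
import xml.etree.ElementTree as ET
from html import escape

# Simplified ALTO fragment; real files put these elements in a namespace.
ALTO_LINE = '<TextLine><String CONTENT="what"/><HYP CONTENT="-"/></TextLine>'


def alto_line_to_hocr(alto_xml: str, soft_hyphen: bool = False) -> str:
    """Convert one namespace-free ALTO TextLine into an hOCR ocr_line span."""
    line = ET.fromstring(alto_xml)
    words = []
    for child in line:
        if child.tag == "String":
            words.append(child.get("CONTENT", ""))
        elif child.tag == "HYP" and words:
            # Option A appends the HYP content, option B a soft hyphen (U+00AD).
            words[-1] += "\u00ad" if soft_hyphen else child.get("CONTENT", "-")
    spans = "".join(f'<span class="ocrx_word">{escape(w)}</span>' for w in words)
    return f'<span class="ocr_line">{spans}</span>'


print(alto_line_to_hocr(ALTO_LINE))
# <span class="ocr_line"><span class="ocrx_word">what-</span></span>
```

With the default (option A) the output carries no trace that the trailing "-" came from a HYP element. Passing soft_hyphen=True keeps that distinction, at the cost of an invisible character in the word text.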

How about non-hyphen dashes? Should the spec offer guidance on how to encode these?

kba / hocr-spec

How to handle hyphens? #7