kba / hocr-spec

The hOCR Embedded OCR Workflow and Output Format
http://kba.github.io/hocr-spec/1.2/

How to handle hyphens? #7

Open kba opened 8 years ago

kba commented 8 years ago

The spec has a hardbreak property and mentions soft hyphens, but it makes no explicit recommendation on how hyphens should be handled.

For example, ALTO has a <HYP CONTENT="-"/> element for hyphens.

I see two options:

A: Encode the hyphen as a minus sign (-) that is part of the word it hyphenates:

<span class="ocr_line">
  <span class="ocrx_word">what-</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

B: Encode the hyphen as &shy; or as an inline span:

<span class="ocr_line">
  <span class="ocrx_word">what&shy;</span>
</span>
<span class="ocr_line">
  <span class="ocrx_word">ever</span>
</span>

Personally, I prefer option A because that is more in line with the pragmatic nature of hOCR and makes the hOCR output more uniform for post-processing tools.
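
To make the post-processing angle concrete, here is a minimal sketch of how a tool might rejoin words broken across lines. It is not part of the spec or of any existing hOCR tool: the function name `join_hyphenated` is hypothetical, and it assumes the text of the ocrx_word spans has already been extracted into one list of strings per ocr_line, with &shy; decoded to U+00AD.

```python
SOFT_HYPHEN = "\u00ad"  # the character that &shy; decodes to


def join_hyphenated(lines, markers=("-", SOFT_HYPHEN)):
    """Rejoin words broken at line ends (hypothetical helper).

    `lines` is a list of lines, each a list of word strings.
    A line-final word ending in one of `markers` is merged with
    the first word of the next non-empty line.
    """
    out = []
    pending = None  # line-final word still carrying its hyphen marker
    for line in lines:
        for i, word in enumerate(line):
            if i == 0 and pending is not None:
                # Drop the one-character marker and glue the halves together.
                word = pending[:-1] + word
                pending = None
            is_last = i == len(line) - 1
            if is_last and len(word) > 1 and word.endswith(markers):
                pending = word
            else:
                out.append(word)
    if pending is not None:
        # Nothing follows the last line, so keep the word unchanged.
        out.append(pending)
    return out
```

With option A the same code path handles both encodings, but a genuine hyphen at a line end is indistinguishable from a line-break hyphen: `join_hyphenated([["a", "well-"], ["known", "fact"]])` yields `wellknown`. Passing `markers=(SOFT_HYPHEN,)` avoids that, provided the producer followed option B.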

On the other hand, when converting to hOCR from ALTO, the information that a minus sign is actually a hyphen will be lost.
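
A rough sketch of that conversion, to show where the information goes missing. Assumptions: the input is a single ALTO TextLine fragment whose children are String and HYP elements with CONTENT attributes; the function name `alto_line_to_hocr` is hypothetical; and bounding boxes and other hOCR properties are left out for brevity.

```python
import xml.etree.ElementTree as ET
from html import escape


def alto_line_to_hocr(text_line_xml, hyphen="-"):
    """Convert one ALTO TextLine fragment to an hOCR line (hypothetical helper).

    `hyphen` is the markup appended to the preceding word for each HYP
    element: "-" for option A, "&shy;" for option B. It is inserted
    unescaped so that an entity can be passed in.
    """
    line = ET.fromstring(text_line_xml)
    words = []
    for el in line:
        tag = el.tag.rsplit("}", 1)[-1]  # ignore any XML namespace prefix
        if tag == "String":
            words.append(escape(el.get("CONTENT", "")))
        elif tag == "HYP" and words:
            # Option A: from here on the hyphen is just another character.
            words[-1] += hyphen
    spans = "".join(f'<span class="ocrx_word">{w}</span>' for w in words)
    return f'<span class="ocr_line">{spans}</span>'
```

Called with the default `hyphen="-"`, the output for a line ending in HYP is identical to the output for a String whose CONTENT already ends in "-", which is exactly the loss described above. Calling it with `hyphen="&shy;"` keeps the two cases apart.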

How about non-hyphen dashes? Should the spec offer guidance on how to encode these?

not-implemented commented 7 years ago

IMHO we should differentiate between a minus sign and a hyphen in hOCR. In most cases the distinction can be made automatically during OCR processing, but it can also be the result of proofreading the hOCR, because not every case where a hyphen is actually a minus sign can be detected automatically.

"&shy;" is not bad in my opinion. What did you have in mind for the inline span instead? - or something like that?