Open kba opened 8 years ago
IMHO we should differentiate between a minus sign and a hyphen in hOCR. In most cases the distinction can be made automatically during OCR processing, but it can also be the result of proofreading the hOCR, because not every case where a hyphen is actually a minus sign can be detected automatically.
"\" is not bad in my opinion. What did you have in mind instead? `-` or something like that?
While there is a `hardbreak` property and the spec refers to soft hyphens, it gives no explicit recommendation on how hyphens should be handled. For example, ALTO has a `<HYP CONTENT="-"/>` element for hyphens.

I see two options:

A: Encode the hyphen as a minus sign `-` that is part of the word it hyphenates;
B: Encode the hyphen as a soft hyphen (`&shy;`) or an inline span.

Personally, I prefer option A because that is more in line with the pragmatic nature of hOCR and makes the hOCR output more uniform for post-processing tools.
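To make the two options concrete, here is a minimal Python sketch of what each would emit for a word fragment at a line end. The helper names `encode_option_a` and `encode_option_b` are made up for illustration, and the output omits `title` attributes (bounding boxes etc.) for brevity; only the `ocrx_word` class comes from hOCR itself.

```python
from html import escape


def encode_option_a(fragment: str) -> str:
    """Option A: the hyphen is a plain '-' that is part of the word text."""
    return f"<span class='ocrx_word'>{escape(fragment)}-</span>"


def encode_option_b(fragment: str) -> str:
    """Option B: the hyphen is written as a soft hyphen entity (&shy;)."""
    return f"<span class='ocrx_word'>{escape(fragment)}&shy;</span>"


if __name__ == "__main__":
    # Hypothetical example: the first half of a word split across two lines.
    print(encode_option_a("exam"))
    print(encode_option_b("exam"))
```

With option A a post-processing tool only ever sees ordinary text; with option B it has to know about `&shy;` (or about whatever inline span is chosen), but it can tell a line-end hyphen apart from a literal one.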
On the other hand, when converting to hOCR from ALTO, the information that a minus sign is actually a hyphen will be lost.
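The loss can be shown with a small Python sketch. It assumes a heavily simplified ALTO fragment (no namespace, no coordinates), and `line_to_text_option_a` is a hypothetical converter written only for this example, not part of any existing tool.

```python
import xml.etree.ElementTree as ET


def line_to_text_option_a(textline_xml: str) -> str:
    """Flatten a simplified ALTO TextLine to plain text, option A style.

    A HYP element is appended to the preceding word as an ordinary character,
    which is exactly where the "this is a hyphen" information disappears.
    """
    line = ET.fromstring(textline_xml)
    words = []
    for element in line:
        if element.tag == "String":
            words.append(element.get("CONTENT", ""))
        elif element.tag == "HYP":
            hyphen = element.get("CONTENT", "-")
            if words:
                words[-1] += hyphen
            else:
                words.append(hyphen)
        # SP (space) elements are skipped; words are joined with one space.
    return " ".join(words)


# Line ending in an explicit ALTO hyphenation marker.
WITH_HYP = '<TextLine><String CONTENT="an"/><SP/><String CONTENT="exam"/><HYP CONTENT="-"/></TextLine>'
# Line ending in a literal minus sign or dash inside the word content.
WITH_LITERAL = '<TextLine><String CONTENT="an"/><SP/><String CONTENT="exam-"/></TextLine>'

if __name__ == "__main__":
    print(line_to_text_option_a(WITH_HYP))
    print(line_to_text_option_a(WITH_LITERAL))
```

Both inputs flatten to the same text, so a consumer of the option A output cannot recover which of the two ALTO lines it came from.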
How about non-hyphen dashes? Should the spec offer guidance on how to encode these?