Encoding of soft word breaks / hyphenation

proycon commented 5 years ago

FoLiA currently cannot properly encode soft word breaks, i.e. situations where a word is visually broken apart and hyphenated in the original text. At the moment we see FoLiA's <br/> element (as a text markup element inside <t>) used in these situations, often preceded by a hyphen or another symbol (perhaps an artifact of an OCR system):

<t> ... ook genoemd Tos-<br/>kaansche </t>

<t>...verzoek van den werk¬<br/>nemer...</t>

This, however, represents an explicit break and effectively splits a word into two tokens, which is semantically wrong when a soft break is intended and leads to all kinds of problems in further linguistic processing. FoLiA is first and foremost concerned with accurate representation of the text and of its linguistic units; presentational representation comes second. We see this situation deteriorate in practice: sometimes a word even gets split across paragraphs, which is wrong in all cases.
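To make the problem concrete, here is a minimal sketch of what a naive downstream tool might do with the first example above. It assumes the <t> element is parsed without the FoLiA namespace and that <br/> is rendered as a newline; the helper name text_with_breaks is hypothetical and not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET


def text_with_breaks(t_xml: str) -> str:
    """Flatten a <t> element, rendering each <br/> as a newline (assumed behaviour)."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "br":
            parts.append("\n")  # explicit line break
        else:
            parts.append("".join(child.itertext()))  # other markup: keep its text
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


sample = "<t>ook genoemd Tos-<br/>kaansche</t>"
tokens = text_with_breaks(sample).split()  # whitespace tokenisation
print(tokens)  # ['ook', 'genoemd', 'Tos-', 'kaansche']
```

The single word ends up as two tokens, "Tos-" and "kaansche", with the hyphen baked into the text.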

We may want to introduce a new element (<t-hbr/>?) to explicitly encode a hyphenated break (without a preceding hyphen symbol; the hyphen would be implied), which most linguistic processing tools, especially tokenisers, can then simply ignore. Example:

<t>...verzoek van den werk<t-hbr/>nemer...</t>
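A rough sketch of how a tokeniser could ignore such an element follows. It assumes the proposed name <t-hbr/> and a <t> element parsed without the FoLiA namespace; the function flatten is hypothetical and only illustrates the idea, not an actual FoLiA library API.

```python
import xml.etree.ElementTree as ET


def flatten(t_xml: str) -> str:
    """Flatten a <t> element, skipping soft breaks and keeping hard breaks."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "t-hbr":
            pass  # soft break (proposed element): contributes no text
        elif child.tag == "br":
            parts.append("\n")  # explicit break still separates tokens
        else:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text following the child element
    return "".join(parts)


soft = "<t>verzoek van den werk<t-hbr/>nemer</t>"
words = flatten(soft).split()
print(words)  # ['verzoek', 'van', 'den', 'werknemer']
```

Because the soft break adds nothing to the text, the word survives as one token, while the information that a break occurred stays available in the markup.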

Note that this is different from HTML's <wbr> and LaTeX's \hyp, which represent an opportunity for a word break (there is probably a Unicode code point for this as well) rather than the fact that a word break/hyphenation actually occurred. We're not so interested in representing those in FoLiA.

kosloot commented 5 years ago

Well, I would prefer ditching soft hyphens altogether, as they have no 'real meaning'. If you really want to keep ALL the formatting from the original text, then a solution like this is acceptable. It is also preferable to adding symbols like ¬ or <br/>, which become part of the text.

proycon commented 5 years ago

Yes, this is of course only for scenarios where one really wants to encode soft breaks; nobody is required to do so.

proycon commented 5 years ago

This is now implemented (for the upcoming FoLiA 2.0), documentation: https://folia.readthedocs.io/en/latest/hyphenation_annotation.html

proycon / folia

Encoding of soft word breaks / hyphenation #66
