proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Encoding of soft word breaks / hyphenation #66

Closed proycon closed 5 years ago

proycon commented 5 years ago

FoLiA currently cannot properly encode soft word breaks, i.e. situations where a word is visually broken apart and hyphenated in the original text. At the moment we see FoLiA's <br/> element (a text markup element inside <t>) used in these situations, often preceded by a hyphen or another symbol (perhaps an artifact of an OCR system?):

<t> ... ook genoemd Tos-<br/>kaansche </t>
<t>...verzoek van den werk¬<br/>nemer...</t>

This, however, represents an explicit break and effectively splits a word into two tokens, which is semantically wrong when a soft break is intended and leads to all kinds of problems in further linguistic processing. FoLiA is first and foremost concerned with accurate representation of the text and accurate linguistic units; presentation comes second. In practice the situation gets worse: sometimes a word is even split across paragraphs, which is wrong in all cases.
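To make the problem concrete, here is a minimal sketch of a naive tokeniser that treats <br/> as a hard line break. It is not the FoLiA library: `naive_tokens` is a hypothetical helper, and the XML namespace is omitted for brevity.

```python
import xml.etree.ElementTree as ET


def naive_tokens(t_xml):
    """Tokenise the content of a <t> element on whitespace,
    treating every <br/> as an explicit line break.
    Hypothetical helper for illustration only."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "br":
            parts.append("\n")  # explicit break becomes whitespace
        parts.append(child.tail or "")
    return "".join(parts).split()


print(naive_tokens("<t>ook genoemd Tos-<br/>kaansche</t>"))
# ['ook', 'genoemd', 'Tos-', 'kaansche']
```

The single word ends up as two tokens, "Tos-" and "kaansche", and the hyphen has become part of the text.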

We may want to introduce a new element (<t-hbr/>?) to explicitly encode a hyphenated break (without a preceding hyphen symbol; the hyphen would be implied), which most linguistic processing tools, especially tokenisers, can then simply ignore. Example:

<t>...verzoek van den werk<t-hbr/>nemer...</t>
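A rough sketch of how a consumer could ignore the proposed element when extracting text. This assumes the element name <t-hbr/> from the proposal above, leaves out the FoLiA namespace, and uses a hypothetical helper `text_ignoring_soft_breaks`; it is not how any FoLiA library actually does it.

```python
import xml.etree.ElementTree as ET


def text_ignoring_soft_breaks(t_xml):
    """Return the text of a <t> element, dropping soft breaks
    (<t-hbr/>) and keeping hard breaks (<br/>) as newlines.
    Hypothetical helper for illustration only."""
    root = ET.fromstring(t_xml)
    parts = [root.text or ""]
    for child in root:
        if child.tag == "t-hbr":
            pass  # soft break: contributes nothing to the text
        elif child.tag == "br":
            parts.append("\n")  # hard break stays an explicit break
        else:
            parts.append("".join(child.itertext()))  # other markup: keep its text
        parts.append(child.tail or "")
    return "".join(parts)


print(text_ignoring_soft_breaks("<t>verzoek van den werk<t-hbr/>nemer</t>"))
# verzoek van den werknemer
```

Because the hyphen is implied by the element rather than stored as a character, the word comes out whole without any clean-up step.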

Note that this is different from HTML's <wbr> and LaTeX's \hyp, which represent an opportunity for a word break (there is probably also a Unicode code point for this) rather than the fact that a word break/hyphenation actually occurred. We're not so interested in representing those in FoLiA.

Related:

Also tagging @JesseDeDoes and @kdepuydt, as this is especially prevalent in INT material.

kosloot commented 5 years ago

Well, I would prefer ditching soft hyphens altogether, as they have no 'real meaning'. If you really want to keep ALL the formatting from the original text, then a solution like this is acceptable, and preferable to adding symbols like ¬ or <br/>, which become part of the text.

proycon commented 5 years ago

Yes, this is only for scenarios where one really wants to encode soft breaks; nobody is required to do so, of course.

proycon commented 5 years ago

This is now implemented (for the upcoming FoLiA 2.0), documentation: https://folia.readthedocs.io/en/latest/hyphenation_annotation.html