proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

[documentation] newlines and whitespace in FoLiA text content (<t>) #34

Closed proycon closed 6 years ago

proycon commented 6 years ago

This issue documents a fundamental aspect of FoLiA's text content (<t>) that may lead to misunderstanding and requires more extensive documentation. It is especially relevant now that FoLiA v1.5 introduces mandatory text validation, which may identify problems caused by this.

A FoLiA text content block (<t>) is an XML mixed content node. Such a node may consist of both text and elements, the latter being FoLiA text markup elements in this case (t-style, t-gap, br, etc.). In practice it's often just text. When associated with a structural element that is not a word or morpheme, the text content expresses untokenised text. This means that spaces and newlines are significant.

Consider the following snippets:

A:

<s><t>This is a sentence</t></s>

B:

<s><t>This is
a sentence</t></s>

C:

<s><t>This is<br/>a sentence</t></s>

The text of sentence A is not equivalent to that of B or C. The texts of B and C are equivalent to each other, since <br/> expresses a newline.

Special caution is in order when spreading text content over multiple lines, as this usually does not mean what you might assume:

D:

<s>
    <t>This is
         a sentence</t>
</s>

Sentence D is not equivalent to B or C; its text is This is\n\s\s\s\s\s\s\s\s\sa sentence (where \n is a newline and each \s is a single space of indentation).
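
The difference between A, B, C and D can be checked with a generic XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree, not the FoLiA library itself. The helper names t_text and render are hypothetical, the snippets are parsed without the FoLiA namespace for brevity, and rendering <br/> as a newline follows the equivalence of B and C described above.

```python
import xml.etree.ElementTree as ET


def render(node):
    """Concatenate the text inside a node, rendering <br/> as a newline."""
    if node.tag == "br":
        return "\n"
    out = node.text or ""
    for child in node:
        # Text following a child element is stored in that child's tail.
        out += render(child) + (child.tail or "")
    return out


def t_text(snippet):
    """Return the text of the first <t> child of the snippet's root element."""
    return render(ET.fromstring(snippet).find("t"))


A = "<s><t>This is a sentence</t></s>"
B = "<s><t>This is\na sentence</t></s>"
C = "<s><t>This is<br/>a sentence</t></s>"
D = "<s>\n    <t>This is\n         a sentence</t>\n</s>"

text_a, text_b, text_c, text_d = (t_text(x) for x in (A, B, C, D))

print(repr(text_a))
print(repr(text_b))
print(repr(text_c))
print(repr(text_d))
```

The parser hands back every space and newline inside <t> unchanged, so the indentation in D ends up in the sentence text.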

This is in line with XML behaviour (quoting http://usingxml.com/Basics/XmlSpace):

.., if the element is declared as having mixed content, both text and element child nodes, then the XML parser must pass on all the white space found within the element.

It does differ from what people are accustomed to in HTML (hence some of the confusion perhaps), which considers whitespace insignificant far more frequently.

FoLiA v1.5 introduced mandatory text validation (#24), which checks whether text that is specified redundantly is consistent. This may bring to light issues such as those described here. This text validation, however, still proceeds in a more flexible manner, as it is insensitive to multiple spaces/newlines and operates on a normalised form. Explicit text offsets (if used), on the other hand, do not operate on a normalised form and are thus very strict. They are also validated as part of text validation.
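
To illustrate the contrast between the normalised comparison and strict offsets, here is a rough sketch in plain Python. It is not the actual validator: normalise and offset_matches are hypothetical helpers, and collapsing every run of whitespace into one space is only an assumed normalisation. The exact rules FoLiA applies may differ.

```python
def normalise(text):
    """Assumed normalisation: collapse each run of whitespace into one space."""
    return " ".join(text.split())


def offset_matches(parent_text, child_text, offset):
    """Strict check: child_text must occur in parent_text exactly at offset."""
    return parent_text[offset:offset + len(child_text)] == child_text


# Text of sentence D: a newline followed by nine spaces of indentation.
raw = "This is\n" + " " * 9 + "a sentence"
flat = "This is a sentence"

# Under the assumed normalisation both forms compare equal.
same_when_normalised = normalise(raw) == normalise(flat)

# Offsets are counted on the raw text, so they shift with the indentation.
offset_in_flat = flat.index("a sentence")
offset_in_raw = raw.index("a sentence")

print(same_when_normalised)
print(offset_in_flat, offset_in_raw)
print(offset_matches(raw, "a", offset_in_flat))
print(offset_matches(raw, "a", offset_in_raw))
```

An offset computed against a neatly flattened sentence no longer points at the right character once the same text is spread over indented lines, which is why offsets are the stricter of the two checks.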

Note for completeness that this discussion is limited to text content (<t>) and the text markup elements therein. Whitespace and newlines in most other contexts, such as within structural elements, are not significant.

proycon commented 6 years ago

Issue arose from LanguageMachines/ucto#35

proycon commented 6 years ago

This issue is also somewhat related to #12 (CDATA), marking for future reference.

kosloot commented 6 years ago

I agree with this analysis and its conclusions. Still, I think it is unwise and not advisable to use this kind of implied formatting in XML and/or FoLiA. But we cannot force people to behave well in this respect, so we need to do what we can to help them out :)