proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

New problems with leading/trailing whitespace around linebreaks in text content #101

Closed proycon closed 3 years ago

proycon commented 3 years ago

I'm afraid we may have to add another chapter to our whitespace problems; this is the sequel to issue #88 ...

I have a paragraph with the following text:

    <p xml:id="FP-NOTD00223000001.text.r2">
      <t>
        <t-str id="FP-NOTD00223000001.text.r2.r2l1">s</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l2">Jceddeiinte NP</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l3">J:d WnnnN.. WVierden Novembe</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l4">XviC. teeetnegentigh en eijndiger g</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l5">Antantiee</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l6">etirgh</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l7">Jen Mlers</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l8">J: deWinter N.P.</t-str>
      </t>

This is produced by my latest additions to FoLiA-page (PageXML to FoLiA conversion, pagexml-br branch of foliautils). In addition, FoLiA-page generates string annotations, which in turn relate back to the original PageXML:

      <str xml:id="FP-NOTD00223000001.text.r2.r2l1">
        <t offset="0">s</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l1" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l2">
        <t offset="2">Jceddeiinte NP</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l2" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l3">
        <t offset="17">J:d WnnnN.. WVierden Novembe</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l3" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l4">
        <t offset="46">XviC. teeetnegentigh en eijndiger g</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l4" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l5">
        <t offset="82">Antantiee</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l5" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l6">
        <t offset="92">etirgh</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l6" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l7">
        <t offset="99">Jen Mlers</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l7" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l8">
        <t offset="109">J: deWinter N.P.</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l8" type="str"/>
        </relation>
      </str>

The problem is that the offsets don't match the text, because of leading/trailing spaces around the linebreaks. foliavalidator and folialint report the same error:

TEXT VALIDATION ERROR: Text for String, ID FP-NOTD00223000001.text.r2.r2l2, textclass current, has incorrect offset 2 or invalid reference: Reference (ID FP-NOTD00223000001.text.r2, class=current) found but no text match at specified offset (2)! Expected 'Jceddeiinte NP', got '
 Jceddeiinte '

This is the full text the library sees, as produced by both folia2txt and FoLiA-2text. I marked leading/trailing whitespace with an underscore for visibility:

s_
_Jceddeiinte NP_
_J:d WnnnN.. WVierden Novembe_
_XviC. teeetnegentigh en eijndiger g_
_Antantiee_
_etirgh_
_Jen Mlers_
_J: deWinter N.P._

Note the leading whitespace on all but the first line. So where I'd expect s\nJ we get s\s\n\sJ instead (with \s standing for a space). I think this is unexpected behaviour and qualifies as a bug we'd want to fix. The offsets as reported in the FoLiA-page output seem correct to me.
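To make the mismatch concrete, here is a minimal sketch in plain Python. It does not use the FoLiA library; the `mismatches` helper and the two text variables are hypothetical names for illustration. The offsets and strings are copied from the string annotations above, and `observed_text` only approximates the padded output by joining the lines with a space on either side of each newline.

```python
# Offsets and strings copied from the <str> annotations in this issue.
LINES = [
    (0, "s"),
    (2, "Jceddeiinte NP"),
    (17, "J:d WnnnN.. WVierden Novembe"),
    (46, "XviC. teeetnegentigh en eijndiger g"),
    (82, "Antantiee"),
    (92, "etirgh"),
    (99, "Jen Mlers"),
    (109, "J: deWinter N.P."),
]


def mismatches(full_text):
    """Return (offset, expected, found) for each string not found at its offset."""
    bad = []
    for offset, expected in LINES:
        found = full_text[offset:offset + len(expected)]
        if found != expected:
            bad.append((offset, expected, found))
    return bad


# What the offsets assume: lines joined by a bare newline.
expected_text = "\n".join(text for _, text in LINES)

# Approximation of what the libraries currently produce: a space on
# either side of every newline.
observed_text = " \n ".join(text for _, text in LINES)

print(mismatches(expected_text))     # no mismatches
print(mismatches(observed_text)[0])  # same strings as in the validator error
```

With bare newlines every offset lines up, which supports the claim that the FoLiA-page offsets are correct. With the padded text, the slice at offset 2 is the same `'\n Jceddeiinte '` that the validator reports.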

proycon commented 3 years ago

If everything in the text content (<t>) is put on a single line (without spaces or newlines), then everything validates fine.

This also shows that the cause of this issue is the spaces inserted when joining lines, which is behaviour we usually want to have:

  <t-str>foo</t-str>
  <t-str>bar</t-str>

The above should serialize as foo bar, with a space. The libraries do this correctly.

But... if we have an explicit linebreak:

  <t-str>foo</t-str>
  <br/>
  <t-str>bar</t-str>

then the joining spaces no longer make sense: we want foo\nbar and not foo\s\n\sbar. I think this is the core of this issue.
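The desired behaviour can be sketched as follows. This is not the implementation used by either FoLiA library, only an illustration under simplifying assumptions: the markup is un-namespaced as in the snippets above (real FoLiA documents use a namespace), all runs of whitespace collapse to a single space, and `serialize` and `BR_MARK` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

BR_MARK = "\x00"  # placeholder for <br/>; NUL is not matched by \s


def _raw_text(elem):
    """Concatenate all text under elem, marking each <br/> with BR_MARK."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "br":
            parts.append(BR_MARK)
        else:
            parts.append(_raw_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def serialize(t_xml):
    """Serialize a <t> element: join lines with a space, but emit a bare
    newline (no padding) for an explicit <br/>."""
    raw = _raw_text(ET.fromstring(t_xml))
    collapsed = re.sub(r"\s+", " ", raw)             # joining lines -> one space
    unpadded = re.sub(f" ?{BR_MARK} ?", "\n", collapsed)  # drop spaces around breaks
    return unpadded.strip(" ")


joined = serialize("<t>\n  <t-str>foo</t-str>\n  <t-str>bar</t-str>\n</t>")
broken = serialize("<t>\n  <t-str>foo</t-str>\n  <br/>\n  <t-str>bar</t-str>\n</t>")
print(repr(joined))
print(repr(broken))
```

The first call keeps the joining space between the two strings, while the second yields a bare newline because the spaces adjacent to the break marker are dropped before it is replaced.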