New problems with leading/trailing whitespace around linebreaks in text content

I'm afraid we may have to add another chapter to our whitespace problems; this is the sequel to issue #88 ...

I have a paragraph with the following text:

    <p xml:id="FP-NOTD00223000001.text.r2">
      <t>
        <t-str id="FP-NOTD00223000001.text.r2.r2l1">s</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l2">Jceddeiinte NP</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l3">J:d WnnnN.. WVierden Novembe</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l4">XviC. teeetnegentigh en eijndiger g</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l5">Antantiee</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l6">etirgh</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l7">Jen Mlers</t-str>
        <br/>
        <t-str id="FP-NOTD00223000001.text.r2.r2l8">J: deWinter N.P.</t-str>
      </t>

This is produced by my latest additions to FoLiA-page (PageXML to FoLiA conversion, pagexml-br branch of foliautils). In addition, FoLiA-page generates string annotations, which in turn relate back to the original PageXML:

      <str xml:id="FP-NOTD00223000001.text.r2.r2l1">
        <t offset="0">s</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l1" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l2">
        <t offset="2">Jceddeiinte NP</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l2" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l3">
        <t offset="17">J:d WnnnN.. WVierden Novembe</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l3" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l4">
        <t offset="46">XviC. teeetnegentigh en eijndiger g</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l4" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l5">
        <t offset="82">Antantiee</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l5" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l6">
        <t offset="92">etirgh</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l6" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l7">
        <t offset="99">Jen Mlers</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l7" type="str"/>
        </relation>
      </str>
      <str xml:id="FP-NOTD00223000001.text.r2.r2l8">
        <t offset="109">J: deWinter N.P.</t>
        <relation format="text/page+xml" xlink:href="NOTD00223000001.xml" xlink:type="simple">
          <xref id="r2l8" type="str"/>
        </relation>
      </str>

The problem is, the offsets don't match up because of leading/trailing spaces. foliavalidator and folialint report the same:

TEXT VALIDATION ERROR: Text for String, ID FP-NOTD00223000001.text.r2.r2l2, textclass current, has incorrect offset 2 or invalid reference: Reference (ID FP-NOTD00223000001.text.r2, class=current) found but no text match at specified offset (2)! Expected 'Jceddeiinte NP', got '
 Jceddeiinte '

This is the full text the library sees, as produced by both folia2txt and FoLiA-2text. I marked leading/trailing whitespace with underscores for visibility:

s_
_Jceddeiinte NP_
_J:d WnnnN.. WVierden Novembe_
_XviC. teeetnegentigh en eijndiger g_
_Antantiee_
_etirgh_
_Jen Mlers_
_J: deWinter N.P._

Note the initial whitespace for all but the first line. So where I'd expect S\nJ we get S\s\n\sJ instead. I think this is unexpected behaviour and qualifies as a bug we'd want to fix. The offsets as reported in the FoLiA-page output seem correct to me.
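
To make the mismatch concrete, here is a minimal sketch in plain Python that does not use the FoLiA library at all. It copies the line texts and offsets from the snippets above and checks them against two serialisations: lines joined by a bare newline (what the offsets assume) and lines joined by space, newline, space (my reading of the output shown above, so treat that separator as an assumption). The helper names `compute_offsets` and `mismatches` are made up for illustration.

```python
# Line texts copied from the <t-str> elements in this issue.
lines = [
    "s",
    "Jceddeiinte NP",
    "J:d WnnnN.. WVierden Novembe",
    "XviC. teeetnegentigh en eijndiger g",
    "Antantiee",
    "etirgh",
    "Jen Mlers",
    "J: deWinter N.P.",
]
# Offsets copied from the <t offset="..."> elements of the <str> annotations.
declared_offsets = [0, 2, 17, 46, 82, 92, 99, 109]


def compute_offsets(parts, separator):
    """Return the start offset of each part when the parts are joined by separator."""
    offsets, position = [], 0
    for part in parts:
        offsets.append(position)
        position += len(part) + len(separator)
    return offsets


def mismatches(text, parts, offsets):
    """Return the parts that are NOT found in text at their declared offset."""
    return [p for p, o in zip(parts, offsets) if text[o:o + len(p)] != p]


expected_text = "\n".join(lines)    # what the declared offsets assume
observed_text = " \n ".join(lines)  # assumed model of the serialised text

print(compute_offsets(lines, "\n") == declared_offsets)      # True
print(mismatches(expected_text, lines, declared_offsets))    # []
print(mismatches(observed_text, lines, declared_offsets))
print(repr(observed_text[2:16]))
```

Under these assumptions the declared offsets are exactly those of a newline-joined text, and against the space-padded text every line except the first fails to match. The last line prints the slice at offset 2, which corresponds to the "got" value in the validation error above.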

proycon / folia, issue #101