proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Offset problems with "empty" TextMarkup elements #107

Open kosloot opened 1 year ago

kosloot commented 1 year ago

Given this somewhat unusual FoLiA file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="bugxx" generator="libfolia-v1.11" version="2.5">
  <metadata type="native">
    <annotations>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <division-annotation/>
      <paragraph-annotation/>
      <sentence-annotation/>
      <hyphenation-annotation/>
      <string-annotation/>
    </annotations>
  </metadata>
  <text xml:id="bug">
    <div xml:id="bug.div">
      <p xml:id="bug.div.p">
        <s xml:id="bug.div.p.s.2">
          <t>appel<t-hbr>-</t-hbr>taart</t>
          <str xml:id="bug.div.p.s.2.str.1">
            <t offset="0">appel</t>
          </str>
          <str xml:id="bug.div.p.s.2.str.2">
            <t offset="5"><t-hbr>-</t-hbr></t>
          </str>
          <str xml:id="bug.div.p.s.2.str.3">
            <t offset="5">taart</t>
          </str>
        </s>
      </p>
    </div>
  </text>
</FoLiA>

This is accepted by folialint (latest Git version) but rejected by foliavalidator. The latter states:

TEXT VALIDATION ERROR: Text for String, ID bug.div.p.s.2.str.2, textclass current, has incorrect offset 5 or invalid reference: Reference (ID bug.div.p.s.2, class=current) found but no text match at specified offset (5)! Expected '', got 't', full text: 'appeltaart"
(also checked against older rules prior to FoLiA v2.4.1)
VALIDATION ERROR on full parse by library (stage 2/3), in tests/bug52-3.xml
UnresolvableTextContent: Reference (ID bug.div.p.s.2, class=current) found but no text match at specified offset (5)! Expected '', got 't', full text: 'appeltaart"

The problem is with the offset of the <t-hbr> element in the second <str>. IMHO this should be 5, as folialint accepts. And, because it has a size of 0, the next <str> ALSO has that same offset, 5. This is a BUG.
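To make the expected behaviour concrete, here is a minimal Python sketch of the offset check, assuming that <t-hbr> contributes an empty string to the text. It uses only the standard library; `flat_text` and `check_offsets` are hypothetical helper names, and this is not how folialint or foliavalidator are actually implemented.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The <s> element from the file above, reduced to what the check needs.
SENTENCE = """
<s xmlns="http://ilk.uvt.nl/folia" xml:id="bug.div.p.s.2">
  <t>appel<t-hbr>-</t-hbr>taart</t>
  <str xml:id="bug.div.p.s.2.str.1"><t offset="0">appel</t></str>
  <str xml:id="bug.div.p.s.2.str.2"><t offset="5"><t-hbr>-</t-hbr></t></str>
  <str xml:id="bug.div.p.s.2.str.3"><t offset="5">taart</t></str>
</s>
""".strip()


def flat_text(elem):
    """Text of a <t> element; ASSUMPTION: <t-hbr> contributes nothing."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != NS + "t-hbr":
            parts.append(flat_text(child))
        parts.append(child.tail or "")
    return "".join(parts)


def check_offsets(sentence):
    """Compare every <str> text against the sentence text at its offset."""
    full = flat_text(sentence.find(NS + "t"))
    results = {}
    for s in sentence.findall(NS + "str"):
        t = s.find(NS + "t")
        offset = int(t.get("offset"))
        expected = flat_text(t)
        found = full[offset:offset + len(expected)]
        results[s.get(XML_ID)] = (found == expected)
    return full, results


full, results = check_offsets(ET.fromstring(SENTENCE))
print(full)     # appeltaart
print(results)  # all three <str> elements match under this assumption
```

Under that assumption, offset 5 is consistent for both the empty <t-hbr> string and the following "taart".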

Neither program really handles this very well, though, as can be shown by replacing the offset with an out-of-range value like -1, 10 or 2894234. In that case both programs will validate the FoLiA.
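One way such values could slip through, sketched in Python: a plain substring comparison against an empty expected string succeeds at any position. Whether either tool works like this internally is an assumption on my part; `naive_match` and `bounded_match` are hypothetical names.

```python
def naive_match(full_text, offset, expected):
    """Substring comparison without any range check."""
    return full_text[offset:offset + len(expected)] == expected


def bounded_match(full_text, offset, expected):
    """Same comparison, but reject offsets outside 0..len(full_text)."""
    if not 0 <= offset <= len(full_text):
        return False
    return naive_match(full_text, offset, expected)


text = "appeltaart"
for offset in (5, -1, 10, 2894234):
    print(offset, naive_match(text, offset, ""), bounded_match(text, offset, ""))
```

A range check rejects -1 and 2894234, but 10 (the end of the text) still passes, so range checking alone cannot tell whether the offset of a zero-width element is actually the right one.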

SOLUTION: I suppose that FoLiA elements with the IMPLICITSPACE property should be defined to add 0 to the offset, AND when an offset attribute is added, it should have a meaningful, correct value. That might prove difficult, as the offset should be equal to that of the NEXT non-TextMarkup element, and there is no obligation to have an offset attribute there (or even for that element to exist).
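The "add 0 to the offset" rule can be sketched as a running cursor over the text pieces, assuming the pieces are contiguous and given in document order. `assign_offsets` is a hypothetical helper, not part of any FoLiA library.

```python
def assign_offsets(pieces):
    """Return the start offset of each piece; empty pieces advance nothing."""
    offsets = []
    cursor = 0
    for text in pieces:
        offsets.append(cursor)
        cursor += len(text)  # a zero-width piece adds 0
    return offsets


# "appel", the empty <t-hbr> contribution, then "taart"
print(assign_offsets(["appel", "", "taart"]))  # [0, 5, 5]
```

With this rule a zero-width element automatically gets the offset of whatever follows it, or the length of the full text when nothing follows, which sidesteps the need for the next element to carry an offset attribute of its own.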