proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl, this contains higher-level tools that use the library as well as the full documentation, validation schemas, and set definitions
GNU General Public License v3.0

Offset problems with "empty" TextMarkup elements #107

Open kosloot opened 1 year ago

kosloot commented 1 year ago

Given this somewhat unusual FoLiA file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="" xmlns="" xml:id="bugxx" generator="libfolia-v1.11" version="2.5">
  <metadata type="native">
      <text-annotation set=""/>
  <text xml:id="bug">
    <div xml:id="bug.div">
      <p xml:id="bug.div.p">
        <s xml:id="bug.div.p.s.2">
          <str xml:id="bug.div.p.s.2.str.1">
            <t offset="0">appel</t>
          <str xml:id="bug.div.p.s.2.str.2">
            <t offset="5"><t-hbr>-</t-hbr></t>
          <str xml:id="bug.div.p.s.2.str.3">
            <t offset="5">taart</t>

This is accepted by folialint (latest Git version), but rejected by foliavalidator. The latter states:

TEXT VALIDATION ERROR: Text for String, ID bug.div.p.s.2.str.2, textclass current, has incorrect offset 5 or invalid reference: Reference (ID bug.div.p.s.2, class=current) found but no text match at specified offset (5)! Expected '', got 't', full text: 'appeltaart"
(also checked against older rules prior to FoLiA v2.4.1)
VALIDATION ERROR on full parse by library (stage 2/3), in tests/bug52-3.xml
UnresolvableTextContent: Reference (ID bug.div.p.s.2, class=current) found but no text match at specified offset (5)! Expected '', got 't', full text: 'appeltaart"

The problem is with the offset of the <t-hbr> element in the second <str>. IMHO this should be 5, as folialint accepts. And, as it has a size of 0, the next <str> ALSO has that same offset, 5. This is a BUG.

Neither program really handles this very well, though, as can be shown by replacing the offset with an out-of-band value such as -1, 10 or 2894234. In that case both programs will validate the FoLiA.
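
One possible explanation for why an out-of-band offset goes unnoticed on an empty text is sketched below in Python. This is NOT the actual code of folialint or foliavalidator; the function names `naive_offset_ok` and `strict_offset_ok` are hypothetical, and the sketch only assumes that an offset is checked by comparing a slice of the parent text with the element's own text.

```python
FULL_TEXT = "appeltaart"  # text of the parent sentence in the example above


def naive_offset_ok(full_text: str, piece: str, offset: int) -> bool:
    """Compare the slice at `offset` with `piece`, without any bounds check."""
    return full_text[offset:offset + len(piece)] == piece


def strict_offset_ok(full_text: str, piece: str, offset: int) -> bool:
    """Same comparison, but the offset must lie inside the parent text."""
    if not 0 <= offset <= len(full_text):
        return False
    return full_text[offset:offset + len(piece)] == piece


# An empty piece (such as the <t-hbr> string) matches an empty slice at any
# offset, so the naive check cannot tell 5 apart from -1 or 2894234.
for offset in (5, -1, 10, 2894234):
    print(offset, naive_offset_ok(FULL_TEXT, "", offset),
          strict_offset_ok(FULL_TEXT, "", offset))
```

The bounds check rejects -1 and 2894234, but it still accepts 10, because that is the end of the text and a valid position for an empty string. Catching that case needs knowledge of where the element actually sits, which a slice comparison alone cannot provide.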

SOLUTION: I suppose that FoLiA elements with the IMPLICITSPACE property should be defined to add 0 to the offset, AND when an offset attribute is added, it should have a meaningful, correct value. That might prove difficult, as the offset should be equal to that of the NEXT non-TextMarkup element, and there is no obligation for that element to have an offset attribute (or even to exist).
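
The "add 0 to the offset" rule can be illustrated with a minimal Python sketch. It assumes that the text pieces of the three <str> elements are given as a hand-written list, not parsed from FoLiA, and that a zero-width element contributes an empty string; `expected_offsets` is a hypothetical helper, not part of any FoLiA library.

```python
def expected_offsets(pieces):
    """Return the start offset of each piece within the concatenated text.

    A zero-width piece (empty string) gets the current position and adds 0
    to it, so the piece that follows ends up with the same offset.
    """
    offsets = []
    position = 0
    for text in pieces:
        offsets.append(position)
        position += len(text)
    return offsets


# The three <str> elements from the example: "appel", the empty <t-hbr>
# string, and "taart".
pieces = ["appel", "", "taart"]
print(expected_offsets(pieces))  # [0, 5, 5]
```

Under this rule the empty string and the following "taart" both get offset 5, which matches the offsets in the example file.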