proycon / foliapy

An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
https://proycon.github.io/folia
GNU General Public License v3.0

text offset error not detected? #15

Closed kosloot closed 4 years ago

kosloot commented 4 years ago

In an example given in https://github.com/proycon/folia/issues/75 there seems to be an offset error that goes undetected by foliavalidator. Full example:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <sentence-annotation />
      <string-annotation />
    </annotations>
  </metadata>
  <text xml:id="bug">
    <s xml:id="s.1">
      <t>Dit is een test</t>
      <t class="ocr">D!t 1S tezt</t>
      <str xml:id="str.1">
        <t offset="0">Dit</t>
        <t offset="0" class="ocr">D!t</t>
      </str>
      <str xml:id="str.2">
        <t offset="4">is</t>
        <t offset="4" class="ocr">1S</t>
      </str>
      <str xml:id="str.4">
        <t offset="11">test</t>
        <t offset="7" class="ocr">tezt</t>
      </str>
      <str xml:id="str.3"> <!-- I'm deliberately messing with the ordering here to emphasise that it has no meaning with strings-->
        <t offset="7">een</t>
      </str>
      <!-- and below an extra string example to emphasise that strings are not tokens (this overlaps with str.1 and str.2) -->
      <str xml:id="str.bonus">
        <t offset="3">t is</t>
        <t offset="3" class="ocr">t 1S</t>
      </str>
    </s>
  </text>
</FoLiA>

foliavalidator is happy with it:

foliavalidator tests/textproblem_3.xml 
Validated successfully: tests/textproblem_3.xml

folialint rejects this:

folialint tests/textproblem_3.xml
tests/textproblem_3.xml failed: Unresolvable text: Text for str(ID=str.bonus, textclass='current'), has incorrect offset 3
    original msg=Unresolvable text: Reference (ID s.1,class='current') found, but no text match at offset=3 Expected 't is' but got ' is '

This rejection is the desired behaviour: the offset for str.bonus should be 2, not 3.
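For reference, the check can be reproduced by hand with plain string slicing. The following is only a minimal sketch, not the actual foliavalidator or folialint code: the function names `offset_is_valid` and `candidate_offsets` are made up for illustration, and it assumes that offsets are zero-based and counted in Unicode code points, which is how Python indexes `str`.

```python
def offset_is_valid(parent_text: str, substring: str, offset: int) -> bool:
    """Return True if `substring` occurs in `parent_text` exactly at `offset`."""
    return parent_text[offset:offset + len(substring)] == substring


def candidate_offsets(parent_text: str, substring: str) -> list:
    """Return every offset at which `substring` occurs in `parent_text`."""
    found = []
    start = parent_text.find(substring)
    while start != -1:
        found.append(start)
        # Continue one position further so overlapping matches are found too.
        start = parent_text.find(substring, start + 1)
    return found


# Text of s.1 per text class, copied from the example above.
sentence_text = {"current": "Dit is een test", "ocr": "D!t 1S tezt"}

# (text class, string text, declared offset) for str.bonus.
bonus = [("current", "t is", 3), ("ocr", "t 1S", 3)]

for textclass, text, offset in bonus:
    parent = sentence_text[textclass]
    ok = offset_is_valid(parent, text, offset)
    print(textclass, offset, "ok" if ok else "MISMATCH",
          "candidates:", candidate_offsets(parent, text))
```

With the declared offset 3, both text classes of str.bonus fail this check, and the only matching position in either sentence text is 2.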

proycon commented 4 years ago

You're right, it should be 2 and foliavalidator should detect it.