proycon / foliapy

An extensive Python library for dealing with FoLiA (Format for Linguistic Annotation) documents, a rich XML-based format for linguistic annotation finding application in Natural Language Processing (NLP). This library was formerly part of PyNLPl.
https://proycon.github.io/folia
GNU General Public License v3.0

improving handling of default reference for text offset #29

Open kosloot opened 2 years ago

kosloot commented 2 years ago

When a text content element has an offset without an explicit reference, the offset is by definition relative to the text content of the nearest structure parent. In general this is fine, but some structure elements MAY NOT carry text, notably <table> and <row>, and possibly more. I suggest extending the search for a suitable parent to the first structure parent that is allowed to carry text.
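A minimal sketch of the suggested lookup, not the actual libfolia or foliapy code. The `Node` class, the `NO_TEXT_TAGS` set and both function names are hypothetical and only illustrate the idea: walk up from the element that owns the text, skip every ancestor that may not carry text, and check the offset against the first ancestor that may.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: only the two elements named in this issue are listed here;
# the real set of structure elements that may not carry text may be larger.
NO_TEXT_TAGS = {"table", "row"}


@dataclass
class Node:
    """Hypothetical stand-in for a structure element (not the foliapy API)."""
    tag: str
    id: str
    parent: Optional["Node"] = None
    text: Optional[str] = None  # text content of this element, if any


def default_offset_reference(element: Node) -> Optional[Node]:
    """Return the nearest ancestor that is allowed to carry text."""
    ancestor = element.parent
    while ancestor is not None:
        if ancestor.tag not in NO_TEXT_TAGS:
            return ancestor
        ancestor = ancestor.parent  # skip elements such as <row> and <table>
    return None


def offset_is_valid(element: Node, offset: int) -> bool:
    """Check that the element's text occurs at `offset` in the reference text."""
    ref = default_offset_reference(element)
    if ref is None or ref.text is None or element.text is None:
        return False
    return ref.text[offset:offset + len(element.text)] == element.text


# Same nesting as the sample in this issue: div > table > row > cell.
div = Node("div", "tabel.text.div.1", text="rij 1 veld 1")
table = Node("table", "tabel.", parent=div)
row = Node("row", "tabel.row.1", parent=table)
cell = Node("cell", "tabel.row.1.cell.1", parent=row, text="rij 1 veld 1")

print(default_offset_reference(cell).id)  # tabel.text.div.1
print(offset_is_valid(cell, 0))           # True
```

With this lookup the cell's offset is resolved against the enclosing <div> instead of the <row>, which has no text of its own.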

This is a simple addition, which I have already implemented in libfolia. The following sample FoLiA demonstrates the problem:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="tabel" generator="libfolia-v2.10" version="2.5.1">
  <metadata type="native">
    <annotations>
      <paragraph-annotation/>
      <division-annotation />
      <string-annotation/>
      <table-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="tabel.text">
    <div xml:id="tabel.text.div.1">
      <t>rij 1 veld 1</t>
      <table xml:id="tabel.">
        <row xml:id="tabel.row.1">
          <cell xml:id="tabel.row.1.cell.1">
            <t offset="0">rij 1 veld 1</t>
          </cell>
        </row>
      </table>
    </div>
  </text>
</FoLiA>

The most recent folialint from libfolia approves this.

But the current foliavalidator states:

EXT VALIDATION ERROR: Text for Cell, ID tabel.row.1.cell.1, textclass current, has incorrect offset 0 or invalid reference: Reference (ID tabel.row.1) has no such text (class=current)
(also checked against older rules prior to FoLiA v2.4.1)
VALIDATION ERROR on full parse by library (stage 2/3), in cell-offset-bug.xml
UnresolvableTextContent: Reference (ID tabel.row.1) has no such text (class=current)

proycon commented 5 months ago

I think that's a good solution for these edge cases, yes.

Moving this issue to foliapy.