proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

'type conversions' in corrections #77

Open kosloot opened 4 years ago

kosloot commented 4 years ago

Consider this very strange FoLiA file:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="doc" generator="libfolia-v1.11" version="2.2">
  <metadata type="native">
    <annotations>
      <correction-annotation />
      <text-annotation />
      <pos-annotation set="bla"/>
      <paragraph-annotation />
    </annotations>
  </metadata>
  <text xml:id="bug">
    <correction>
      <new>
        <p xml:id="p">
          <t>paragraaf</t>
        </p>
      </new>
      <original>
        <pos xml:id="s" class="n">
        </pos>
      </original>
    </correction>
  </text>
</FoLiA>

Both foliavalidator and folialint accept this, but I assume this is abusing the correction node. My impression is that we don't want a correction to modify the "type" of the subnode. So I suggest adding some limitation here, preferably that all arguments are of the same type, like all <w> or all <t>.
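A minimal sketch of what such a "same type" check could look like, written against Python's standard `xml.etree.ElementTree` rather than any FoLiA library. The function names `correction_child_types` and `has_type_conversion` are hypothetical and only illustrate the idea; this is not how foliavalidator or libfolia implement it.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "{http://ilk.uvt.nl/folia}"


def local_name(element):
    """Return the tag name without its namespace prefix."""
    return element.tag.split("}", 1)[-1]


def correction_child_types(correction):
    """Collect the element types found directly under <new> and <original>.

    Hypothetical helper: only these two branches are inspected here.
    """
    types = set()
    for branch in ("new", "original"):
        container = correction.find(FOLIA_NS + branch)
        if container is not None:
            types.update(local_name(child) for child in container)
    return types


def has_type_conversion(correction):
    """True when the correction mixes more than one element type."""
    return len(correction_child_types(correction)) > 1


# Reduced version of the file above (metadata omitted for brevity).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <correction>
      <new><p><t>paragraaf</t></p></new>
      <original><pos class="n"/></original>
    </correction>
  </text>
</FoLiA>"""

root = ET.fromstring(SAMPLE)
for correction in root.iter(FOLIA_NS + "correction"):
    if has_type_conversion(correction):
        print("type conversion:", sorted(correction_child_types(correction)))
```

Note that a check this strict treats every mix of element types as suspicious, including purely structural ones, so it is a starting point rather than a final rule.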

proycon commented 4 years ago

Agreed, type conversions should probably be checked and banned, especially if it's also a category conversion (like inline annotation to structural, as in your example).

kosloot commented 7 months ago

Seems solved for libfolia: I added a check on type consistency.

kosloot commented 7 months ago

@proycon your remark, "Especially if it's also a category conversion (like inline annotation to structural as in your example)", got me thinking. The solution that I implemented in libfolia is probably too harsh. It disallows changing 2 sentences into 1 paragraph with 2 embedded sentences, as in the example below:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.3">
  <metadata type="native">
    <annotations>
      <token-annotation/>
      <paragraph-annotation/>
      <sentence-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation/>
    </annotations>
  </metadata>
  <text xml:id="Walter.text">
    <correction xml:id="Walter.correction.1">
      <new>
        <p xml:id="par">
          <s xml:id="Walter.corr.s.1">
            <t>Dit is een zin.</t>
          </s>
          <s xml:id="Walter.corr.s.2">
            <t>Dit is nog een zin.</t>
          </s>
        </p>
      </new>
      <original auth="no">
        <s xml:id="Walter.s.1">
          <t>Dit is een zin.</t>
        </s>
        <s xml:id="Walter.s.2">
          <t>Dit is nog een zin</t>
        </s>
      </original>
    </correction>
  </text>
</FoLiA>

Correcting structure should be possible. And maybe correcting the annotation type too? This will get rather complicated then.

BUT, bug alert! The following file is invalid FoLiA (as it should be):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.3">
  <metadata type="native">
    <annotations>
      <token-annotation/>
      <paragraph-annotation/>
      <sentence-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation/>
    </annotations>
  </metadata>
  <text xml:id="Walter.text">
    <row xml:id="par">
      <cell>
        <w>
          <t>Dit is een zin.
          </t>
        </w>
      </cell>
    </row>
  </text>
</FoLiA>

But we can create this abomination using a correction:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="folia.xsl"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="Walter" generator="libfolia-v2.12" version="2.5.3">
  <metadata type="native">
    <annotations>
      <token-annotation/>
      <paragraph-annotation/>
      <sentence-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
      <correction-annotation/>
    </annotations>
  </metadata>
  <text xml:id="Walter.text">
    <correction xml:id="Walter.correction.1">
      <new>
        <row xml:id="par">
          <cell>
            <w>
              <t>Dit is een zin.
              </t>
            </w>
          </cell>
        </row>
      </new>
      <original auth="no">
        <s xml:id="Walter.s.1">
          <t>Dit is een zin.</t>
        </s>
      </original>
    </correction>
  </text>
</FoLiA>

This is horrible! I assume that the functions that check whether a tag is appendable should look INTO the correction. A lot of work and thinking is needed! @proycon please comment.
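To make the "look INTO the correction" idea concrete, here is a rough sketch using Python's standard `xml.etree.ElementTree`. It treats `<correction>` and its branches as transparent, so that whatever sits inside them is checked against the element that contains the correction. The `ALLOWED_CHILDREN` table is a tiny illustrative subset invented for this example, not the real FoLiA specification, and `find_violations` is a hypothetical name, not a libfolia function.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; NOT the real FoLiA containment rules.
ALLOWED_CHILDREN = {
    "text": {"p", "s", "w", "table"},
    "p": {"s", "w", "t"},
    "s": {"w", "t"},
    "w": {"t"},
    "table": {"row"},
    "row": {"cell"},
    "cell": {"p", "s", "w"},
}

# Elements that are "looked through" when checking containment.
TRANSPARENT = {"correction", "new", "original", "current", "suggestion"}


def local_name(element):
    """Return the tag name without its namespace prefix."""
    return element.tag.split("}", 1)[-1]


def find_violations(element, effective_parent=None):
    """Return (parent, child) pairs that break ALLOWED_CHILDREN.

    Correction wrappers do not count as parents: their content is checked
    against the nearest non-transparent ancestor instead.
    """
    name = local_name(element)
    violations = []
    if name in TRANSPARENT:
        next_parent = effective_parent
    else:
        allowed = ALLOWED_CHILDREN.get(effective_parent)
        # Parents missing from the table are simply not checked.
        if allowed is not None and name not in allowed:
            violations.append((effective_parent, name))
        next_parent = name
    for child in element:
        violations.extend(find_violations(child, next_parent))
    return violations


# Reduced version of the file above (metadata omitted for brevity).
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia">
  <text>
    <correction>
      <new><row><cell><w><t>Dit is een zin.</t></w></cell></row></new>
      <original><s><t>Dit is een zin.</t></s></original>
    </correction>
  </text>
</FoLiA>"""

for parent, child in find_violations(ET.fromstring(SAMPLE)):
    print(f"<{child}> is not allowed inside <{parent}>")
```

With this approach the `<row>` hidden inside `<new>` is judged as if it were a direct child of `<text>`, which is the same rule that makes the uncorrected file invalid.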

kosloot commented 7 months ago

Additional questions about WHICH corrections are acceptable:

  1. Structure to structure seems OK to me, like adding a Paragraph around Sentences.
  2. Annotation to annotation? Like modifying a Pos to a Lemma??? Scary.
  3. Annotation to structure, or vice versa? That was the original issue, and may be ruled out, I assume.
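The three cases above could be expressed as a small policy function. This is only a sketch to frame the discussion: the `CATEGORY` table covers just the element types mentioned in this thread, and the names `judge_conversion`, "accept", "review" and "reject" are hypothetical, not part of libfolia or any FoLiA tool.

```python
STRUCTURE = "structure"
INLINE = "inline"

# Only the element types mentioned in this thread; other types raise KeyError.
CATEGORY = {
    "p": STRUCTURE,
    "s": STRUCTURE,
    "w": STRUCTURE,
    "pos": INLINE,
    "lemma": INLINE,
}


def judge_conversion(original_types, new_types):
    """Classify a correction by the element types on each side.

    Returns "accept", "review" or "reject" (hypothetical verdicts).
    """
    # A pure insertion or deletion has nothing to convert.
    if not original_types or not new_types:
        return "accept"
    original_cats = {CATEGORY[t] for t in original_types}
    new_cats = {CATEGORY[t] for t in new_types}
    # Case 3: crossing categories (annotation <-> structure) is ruled out.
    if original_cats != new_cats or len(original_cats) != 1:
        return "reject"
    # Case 1: structure to structure, e.g. wrapping sentences in a paragraph.
    if original_cats == {STRUCTURE}:
        return "accept"
    # Case 2: annotation to annotation is fine for the same type,
    # and needs a decision when the type changes (pos -> lemma).
    return "accept" if set(original_types) == set(new_types) else "review"


print(judge_conversion({"s"}, {"p"}))        # case 1
print(judge_conversion({"pos"}, {"lemma"}))  # case 2
print(judge_conversion({"pos"}, {"p"}))      # case 3
```

Keeping case 2 as a separate "review" verdict leaves that question open instead of silently deciding it either way.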