proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

add aliases (short names) for set definitions. #31

Closed kosloot closed 5 years ago

kosloot commented 7 years ago

At the moment, having more than one annotation set in scope leads to a lot of bloat. For example:

<w xml:id="WR-P-E-J-0000000001.p.1.s.2.w.16">
  <t>genealogie</t>
  <pos class="N(soort,ev,basis,zijd,stan)" set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn"/>
  <lemma class="genealogie"/>
  <morphology>
    <morpheme class="complex">
      <t>genealogie</t>
      <feat class="[[genealogisch]adjective[ie]]noun/singular" subset="structure"/>
      <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
      <morpheme class="complex">
        <feat class="N_A*" subset="applied_rule"/>
        <feat class="[[genealogisch]adjective[ie]]noun" subset="structure"/>
        <pos class="N" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
        <morpheme class="stem">
          <t>genealogisch</t>
          <pos class="A" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex"/>
        </morpheme>
        <morpheme class="affix">
          <t>ie</t>
          <feat class="[ie]" subset="structure"/>
        </morpheme>
      </morpheme>
      <morpheme class="inflection">
        <feat class="singular" subset="inflection"/>
      </morpheme>
    </morpheme>
  </morphology>
</w>

The attributes set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn" and especially set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex" are repeated many times.

Maybe it would be a good idea to introduce short-hand labels, like cgn-set and celex-set, to avoid all the bloat.

Something like this:

<pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn" annotator="frog" annotatortype="auto" label="cgn"/>
<pos-annotation annotator="frog-mbma-1.0" annotatortype="auto" datetime="2017-04-20T16:48:45" set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex" label="celex"/>

Everywhere a set is used, you may use the label instead. When serializing, the label is preferred if one is provided. Labels must be unique, of course.
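
To make the proposal concrete, here is a minimal Python sketch of the lookup it implies. It is not libfolia or PyNLPl code: the helper names (build_label_map, resolve_set, serialize_set) are hypothetical, the declarations are trimmed to the set and label attributes, and the attribute is called label as proposed above.

```python
import xml.etree.ElementTree as ET

# Declarations as proposed above, trimmed to the attributes that matter here.
DECLARATIONS = """<annotations>
  <pos-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/frog-mbpos-cgn" label="cgn"/>
  <pos-annotation set="http://ilk.uvt.nl/folia/sets/frog-mbpos-clex" label="celex"/>
</annotations>"""


def build_label_map(annotations_xml, attr="label"):
    """Map each short label to its full set URL, rejecting duplicate labels."""
    labels = {}
    for decl in ET.fromstring(annotations_xml):
        label = decl.get(attr)
        if label is None:
            continue  # declaration without a short name
        if label in labels:
            raise ValueError(f"duplicate label: {label!r}")
        labels[label] = decl.get("set")
    return labels


def resolve_set(value, labels):
    """Reading: accept either a label or a full set URL."""
    return labels.get(value, value)


def serialize_set(set_url, labels):
    """Writing: prefer the label when one was declared for this set."""
    for label, url in labels.items():
        if url == set_url:
            return label
    return set_url


labels = build_label_map(DECLARATIONS)
print(resolve_set("celex", labels))
print(serialize_set("http://ilk.uvt.nl/folia/sets/frog-mbpos-clex", labels))
```

With this, <pos class="N" set="celex"/> and the long form would resolve to the same set, and a writer would emit the short form.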

proycon commented 7 years ago

Good idea, I'd suggest calling them alias rather than label perhaps, as label is something in set definitions already (the human readable label).

kosloot commented 7 years ago

I added an 'alias' mechanism to libfolia. It lives in the 'alias' branch for now, as it breaks the ABI.

proycon commented 7 years ago

Still to be implemented for pynlpl (proycon/pynlpl#33)

kosloot commented 5 years ago

Well.... Given this document:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink" xml:id="doc" version="0.8" generator="libfolia-v0.4">
  <metadata>
    <annotations>
      <division-annotation set="a-set" alias="a"/>
      <division-annotation set="b-set" alias="b"/>
      <token-annotation set="a-set" alias="b"/>
      <token-annotation set="b-set" alias="a"/>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div set="a-set">
      <s id="s.1">
        <w id="w.1" class="WORD" set="b">
          <t>test</t>
        </w>
      </s>
    </div>
    <div set="b">
      <s id="s.2">
        <w id="w.2" class="WORD" set="b-set">
          <t>test</t>
        </w>
      </s>
    </div>
  </text>
</FoLiA>

libfolia's folialint accepts it, but pynlpl's foliavalidator says:

Error on line 5: Invalid attribute alias for element division-annotation
Error on line 5: Element annotations has extra content: division-annotation
Error on line 3: Element metadata failed to validate content
Error on line 2: Element FoLiA failed to validate content
VALIDATION ERROR against RelaxNG schema (stage 1/2), in tests/aliases.xml
Invalid attribute alias for element division-annotation, line 5

Which one is right here?
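
For reference, a small Python sketch of how the test document reads if aliases are scoped per annotation type. That scoping is an assumption inferred from the document itself (the alias "b" points at "b-set" for divisions but at "a-set" for tokens). It says nothing about which validator is right. The DECLARES mapping and the function names are hypothetical, and the document is condensed but otherwise the same.

```python
import xml.etree.ElementTree as ET

NS = "{http://ilk.uvt.nl/folia}"

# Condensed copy of the test document above.
DOC = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="doc">
  <metadata>
    <annotations>
      <division-annotation set="a-set" alias="a"/>
      <division-annotation set="b-set" alias="b"/>
      <token-annotation set="a-set" alias="b"/>
      <token-annotation set="b-set" alias="a"/>
    </annotations>
  </metadata>
  <text xml:id="text">
    <div set="a-set"><s id="s.1"><w id="w.1" class="WORD" set="b"><t>test</t></w></s></div>
    <div set="b"><s id="s.2"><w id="w.2" class="WORD" set="b-set"><t>test</t></w></s></div>
  </text>
</FoLiA>"""

# Assumed mapping from a declaration element to the element it declares.
DECLARES = {"division-annotation": "div", "token-annotation": "w"}


def alias_tables(root):
    """Build one alias -> set table per annotated element name."""
    tables = {}
    for decl in root.iter():
        name = decl.tag.replace(NS, "")
        alias = decl.get("alias")
        if name not in DECLARES or alias is None:
            continue
        table = tables.setdefault(DECLARES[name], {})
        if alias in table:
            raise ValueError(f"duplicate alias {alias!r} for {name}")
        table[alias] = decl.get("set")
    return tables


def resolved_sets(root, tables):
    """List (element name, id, resolved set) in document order."""
    result = []
    for elem in root.iter():
        name = elem.tag.replace(NS, "")
        value = elem.get("set")
        if name in tables and value is not None:
            # A value that is not a declared alias is taken as a full set name.
            result.append((name, elem.get("id"), tables[name].get(value, value)))
    return result


root = ET.fromstring(DOC)
for row in resolved_sets(root, alias_tables(root)):
    print(row)
```

Under that reading, w.1 (set="b") ends up in a-set and the second div (set="b") in b-set, so the same alias resolves differently depending on the element it appears on.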

proycon commented 5 years ago

Right, that is addressed and solved in #65 (still to be released), so I think we can close this one.