proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Relation between annotations and text classes is not explicit #29

Closed proycon closed 7 years ago

proycon commented 7 years ago

FoLiA allows multiple text content elements on the same structural element, provided that each carries a different class. An additional text content element indicates an alternative text for the same element and is used, for instance, for pre-OCR vs. post-OCR or pre-normalisation vs. post-normalisation distinctions (e.g. by Ticcl), or for transliterations (e.g. a text in Chinese characters as well as pinyin). The standard text class is always current (the only case in which FoLiA predefines a class).

In Nederlab, historical text is modernised: the modernised text is stored in the contemporary text class and the original historical text in the default current class. The issue now is that they want to annotate both spelling variants. Software such as Frog allows you to specify which text class to use as input, and it is viable to run Frog multiple times, with some post-processing, and add alternative annotations that are based on a different text class input. This is what I have currently implemented, and it works okay.

The problem with this approach, however, is that the relation between annotations and text classes is not explicit. It is now merely a convention in my Nederlab pipeline that the alternatives are based on the historical text, whilst the authoritative annotations are based on the contemporary variant.

This is a limitation in FoLiA that should be thought about and remedied. In FoLiA, annotations are tied to structural elements (e.g. words/tokens) rather than to any particular text surface form (all textual forms are equally valid and describe the same thing). How do we establish a link with a text class?

For morphology/phonology and corrections this issue does not occur, as those explicitly use text content elements; but for normal token annotation and span annotation (wref) the link is not explicit, and an elegant solution needs to be devised. A symptom of this problem is also apparent in the serialisation of the wref/@t attribute, which now always contains the text of the current layer, even if the span annotation was derived from another text layer.

This issue also encroaches upon another (deliberate) limitation in FoLiA: the general inability to have multiple tokenisations (though there are already some ways around this).

proycon commented 7 years ago

An initial proposal for a solution


Introduce a new common attribute textclass on all token and span annotations. If omitted, the value of textclass defaults to current. This ensures backwards compatibility and lets us omit the explicit class assignment by default (and save on verbosity), just as we do with <t class="current"> == <t>.

An example of this on token annotation:

            <w class="WORD" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3">
              <t>aengename</t>
              <t class="contemporary">aangename</t>
              <metric value="lexicon" class="modernisationsource"/>
              <pos head="ADJ" class="ADJ(prenom,basis,met-e,stan)" confidence="0.885728" 
                textclass="contemporary">
                <feat class="prenom" subset="positie"/>
                <feat class="basis" subset="graad"/>
                <feat class="met-e" subset="buiging"/>
                <feat class="stan" subset="naamval"/>
              </pos>
              <lemma class="aangenaam" textclass="contemporary" />
              <morphology>
                <morpheme>
                  <t class="contemporary">aangenaam</t>
                </morpheme>
                <morpheme>
                  <t class="contemporary">e</t>
                </morpheme>
              </morphology>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.1">
                <pos head="N" class="N(soort,ev,basis,zijd,stan)" confidence="0.976563">
                  <feat class="soort" subset="ntype"/>
                  <feat class="ev" subset="getal"/>
                  <feat class="basis" subset="graad"/>
                  <feat class="zijd" subset="genus"/>
                  <feat class="stan" subset="naamval"/>
                </pos>
              </alt>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.2">
                <lemma class="aengename"/>
              </alt>
              <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.altlayers.1">
                <morphology>
                  <morpheme>
                    <t>aengename</t>
                  </morpheme>
                </morphology>
              </altlayers>
            </w>
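
The default rule in this proposal (a missing textclass means current, just as a missing class on <t> does) can be sketched from the consumer's side. This is only an illustration under assumptions: it uses Python's standard xml.etree.ElementTree on a simplified, namespace-free fragment modelled on the example above, with shortened IDs, and it is not the API of PyNLPl or libfolia. The helper names effective_textclass and text_for are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free fragment modelled on the example above
# (IDs shortened; an assumption made for illustration only).
FRAGMENT = """
<w xml:id="w.3">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <lemma class="aangenaam" textclass="contemporary"/>
  <alt auth="no" xml:id="w.3.alt.2">
    <lemma class="aengename"/>
  </alt>
</w>
"""

DEFAULT_TEXTCLASS = "current"


def effective_textclass(annotation):
    """Return the annotation's textclass, falling back to 'current' if omitted."""
    return annotation.get("textclass", DEFAULT_TEXTCLASS)


def text_for(word, textclass):
    """Return the text of the <t> child of `word` in the given class, or None."""
    for t in word.findall("t"):
        # A <t> without a class attribute is in the 'current' class.
        if t.get("class", DEFAULT_TEXTCLASS) == textclass:
            return t.text
    return None


word = ET.fromstring(FRAGMENT)
for lemma in word.iter("lemma"):
    textclass = effective_textclass(lemma)
    print(lemma.get("class"), textclass, text_for(word, textclass))
```

With this fragment, the authoritative lemma resolves to the contemporary text and the alternative lemma to the current (historical) text, without relying on any pipeline convention.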

And on span annotation (the entity is a false-positive named entity but that's what Frog outputs):

            <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1">
              <entity class="per" confidence="0.444326" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1.entity.3"
                 textclass="contemporary">
                <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleide"/>
              </entity>
            </entities>
            <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2">
              <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1">
                <entity class="per" confidence="0.401266" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1.entity.3">
                  <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleyde"/>
                </entity>
              </entities>
            </altlayers>
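
With an explicit textclass on span annotations, the wref/@t value can be checked against the text layer the annotation was actually derived from. The following is a rough sketch under assumptions, not part of any FoLiA library: it uses Python's standard xml.etree.ElementTree on a simplified, namespace-free fragment with shortened IDs, the <w> element and its two <t> children are inferred from the wref values in the example above, and wref_mismatches is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Simplified, namespace-free fragment (assumption: the word carries both
# text classes, as implied by the two wref/@t values in the example).
FRAGMENT = """
<s xml:id="s.1">
  <w xml:id="s.1.w.9">
    <t>Verbleyde</t>
    <t class="contemporary">Verbleide</t>
  </w>
  <entities>
    <entity class="per" textclass="contemporary">
      <wref id="s.1.w.9" t="Verbleide"/>
    </entity>
  </entities>
  <altlayers auth="no">
    <entities>
      <entity class="per">
        <wref id="s.1.w.9" t="Verbleyde"/>
      </entity>
    </entities>
  </altlayers>
</s>
"""


def wref_mismatches(sentence):
    """List (word id, textclass, wref text) where wref/@t disagrees with
    the word's text in the entity's effective text class."""
    words = {w.get(XML_ID): w for w in sentence.iter("w")}
    mismatches = []
    for entity in sentence.iter("entity"):
        # An omitted textclass means 'current', as in the proposal.
        textclass = entity.get("textclass", "current")
        for wref in entity.findall("wref"):
            word = words[wref.get("id")]
            texts = {t.get("class", "current"): t.text for t in word.findall("t")}
            if wref.get("t") != texts.get(textclass):
                mismatches.append((wref.get("id"), textclass, wref.get("t")))
    return mismatches


sentence = ET.fromstring(FRAGMENT)
print(wref_mismatches(sentence))  # prints [] for this fragment
```

Before this proposal, a serialiser that always wrote the current text into wref/@t would have produced t="Verbleyde" on the contemporary entity, which a check like this one would report as a mismatch.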

Things to note that are part of this proposal:

proycon commented 7 years ago

I think this proposal is accepted. I'll probably put out a forward-compatible v1.4.3 release that already allows for this, so it's not held up by the other, more complicated issues for v1.5.

kosloot commented 7 years ago

Implemented the textclass attribute in the libfolia 1.4.3 branch too, and in foliatest.