Relation between annotations and text classes is not explicit

proycon commented 7 years ago

FoLiA allows multiple text content elements on the same structural element, provided that each carries a different class. Each additional text content element gives an alternative text for the same element. This is used, for instance, for pre-OCR vs. post-OCR or pre-normalisation vs. post-normalisation distinctions (e.g. by Ticcl), or for transliterations (e.g. a text in Chinese characters as well as pinyin). The standard text class is always current (the only case in which FoLiA predefines a class).
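To make the text class mechanism concrete, here is a minimal Python sketch that collects the text of each class on one structural element. It uses only the standard library and a simplified, namespace-free fragment modelled on the examples in this issue; the helper name text_by_class and the word id are hypothetical, and this is not how any FoLiA library exposes it.

```python
import xml.etree.ElementTree as ET

# Simplified fragment (assumption: the FoLiA namespace is left out for brevity).
WORD = """
<w xml:id="example.w.1">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
</w>
"""

def text_by_class(element):
    """Map each text class on a structural element to its text.

    Only direct <t> children are inspected. A <t> without a class
    attribute counts as the predefined class "current".
    """
    texts = {}
    for t in element.findall("t"):
        cls = t.get("class", "current")
        if cls in texts:
            # Two text content elements may not share a class.
            raise ValueError(f"duplicate text class: {cls}")
        texts[cls] = t.text or ""
    return texts

word = ET.fromstring(WORD)
texts = text_by_class(word)
print(texts)
```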

In Nederlab, historical text is modernised: the modernised text is stored in the contemporary text class and the original historical text is in the default current class. The issue is that they want to annotate both spelling variants. Software such as Frog allows you to specify which text class to use as input, and it is viable to run Frog multiple times, with some post-processing, and add alternative annotations that are based on a different text class input. This is what I have currently implemented, and it works okay.

The problem with this approach, however, is that the relation between annotations and text classes is not explicit. It is now merely a convention in my Nederlab pipeline that the alternatives are based on the historical text, whilst the authoritative annotations are based on the contemporary variant.

This is a limitation in FoLiA that should be thought about and remedied. In FoLiA, annotations are tied to structural elements (e.g. words/tokens) rather than to any particular text surface form (all textual forms are equally valid and describe the same thing). How do we establish a link with a text class?

For morphology/phonology and corrections this issue does not occur, as those explicitly use text content elements; but for normal token annotation and span annotation (wref) it does, and an elegant solution needs to be devised. A symptom of this problem is also apparent in the serialisation of the wref/@t attribute, which currently always contains the text of the current class, even if the span annotation was derived from another text class.

This issue also encroaches upon another (deliberate) limitation in FoLiA: the general inability to have multiple tokenisations (though there are already some ways around this).

proycon commented 7 years ago

An initial proposal for a solution

Introduce a new common attribute textclass on all token and span annotations. If omitted, the value of textclass defaults to current. This ensures backwards compatibility and lets us omit an explicit class assignment by default (saving on verbosity), just as we do with <t class="current"> == <t>.

An example of this on token annotation:

         <w class="WORD" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3">
              <t>aengename</t>
              <t class="contemporary">aangename</t>
              <metric value="lexicon" class="modernisationsource"/>
              <pos head="ADJ" class="ADJ(prenom,basis,met-e,stan)" confidence="0.885728" 
                textclass="contemporary">
                <feat class="prenom" subset="positie"/>
                <feat class="basis" subset="graad"/>
                <feat class="met-e" subset="buiging"/>
                <feat class="stan" subset="naamval"/>
              </pos>
              <lemma class="aangenaam" textclass="contemporary" />
              <morphology>
                <morpheme>
                  <t class="contemporary">aangenaam</t>
                </morpheme>
                <morpheme>
                  <t class="contemporary">e</t>
                </morpheme>
              </morphology>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.1">
                <pos head="N" class="N(soort,ev,basis,zijd,stan)" confidence="0.976563">
                  <feat class="soort" subset="ntype"/>
                  <feat class="ev" subset="getal"/>
                  <feat class="basis" subset="graad"/>
                  <feat class="zijd" subset="genus"/>
                  <feat class="stan" subset="naamval"/>
                </pos>
              </alt>
              <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.2">
                <lemma class="aengename"/>
              </alt>
              <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.altlayers.1">
                <morphology>
                  <morpheme>
                    <t>aengename</t>
                  </morpheme>
                </morphology>
              </altlayers>
            </w>

And on span annotation (the entity is a false-positive named entity but that's what Frog outputs):

            <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1">
              <entity class="per" confidence="0.444326" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1.entity.3"
                 textclass="contemporary">
                <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleide"/>
              </entity>
           </entities>
           <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2">
              <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1">        
                <entity class="per" confidence="0.401266" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1.entity.3">
                  <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleyde"/>
                </entity>
              </entities>
            </altlayers>
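How a consumer could resolve the proposed attribute can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library on a trimmed, namespace-free version of the token example above, and the helper names effective_textclass and source_text as well as the word id are hypothetical.

```python
import xml.etree.ElementTree as ET

DEFAULT_TEXTCLASS = "current"

# Trimmed version of the token example (assumption: no FoLiA namespace).
WORD = """
<w xml:id="example.w.3">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <lemma class="aangenaam" textclass="contemporary"/>
  <alt auth="no">
    <lemma class="aengename"/>
  </alt>
</w>
"""

def effective_textclass(annotation):
    """Return the annotation's textclass, falling back to "current"."""
    return annotation.get("textclass", DEFAULT_TEXTCLASS)

def source_text(word, annotation):
    """Return the text of the word in the class the annotation is based on."""
    wanted = effective_textclass(annotation)
    for t in word.findall("t"):
        if t.get("class", DEFAULT_TEXTCLASS) == wanted:
            return t.text
    return None  # the word has no text in that class

word = ET.fromstring(WORD)
authoritative = word.find("lemma")    # direct child, carries textclass
alternative = word.find("alt/lemma")  # inside <alt>, no textclass given

print(effective_textclass(authoritative), source_text(word, authoritative))
print(effective_textclass(alternative), source_text(word, alternative))
```

With the defaulting rule, the alternative lemma needs no attribute at all: it resolves to the current class, which is the historical text in the Nederlab setup.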

Things to note that are part of this proposal:

The wref/@t attribute now corresponds to the textclass of the enclosing entity.
textclass="current" is the default and need not be serialised.
The default rules for unique elements apply: textclass does not make an element unique the way set does. So for pos, for example, given a certain set, only one can be authoritative and the rest must be alternatives (in alt), regardless of textclass.
Elements such as morphology, phonology, correction and str are not span/token annotation and this issue does not apply to them (they explicitly take text content). They do not need, and do not get, a textclass attribute. Whether we want to allow it on certain higher-order elements is debatable: alignment, desc, comment and perhaps even metric are candidates where it might make sense too.
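Two of the points above lend themselves to a mechanical check. The following Python sketch is one possible way to do that, not a description of any existing validator: it works on a simplified, namespace-free fragment, the function names and the set name example-set are hypothetical, and it assumes that wref elements without a t attribute are simply skipped.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# ElementTree expands the predefined xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

DOC = """
<s>
  <w xml:id="example.w.9">
    <t>Verbleyde</t>
    <t class="contemporary">Verbleide</t>
    <pos class="N" set="example-set" textclass="contemporary"/>
  </w>
  <entities>
    <entity class="per" textclass="contemporary">
      <wref id="example.w.9" t="Verbleide"/>
    </entity>
  </entities>
</s>
"""

def wref_mismatches(root):
    """List ids of wrefs whose t differs from the word's text
    in the textclass of the enclosing entity."""
    words = {w.get(XML_ID): w for w in root.iter("w")}
    problems = []
    for entity in root.iter("entity"):
        textclass = entity.get("textclass", "current")
        for wref in entity.findall("wref"):
            if wref.get("t") is None:
                continue  # assumption: nothing to compare without t
            word = words[wref.get("id")]
            texts = {t.get("class", "current"): t.text
                     for t in word.findall("t")}
            if wref.get("t") != texts.get(textclass):
                problems.append(wref.get("id"))
    return problems

def duplicate_authoritative(word, tag="pos"):
    """List sets that have more than one authoritative annotation of the
    given tag directly on the word. textclass is deliberately ignored."""
    counts = Counter(a.get("set") for a in word.findall(tag))
    return [s for s, n in counts.items() if n > 1]

root = ET.fromstring(DOC)
print(wref_mismatches(root))
print(duplicate_authoritative(root.find("w")))
```

Because duplicate_authoritative only looks at direct children of the word, annotations wrapped in alt are not counted, which matches the rule that everything beyond the single authoritative annotation must be an alternative.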

proycon commented 7 years ago

I think this proposal is accepted. I'll probably publish a forward-compatible v1.4.3 release that already allows for this, so it's not held up by the other, more complicated issues for v1.5.

kosloot commented 7 years ago

Implemented the textclass attribute in the libfolia 1.4.3 branch too, and in foliatest.

proycon / folia

Relation between annotations and text classes is not explicit #29