Closed proycon closed 7 years ago
An initial proposal for a solution
Introduce a new common attribute textclass on all token and span annotations. If omitted, the value of textclass defaults to current. This ensures backwards compatibility and lets us omit an explicit class assignment by default (saving on verbosity), just as <t class="current"> is equivalent to <t>.
An example of this on token annotation:
<w class="WORD" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <metric value="lexicon" class="modernisationsource"/>
  <pos head="ADJ" class="ADJ(prenom,basis,met-e,stan)" confidence="0.885728"
       textclass="contemporary">
    <feat class="prenom" subset="positie"/>
    <feat class="basis" subset="graad"/>
    <feat class="met-e" subset="buiging"/>
    <feat class="stan" subset="naamval"/>
  </pos>
  <lemma class="aangenaam" textclass="contemporary" />
  <morphology>
    <morpheme>
      <t class="contemporary">aangenaam</t>
    </morpheme>
    <morpheme>
      <t class="contemporary">e</t>
    </morpheme>
  </morphology>
  <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.1">
    <pos head="N" class="N(soort,ev,basis,zijd,stan)" confidence="0.976563">
      <feat class="soort" subset="ntype"/>
      <feat class="ev" subset="getal"/>
      <feat class="basis" subset="graad"/>
      <feat class="zijd" subset="genus"/>
      <feat class="stan" subset="naamval"/>
    </pos>
  </alt>
  <alt auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.alt.2">
    <lemma class="aengename"/>
  </alt>
  <altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.head.86.s.1.w.3.altlayers.1">
    <morphology>
      <morpheme>
        <t>aengename</t>
      </morpheme>
    </morphology>
  </altlayers>
</w>
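The defaulting rule can be illustrated with a small sketch. This is not the libfolia or FoLiA library API; it uses only Python's standard `xml.etree.ElementTree` on a cut-down copy of the word above, with the FoLiA namespace, IDs and features left out for brevity. The helper names `effective_textclass` and `text_for` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the <w> element above (namespace, IDs and features omitted).
WORD = """
<w class="WORD">
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ(prenom,basis,met-e,stan)" textclass="contemporary"/>
  <lemma class="aangenaam" textclass="contemporary"/>
  <alt auth="no">
    <pos class="N(soort,ev,basis,zijd,stan)"/>
  </alt>
  <alt auth="no">
    <lemma class="aengename"/>
  </alt>
</w>
"""


def effective_textclass(annotation):
    """Return the annotation's textclass, falling back to the proposed default."""
    return annotation.get("textclass", "current")


def text_for(word, textclass):
    """Return the text of `word` in the given text class, or None if absent."""
    for t in word.findall("t"):  # direct <t> children only
        if t.get("class", "current") == textclass:
            return t.text
    return None


word = ET.fromstring(WORD)
resolved = []
for tag in ("pos", "lemma"):
    for ann in word.iter(tag):  # includes annotations nested in <alt>
        tc = effective_textclass(ann)
        resolved.append((tag, ann.get("class"), tc, text_for(word, tc)))

for row in resolved:
    print(row)
```

Here the authoritative pos and lemma resolve to the contemporary text, while the alternatives, which carry no textclass, fall back to current and therefore to the historical form.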
And on span annotation (the entity is a false-positive named entity but that's what Frog outputs):
<entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1">
  <entity class="per" confidence="0.444326" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.entities.1.entity.3"
          textclass="contemporary">
    <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleide"/>
  </entity>
</entities>
<altlayers auth="no" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2">
  <entities xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1">
    <entity class="per" confidence="0.401266" set="http://ilk.uvt.nl/folia/sets/frog-ner-nl" xml:id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.altlayers.2.entities.1.entity.3">
      <wref id="_aar004aard01_01.TEI.2.text.body.div.div.p.88.s.1.w.9" t="Verbleyde"/>
    </entity>
  </entities>
</altlayers>
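A rough consistency check for the span case might look as follows. It assumes that the referenced word (not shown in the example above) carries "Verbleyde" as its current text and "Verbleide" as its contemporary text, which is inferred from the two wref elements. IDs are shortened, the FoLiA namespace is omitted, and `word_texts` and `mismatched_wrefs` are hypothetical helpers built on the standard library, not part of any FoLiA tool.

```python
import xml.etree.ElementTree as ET

# Simplified sentence; the <w> element and its two text classes are an assumption.
DOC = """
<s>
  <w id="w.9">
    <t>Verbleyde</t>
    <t class="contemporary">Verbleide</t>
  </w>
  <entities>
    <entity class="per" textclass="contemporary">
      <wref id="w.9" t="Verbleide"/>
    </entity>
  </entities>
  <altlayers auth="no">
    <entities>
      <entity class="per">
        <wref id="w.9" t="Verbleyde"/>
      </entity>
    </entities>
  </altlayers>
</s>
"""


def word_texts(root):
    """Map word id -> {text class: text}; a missing class counts as 'current'."""
    return {
        w.get("id"): {t.get("class", "current"): t.text for t in w.findall("t")}
        for w in root.iter("w")
    }


def mismatched_wrefs(root):
    """List (word id, textclass, found t, expected t) for inconsistent wrefs."""
    texts = word_texts(root)
    problems = []
    for span in root.iter("entity"):
        textclass = span.get("textclass", "current")
        for wref in span.findall("wref"):
            expected = texts.get(wref.get("id"), {}).get(textclass)
            if wref.get("t") != expected:
                problems.append((wref.get("id"), textclass, wref.get("t"), expected))
    return problems


root = ET.fromstring(DOC)
print(mismatched_wrefs(root))
```

Under the proposal both entities pass the check: each wref shows the text of the class its entity was derived from.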
Things to note that are part of this proposal:
- The entity/wref/@t attribute now corresponds to the textclass of the entity.
- textclass="current" is the default and need not be serialised.
- textclass does not make an element unique like set does. So for e.g. pos, given a certain set, only one can be authoritative and the rest must be alternatives (in alt), regardless of textclass.
- morphology, phonology, correction and str are not span/token annotation, and this issue does not apply to them because they explicitly take text content. They do not need, and do not get, a textclass attribute. Whether we want to allow it on certain higher-order elements is debatable: alignment, desc, comment and perhaps even metric may be candidates where it might make sense too.
I think this proposal is accepted. I'll probably release a forward-compatible v1.4.3 release that already allows for this, so it's not held up by the other more complicated issues for v1.5.
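The uniqueness rule from the notes above (textclass does not make an element unique) could be checked roughly like this. It is a sketch under assumptions: the namespace is omitted, `duplicate_authoritative` is a hypothetical helper, and a missing set attribute is keyed with the placeholder "(default)" rather than resolved against the document's declarations.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_authoritative(word, tag):
    """Return the sets that have more than one authoritative `tag` annotation.

    Only direct children of the word are counted, since annotations wrapped
    in <alt> are alternatives. textclass is deliberately not part of the key.
    """
    counts = Counter(ann.get("set", "(default)") for ann in word.findall(tag))
    return sorted(s for s, n in counts.items() if n > 1)


# One authoritative pos plus one alternative: allowed.
VALID = ET.fromstring("""
<w>
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ" textclass="contemporary"/>
  <alt auth="no"><pos class="N"/></alt>
</w>
""")

# Two authoritative pos elements in the same set, differing only in textclass.
INVALID = ET.fromstring("""
<w>
  <t>aengename</t>
  <t class="contemporary">aangename</t>
  <pos class="ADJ" textclass="contemporary"/>
  <pos class="N"/>
</w>
""")

print(duplicate_authoritative(VALID, "pos"))
print(duplicate_authoritative(INVALID, "pos"))
```

The second word is flagged even though its two pos elements have different text classes, which is the behaviour the note describes.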
Implemented the textclass attribute in the libfolia 1.4.3 branch too, and in foliatest.
FoLiA allows multiple text content elements on the same structural element, provided each carries a different class. This indicates an alternative text for the same element and is used, for instance, for pre-OCR vs. post-OCR or pre-normalisation vs. post-normalisation distinctions (e.g. by Ticcl), or for transliterations (e.g. a text in Chinese characters as well as pinyin). The standard text class is always current (the only case in which FoLiA predefines a class).
In Nederlab, historical text is modernised; the modernised text is stored in the contemporary text class and the original historical text is in the default current class. Now the issue is that they want to annotate both spelling variants. Software such as Frog lets you specify which text class to use as input, and it is viable to run Frog multiple times, with some post-processing, and add alternative annotations that are based on a different text class input. This is what I currently implemented, and it works okay.
The problem with this approach, however, is that the relation between annotations and text classes is not explicit. It is now merely a convention in my Nederlab pipeline that the alternatives are based on the historical text, whilst the authoritative annotations are based on the contemporary variant.
This is a limitation in FoLiA that should be thought about and remedied. In FoLiA, annotations are tied to structural elements (e.g. words/tokens) rather than to any particular text surface form (all textual forms are equally valid and describe the same thing). How do we establish a link with a text class?
For morphology/phonology and corrections this issue does not occur, as those explicitly use text content elements; but for normal token annotation and span annotation (wref) the link is not explicit, and an elegant solution needs to be devised. A symptom of this problem is also apparent in the serialisation of the wref/@t attribute, which now always contains the current layer even if the span annotation was derived from another text layer.
This issue also encroaches upon another (deliberate) limitation in FoLiA: the general inability to have multiple tokenisations (though there are already some ways around this).