proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

[proposal] token annotations on multi-word spans (group annotations) and discussion of other multi-word issues. #51

Closed: proycon closed this issue 5 years ago

proycon commented 6 years ago

Since its inception, FoLiA has made a distinction between annotations on single tokens (or other single structural elements) and annotations made on spans of tokens. These are called token annotations and span annotations respectively; the former are implemented inline, using the natural hierarchy in XML, whereas the latter form a stand-off layer. Each particular annotation type (e.g. lemma/pos/entities/syntax etc.) is implemented in one of these forms. Whether a particular annotation type is implemented as a token or span annotation depends on the nature of the annotation type.

FoLiA is, by design, limited to a single tokenisation, or no tokenisation at all, in which case actual linguistic annotation abilities are limited. Tokens are represented as <w> (word) elements. How tokenisation should be performed is not prescribed by FoLiA but left to the tokeniser. Whitespace in a token is not prohibited (as long as the token contains more than just whitespace) so the notion of a word or token is a flexible one and the two concepts are not strongly distinguished.

However, more expressive flexibility appears to be needed, as challenges arise in the following situations:

1) A token annotation (e.g. pos, lemma) cannot be assigned to a single token but only to multiple tokens, an extra complication being that the tokens may be discontinuous. Consider separable verbs in Dutch, for instance: in the sentence "ik hou mijn adem in", we may want to tag "hou ... in" with lemma inhouden and part-of-speech verb. This is currently not possible in FoLiA.

2) A token annotation (e.g. pos, lemma) cannot be assigned to a single token but only to a part of it, for instance in the case of a contraction (e.g. "it's"). This is already largely solved by the morphology layer in FoLiA.

Both are symptoms of the same underlying theme: the lack of atomicity of the token/word. The most straightforward solution would seem to be to retokenize the document, but this is too rigid and not always feasible or desirable. Sometimes maintaining an explicit distinction between tokens/words/groups of words is needed.

Multi-token words

Consider the following FoLiA mock-example of three tokens which together form a compound noun:

<w xml:id="w.1">
    <t>dry</t>
</w>
<w xml:id="w.2" space="no">
    <t>-</t>
</w>
<w xml:id="w.3" space="no">
    <t>cleaning</t>
</w>

FoLiA already has facilities to express that a group of tokens forms some type of entity (named or otherwise), or to correct the tokens to a single new one (<correction>). But in cases where all this is undesirable, where you want to keep the tokens as-is because they were expressed that way in the original, but still want to express that they form a single word with a single part-of-speech tag and lemma, new facilities are needed to use token annotations with spans.

Looking at other formats, NAF makes an explicit distinction between tokens (wordforms) and what it calls terms, and then proceeds to annotate largely (though not always consistently) on the terms rather than the wordforms. An extension of FoLiA is therefore also needed for the NAF-FoLiA converter (see issue cltl/NAFFoLiAPy#4; as such this may also be relevant for @antske).

I propose the following: adding a facility to FoLiA that can group words (like any normal span annotation element, no news here) but that allows for token annotations within its scope.

I think the simplest and least intrusive way to do this is to expand the existing entity annotation, example:

<entities>
    <entity xml:id="wg.1" class="compoundnoun">
        <wref id="w.1" />
        <wref id="w.2" />
        <wref id="w.3" />
        <pos class="N" />
        <lemma class="dry-cleaning" />
    </entity>
</entities>

This would cover non-contiguous spans just as well. Such an annotation would be declared in the header as follows:

<entity-annotation set="...." type="complex" />

The type attribute is new here and would default to simple, the current behaviour. The value complex is used for the proposed extension, to explicitly denote that token annotations are allowed on entities. I want this attribute so we can explicitly distinguish the two: documents with the new complex entities pose extra challenges for FoLiA tools, so we want to know from the declaration already whether this will happen.
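To make the implications for tools a bit more concrete, here is a minimal Python sketch of how a reader might resolve such a group. It assumes the proposed syntax exactly as written above, uses a namespace-free fragment for brevity (a real FoLiA document declares the FoLiA namespace), and the name read_groups is hypothetical, not part of any FoLiA library.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Namespace-free fragment for illustration only (assumption: the proposed syntax).
FRAGMENT = """
<s>
  <w xml:id="w.1"><t>dry</t></w>
  <w xml:id="w.2" space="no"><t>-</t></w>
  <w xml:id="w.3" space="no"><t>cleaning</t></w>
  <entities>
    <entity xml:id="wg.1" class="compoundnoun">
      <wref id="w.1" />
      <wref id="w.2" />
      <wref id="w.3" />
      <pos class="N" />
      <lemma class="dry-cleaning" />
    </entity>
  </entities>
</s>
"""

def read_groups(root):
    """Resolve each entity's wrefs to token text and collect its inline pos/lemma."""
    words = {w.get(XML_ID): w.findtext("t") for w in root.iter("w")}
    groups = []
    for entity in root.iter("entity"):
        tokens = [words[ref.get("id")] for ref in entity.findall("wref")]
        inline = {child.tag: child.get("class")
                  for child in entity if child.tag in ("pos", "lemma")}
        groups.append({"id": entity.get(XML_ID), "tokens": tokens, "annotations": inline})
    return groups

groups = read_groups(ET.fromstring(FRAGMENT))
print(groups)
```

Because the group only refers to tokens by id, the same lookup works unchanged when the referenced tokens are not adjacent.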

Alternatives to this solution would be:

The motivation for the proposed solution is to keep changes as minimal and simple as possible and not introduce too many new things. Despite the simplicity of the change, it does have quite some implications for the tools and libraries.

I do not propose that other span elements can in turn refer (wref) to entities rather than tokens/words: there are already facilities for doing that anyway, and it would add unnecessary ambiguity.

Non-atomic tokens

In cases where we have a token annotation (e.g. pos, lemma) that cannot be assigned to a single token but only to a part of it, we can use the already existing morphology layer.

Consider the example of the English contraction it's:

<w xml:id="w.1">
    <t>it's</t>
    <morphology>
        <morpheme>
            <t>it</t>
            <pos class="pron" />
            <lemma class="it" />
        </morpheme>
        <morpheme>
            <t>'s</t>
            <pos class="v" />
            <lemma class="is" />
        </morpheme>
    </morphology>
</w>

Here I want to stress that this is not the only possible representation for this contraction, as we can just as well express it with two tokens as shown in the next example. It's not FoLiA's job to favour one over the other, but that is a decision of the creator/researcher/tokeniser, FoLiA just has to provide the facilities that make both models possible:

<w xml:id="w.1" space="no">
    <t>it</t>
    <pos class="pron" />
    <lemma class="it" />
</w>
<w xml:id="w.2">
    <t>'s</t>
    <pos class="v" />
    <lemma class="is" />
</w>

The morphology notation in FoLiA is very powerful and nestable. Consider the Arabic token فيبيتك. This consists of three words meaning "in your house", transliterated in the example below for ease of reading:

<w xml:id="w.1">
    <t>fiybaytika</t>
    <morphology>
        <morpheme class="prefix" function="lexical">
            <t>fiy</t>
            <pos class="PREP" />
            <lemma class="fiy" />
        </morpheme>
        <morpheme class="stem" function="lexical">
            <t>bayti</t>
            <pos class="N">
                <feat subset="case" class="prep" />
            </pos>
            <lemma class="bayt" />
            <morpheme class="stem">
                <t>bayt</t>
                <pos class="N" />
            </morpheme>
            <morpheme class="suffix" function="inflectional">
                <t>i</t>
                <desc>prepositional marker</desc>
            </morpheme>
        </morpheme>
        <morpheme class="suffix" function="lexical">
            <t>ka</t>
            <pos class="PRON" />
            <lemma class="anta" />
        </morpheme>
    </morphology>
</w>

Morphemes (and phonemes) can be referred to explicitly, just like words/tokens, from any span annotation (using wref).
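As an illustration of what the nesting means for a consumer, the following minimal Python sketch walks the morphemes depth-first and reports the pos and lemma found at each level. It works on a trimmed, namespace-free copy of the example above; the helper names class_of and walk_morphemes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free copy of the example above (attributes not needed here are dropped).
WORD = """
<w>
  <t>fiybaytika</t>
  <morphology>
    <morpheme class="prefix">
      <t>fiy</t><pos class="PREP" /><lemma class="fiy" />
    </morpheme>
    <morpheme class="stem">
      <t>bayti</t><pos class="N" /><lemma class="bayt" />
      <morpheme class="stem"><t>bayt</t><pos class="N" /></morpheme>
      <morpheme class="suffix"><t>i</t></morpheme>
    </morpheme>
    <morpheme class="suffix">
      <t>ka</t><pos class="PRON" /><lemma class="anta" />
    </morpheme>
  </morphology>
</w>
"""

def class_of(element, tag):
    """Return the class of a direct child annotation, or None if it is absent."""
    child = element.find(tag)
    return child.get("class") if child is not None else None

def walk_morphemes(parent, depth=0):
    """Yield (depth, text, pos, lemma) for every morpheme, depth-first."""
    for morpheme in parent.findall("morpheme"):
        yield (depth, morpheme.findtext("t"),
               class_of(morpheme, "pos"), class_of(morpheme, "lemma"))
        yield from walk_morphemes(morpheme, depth + 1)

root = ET.fromstring(WORD)
rows = list(walk_morphemes(root.find("morphology")))
for depth, text, pos, lemma in rows:
    print("  " * depth + f"{text} pos={pos} lemma={lemma}")
```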

My question (mainly for @kdepuydt, @JessedeDoes) is whether this solution is sufficient (it can capture contractions, clitics, etc.) and linguistically accurate enough (e.g. grouping it all under morphology). If there are counter-examples, I'd be very interested.

Compound classes

One point that arises from current annotations in the CRM and Gysseling corpora (historical Dutch) is the use of what I call compound PoS-tags and lemmas. Take the Arabic example above: the token itself does not have a PoS tag, but one may want to force a tag anyway and assign something like prep+n+pron. Recall that FoLiA itself does not define the tagset, so this would be valid. However, the semantics of it being some kind of compound class would not be formalised in any way. The question arises whether we need facilities for explicitly representing compound classes. Perhaps we should allow FoLiA set definitions to define operators such as +, allowing for more expressivity in classes. This as opposed to really defining operators in FoLiA itself, because that begs the question which operators are needed, and that is more a property of the vocabulary in question. In categorial grammars, for instance, one would want to define / and \. In other vocabularies, perhaps more set-theoretic operators such as and make sense. If operators are introduced, then of course bracketing and operator precedence become factors to take into account as well. The class would cease to be a simple reference and become a mini-language in its own right, although for many tools this is of no consequence.

I'm not yet including a specific proposal for this, but would very much like to hear your thoughts on this direction.
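To illustrate why bracketing matters once an operator like + appears inside a class, here is a minimal Python sketch that splits a compound class at top-level operators only. This is purely hypothetical: FoLiA defines no such operator, and split_compound is an illustrative name, not an existing function.

```python
def split_compound(cls, operator="+"):
    """Split a class string on `operator`, ignoring operators inside parentheses."""
    parts, current, depth = [], [], 0
    for ch in cls:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError(f"unbalanced parentheses in {cls!r}")
        if ch == operator and depth == 0:
            # Top-level operator: close the current component.
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    if depth != 0:
        raise ValueError(f"unbalanced parentheses in {cls!r}")
    parts.append("".join(current))
    return parts

print(split_compound("prep+n+pron"))
print(split_compound("LID(bep,zonder)+ADJ(ev,met-e)"))
```

A tool that does not care about compound classes can keep treating the whole string as one opaque class; only a deep validator would need a split like this.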

kosloot commented 6 years ago

A few remarks:

  1. I would prefer a wgroup annotation over a new complex entity. IMHO it is clearer, and less intrusive. Detecting this is as easy as searching for 'complex'
  2. I wonder what will happen when we also add 'real' morphological information. Wouldn't that interfere with your proposal? Probably not (using different 'set',) but still. Introducing yet another annotation is not desirable too.
  3. Extending FoLiA to a complete programming language with all kind of operators is a bad plan imho. Things like that should be left to external tools. If you need a set that is the merge of other sets, then just define a new set.

As a side note: although words may contain spaces, most of our tools (starting with the UCTO tokenizer, but also the MBT tagger) are not capable of handling embedded spaces. A lot of work has to be done there.

JessedeDoes commented 6 years ago
  1. I have been using part up to now because it is agnostic with respect to the nature of the partial word
  2. I have no objection to words also being classified as a kind of morphemes. Others might.
  3. How would we tag 'simple' words? A single morpheme
    <w xml:id="w.1">
    <t>it</t>
    <morphology>
        <morpheme class='word'>
            <t>it</t>
            <pos class="pron" />
            <lemma class="it" />
        </morpheme>
    </morphology>
    </w>

    or no morphemes?

  4. For the separable verbs etc, I now use dependencies. This is not a good solution for cases like "ge gheven", but I feel it is OK for discontinuously split words (Alpino has an 'svp' dependency)
<w xml:id="w.33227">
<t class="default">vyt</t>
<lemma class="uit"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="285"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="ADV(bw-deel-ww)"/>
<part class="wordPart" n="1">
<feat subset="deel" class="bw-deel-ww"/>
<feat subset="pos" class="ADV"/>
<feat subset="lemma" class="uit"/>
</part>
</w>
<w xml:id="w.33228">
<t class="default">heeft</t>
<lemma class="hebben"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="213"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="WW(hulp-of-koppel,pv,tgw,met-t)"/>
<part class="wordPart" n="1">
<feat subset="wwtype" class="hulp-of-koppel"/>
<feat subset="wvorm" class="pv"/>
<feat subset="pvtijd" class="tgw"/>
<feat subset="pvagr" class="met-t"/>
<feat subset="pos" class="WW"/>
<feat subset="lemma" class="hebben"/>
</part>
</w>
<w xml:id="w.33229">
<t class="default">gegheven</t>
<lemma class="geven"/>
<pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="274"/>
<pos set="http://rdf.ivdnt.org/pos/cgn-mnl" class="WW(hoofd,part,met-n,hoofddeel-ww)"/>
<part class="wordPart" n="1">
<feat subset="wwtype" class="hoofd"/>
<feat subset="wvorm" class="part"/>
<feat subset="buiging" class="met-n"/>
<feat subset="deel" class="hoofddeel-ww"/>
<feat subset="pos" class="WW"/>
<feat subset="lemma" class="geven"/>
</part>
</w>
...
<dependencies>
<dependency class="separable-part">
<hd>
<wref id="w.33229" t="gegheven"/>
</hd>
<dep>
<wref id="w.33227" t="vyt"/>
</dep>
</dependency>
</dependencies>
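
For readers who want to consume such a layer, a minimal Python sketch of reading the head and dependent of each dependency follows. It uses a namespace-free copy of the layer above, and read_dependencies is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Namespace-free copy of the dependency layer above.
LAYER = """
<dependencies>
  <dependency class="separable-part">
    <hd>
      <wref id="w.33229" t="gegheven"/>
    </hd>
    <dep>
      <wref id="w.33227" t="vyt"/>
    </dep>
  </dependency>
</dependencies>
"""

def read_dependencies(layer):
    """Return (class, head ids, dependent ids) for each dependency in the layer."""
    result = []
    for dependency in layer.findall("dependency"):
        heads = [ref.get("id") for ref in dependency.findall("hd/wref")]
        dependents = [ref.get("id") for ref in dependency.findall("dep/wref")]
        result.append((dependency.get("class"), heads, dependents))
    return result

relations = read_dependencies(ET.fromstring(LAYER))
print(relations)
```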

proycon commented 6 years ago

Thanks for the reactions thus far!

@JessedeDoes, in reaction to your points:

1) As I mentioned yesterday, the use of part like this is creative and not invalid, since the validator does allow it, but it's not really as intended either. part is a structural element that splits the parent structure into parts (with deliberately vague semantics). As soon as you associate a t with part, things will go wrong (as the word then has deeper text, which will be authoritative). The documentation does state:

The part element, on the other hand, is more abstract and plays a role on a deeper level. It can be embedded within paragraphs, sentences, and most other structure elements, even words, though we have to again emphasize it should not be used for morphology, there are other solutions for that!

In your solution the set for part becomes a bit convoluted (though of course everybody is free to make up their own sets), it includes pos and lemma as subsets and circumvents the actual <pos> and <lemma> elements which are intended for this. In situations where you used <part>, <morphology> should suffice I think (if there are any reasons against this I'd gladly hear so as they fit perfectly in this discussion):

You did:

                    <w xml:id="Corpus-Gysseling-1-1_9464f509-128c-410e-833f-a9439ee8b83a.text.1.body.1.p.1.part.4.w.17">                                                                                                                                       
                       <t class="default">dyserine.</t>
                       <lemma class="DE+IJZEREN"/>
                       <pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="470+101"/>
                       <pos set="http://rdf.ivdnt.org/pos/cgn-mnl"
                            class="LID(bep,zonder)+ADJ(ev,met-e)"/>
                       <part class="wordPart" n="1">
                          <feat subset="lwtype" class="bep"/>
                          <feat subset="buiging" class="zonder"/>
                          <feat subset="pos" class="LID"/>
                          <feat subset="lemma" class="DE"/>
                       </part>
                       <part class="wordPart" n="2">
                          <feat subset="getal" class="ev"/>
                          <feat subset="buiging" class="met-e"/>
                          <feat subset="pos" class="ADJ"/>
                          <feat subset="lemma" class="IJZEREN"/>
                       </part>
                    </w> 

It should be:

                    <w xml:id="Corpus-Gysseling-1-1_9464f509-128c-410e-833f-a9439ee8b83a.text.1.body.1.p.1.part.4.w.17">
                       <t class="default">dyserine.</t>
                       <lemma class="DE+IJZEREN"/>
                       <pos set="http://rdf.ivdnt.org/pos/mnl-morfcode" class="470+101"/>
                       <pos set="http://rdf.ivdnt.org/pos/cgn-mnl"
                            class="LID(bep,zonder)+ADJ(ev,met-e)"/>
                       <morphology>
                          <morpheme class="wordPart" n="1">
                             <pos class="LID">
                                <feat subset="lwtype" class="bep"/>
                                <feat subset="buiging" class="zonder"/>
                             </pos>
                             <lemma class="DE"/>
                          </morpheme>
                          <morpheme class="wordPart" n="2">
                             <pos class="ADJ">
                                <feat subset="getal" class="ev"/>
                                <feat subset="buiging" class="met-e"/>
                             </pos>
                             <lemma class="IJZEREN"/>
                          </morpheme>
                       </morphology>
                    </w>

(You can optionally also associate a text with the morphemes, like <t>d</t> and <t>yserine</t> in this case)

2)

I have no objection to words also being classified as a kind of morphemes. Others might.

Good. I don't want to enforce any linguistic theory with FoLiA and leave as much to the user as possible. I simply count all substructure inside tokens/words as morphology (and anything above the word/token level would be syntax). If there are counter-examples where this is grossly inaccurate, I'd be glad to hear them.

3) There's no need for morphology with simple words, this suffices:

<w xml:id="w.1">
    <t>it</t>
    <pos class="pron" />
    <lemma class="it" />
</w>

Adding an extra morphology layer with one root morpheme and the same information is unconventional but possible if you insist, but it's rather redundant.

4)

For the separable verbs etc, I now use dependencies. This is not a good solution for cases like "ge gheven", but I feel it is OK for discontinuously split words (Alpino has an 'svp' dependency)

Yes, I think dependencies are quite a fair solution here to express the relation.

@kosloot I'll add a separate comment to address your concerns, otherwise I write too much again :)
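The rewrite from part to morphology described under point 1 is mechanical, so it can be sketched in a few lines of Python. This is a rough illustration rather than a tested converter: it assumes namespace-free input, assumes every part carries feat elements with subsets pos and lemma as in the Gysseling example, and parts_to_morphology is a hypothetical name.

```python
import xml.etree.ElementTree as ET

# Shortened, namespace-free version of the "You did" example above.
WORD = """
<w>
  <t class="default">dyserine.</t>
  <part class="wordPart" n="1">
    <feat subset="lwtype" class="bep"/>
    <feat subset="pos" class="LID"/>
    <feat subset="lemma" class="DE"/>
  </part>
  <part class="wordPart" n="2">
    <feat subset="getal" class="ev"/>
    <feat subset="pos" class="ADJ"/>
    <feat subset="lemma" class="IJZEREN"/>
  </part>
</w>
"""

def parts_to_morphology(word):
    """Replace <part> children by one <morphology> layer with a <morpheme> per part."""
    parts = word.findall("part")
    if not parts:
        return word
    morphology = ET.SubElement(word, "morphology")
    for part in parts:
        # Carry over the part's own attributes (class, n) to the morpheme.
        morpheme = ET.SubElement(morphology, "morpheme", dict(part.attrib))
        pos = ET.SubElement(morpheme, "pos")
        for feat in part.findall("feat"):
            subset, value = feat.get("subset"), feat.get("class")
            if subset == "pos":
                pos.set("class", value)          # becomes the <pos> class
            elif subset == "lemma":
                ET.SubElement(morpheme, "lemma", {"class": value})
            else:
                ET.SubElement(pos, "feat", {"subset": subset, "class": value})
        word.remove(part)
    return word

word = parts_to_morphology(ET.fromstring(WORD))
print(ET.tostring(word, encoding="unicode"))
```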

proycon commented 6 years ago

I would prefer a wgroup annotation over a new complex entity. IMHO it is clearer, and less intrusive. Detecting this is as easy as searching for 'complex'

That was my initial thought as well. My concern that led me down the other path was mainly that people might confuse entity and wgroup, or fail to see the difference between them (one allowing token annotations, the other not), resulting in situations where people use wgroup and entity for the same things (without any token annotations). It would be confusing for a tool interested in parsing named entities or any kind of multiword unit if it had to take into account the possibility that the user put it in wgroup instead of entity.

But you have a point; adding a type parameter to the declaration might seem a bit ad-hoc since that would be a new thing.

I wonder what will happen when we also add 'real' morphological information. Wouldn't that interfere with your proposal? Probably not (using different 'set',) but still.

I don't think this is any less real morphological information :) But I see what you mean. You can just add morphology layers with different sets yes if you want to make conflicting subdivisions.

Introducing yet another annotation is not desirable too.

Yeah, I don't think calling this all morphology is a misnomer, but that's up to the experts to decide.

Extending FoLiA to a complete programming language with all kind of operators is a bad plan imho. Things like that should be left to external tools. If you need a set that is the merge of other sets, then just define a new set.

A complete programming language would be yet another step further :) I don't want to go there either. The idea was that specific operators would be a FoLiA set definition thing, and only the notion of operators and brackets would be a FoLiA thing. But even then, distinguishing operators would only be relevant for things like deep validation, so most tools can be totally oblivious to it (just like many tools are currently oblivious to set definitions altogether and happily perform with non-existing sets). It would indeed be specialised external tools that actually deal with this information. It would make expressing more complex vocabularies possible without explicitly enumerating all possibilities (which can blow up in complexity real fast).

proycon commented 5 years ago

I've been considering this issue a bit more. Although the use for entity is by far the most common, I think we can make it more generic by adding this possibility for any kind of span annotation, provided that it is explicitly declared in the declaration with an attribute groupannotations="yes" (I previously suggested type="complex" but that probably sounds too vague). I really want to make the explicit declaration necessary rather than the default because this introduces extra complexity that parsers need to be aware of.

"Group annotations" would then be the term for inline annotations (aka token annotations) on span annotations.

One downside is that the RelaxNG schema is not strict enough to take the declarations and the groupannotations attribute into account, but then again, the RelaxNG schema is not strict enough for the current FoLiA either, and validation by a proper validation tool as provided by the FoLiA libraries is always needed.
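The kind of check that a proper validator would need on top of RelaxNG can be sketched as follows. This is a simplified Python illustration, not the logic of the actual FoLiA libraries: it only looks at entity, treats just pos and lemma as inline annotations, uses a namespace-free document, and the placement of the declaration inside an annotations element as well as the name undeclared_group_annotations are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
INLINE_TAGS = {"pos", "lemma"}  # illustrative subset of inline annotation elements

# Namespace-free mock document; {extra} is filled in below.
TEMPLATE = """
<doc>
  <annotations>
    <entity-annotation set="example-set" {extra} />
  </annotations>
  <entities>
    <entity xml:id="wg.1" class="compoundnoun">
      <wref id="w.1" />
      <pos class="N" />
    </entity>
  </entities>
</doc>
"""

def undeclared_group_annotations(root):
    """Return ids of entities that carry inline annotations without a declaration allowing it."""
    allowed = any(decl.get("groupannotations") == "yes"
                  for decl in root.iter("entity-annotation"))
    if allowed:
        return []
    return [entity.get(XML_ID) for entity in root.iter("entity")
            if any(child.tag in INLINE_TAGS for child in entity)]

declared = ET.fromstring(TEMPLATE.format(extra='groupannotations="yes"'))
undeclared = ET.fromstring(TEMPLATE.format(extra=""))
print(undeclared_group_annotations(declared))
print(undeclared_group_annotations(undeclared))
```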

proycon commented 5 years ago

This has been implemented as proposed, and documented here: https://folia.readthedocs.io/en/latest/span_annotation_category.html#group-annotations-inline-annotations-on-span-annotations