proycon / folia

FoLiA: Format for Linguistic Annotation - FoLiA is a rich XML-based annotation format for the representation of language resources (including corpora) with linguistic annotations. A wide variety of linguistic annotations are supported, making FoLiA a useful format for NLP tasks and data interchange. Note that the actual Python library for processing FoLiA is implemented as part of PyNLPl; this repository contains higher-level tools that use the library, as well as the full documentation, validation schemas, and set definitions.
http://proycon.github.io/folia/
GNU General Public License v3.0

Referencing implicit/empty/ghost words from span annotation #58

Closed proycon closed 5 years ago

proycon commented 5 years ago

How to refer to words/tokens that are not actually there?

Discussed as part of proycon/flat#138

proycon commented 5 years ago

It is time to pick this up again now that a lot of progress on FoLiA v2 has been made. Most of this discussion, with @luutuntin, took place in proycon/flat#138. We need an explicit mechanism to refer to tokens that do not really exist, mostly for syntactic movements.

I have a proposal: introduce a <hiddenw> element (hidden word/token) that may be used just like words/tokens (<w>) but denotes a word/token that is explicitly not part of the original text and therefore does not appear in normal text serialisation. It may, however, be a valid target for <wref>. Example, following the earlier examples in proycon/flat#138 and @luutuntin's nice tree:

[Image: syntax_tree_empty_expletive_subject]

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    <w xml:id="s.1.w.1" space="no">
        <t>Is</t>
        <pos class="BEP" />
    </w>
    <w xml:id="s.1.w.2">
        <t>n't</t>
        <pos class="NEG" />
    </w>
    <w xml:id="s.1.w.3">
        <t>a</t>
        <pos class="D" />
    </w>
    <w xml:id="s.1.w.4">
        <t>whole</t>
        <pos class="ADJ" />
    </w>
    <w xml:id="s.1.w.5">
        <t>lot</t>
        <pos class="N" />
    </w>
    <w xml:id="s.1.w.6" space="no">
        <t>left</t>
        <pos class="VAN" />
    </w>
    <w xml:id="s.1.w.7">
        <t>.</t>
        <pos class="PUNC" />
    </w>
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" />
            </su>
            <su xml:id="s.1.su.3" class="VP">
                <su xml:id="s.1.su.4" class="BEP">
                    <wref id="s.1.w.1" />
                </su>
                <su xml:id="s.1.su.5" class="NEG">
                    <wref id="s.1.w.2" />
                </su>
                <su xml:id="s.1.su.6" class="VP">
                    <su xml:id="s.1.su.7" class="NP=LGS">
                        <wref id="s.1.w.3" />
                        <su xml:id="s.1.su.8" class="ADJP">
                            <wref id="s.1.w.4" />
                        </su>
                        <wref id="s.1.w.5" />
                    </su>
                    <wref id="s.1.w.6" />
                </su>
            </su>
            <su class="PUNC">
                <wref id="s.1.w.7" />
            </su>
        </su>
    </syntax>
</s>
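
Since <hiddenw> is meant to be a valid <wref> target alongside <w>, a reference check is easy to sketch. The following is only an illustration, not part of any FoLiA library: it uses Python's standard xml.etree.ElementTree, assumes a namespace-free fragment like the one above, and the function name unresolved_wrefs and the sample IDs are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def unresolved_wrefs(sentence_xml):
    """Return the ids of <wref> elements that match no <w> or <hiddenw>."""
    root = ET.fromstring(sentence_xml)
    # Both normal and hidden tokens count as valid reference targets.
    token_ids = {
        el.get(XML_ID) for el in root.iter() if el.tag in ("w", "hiddenw")
    }
    return [
        ref.get("id")
        for ref in root.iter("wref")
        if ref.get("id") not in token_ids
    ]


# Hypothetical, shortened sentence; the last <wref> points at nothing.
SAMPLE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><pos class="EX" /></hiddenw>
    <w xml:id="s.1.w.1"><t>Is</t></w>
    <syntax>
        <su class="NP-SBJ"><wref id="s.1.w.0" /></su>
        <su class="VP"><wref id="s.1.w.1" /><wref id="s.1.w.99" /></su>
    </syntax>
</s>
"""

print(unresolved_wrefs(SAMPLE))
```

A real FoLiA document declares the FoLiA XML namespace, so the tag comparisons would need to be namespace-qualified there.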

The hidden tokens would have their own annotation type and can be bound to a set, which allows for multiple hidden tokenisation layers in case several are needed for different purposes. <hiddenw> is a structure element (albeit one that is hidden by default), so it may appear interleaved with the normal tokenisation layer. Existing expressions that operate on words should not be affected by it, but libraries do need extra code to ensure this element is skipped by default in text serialisation (and in text consistency checks). I don't want to forbid text content (<t>) in the hidden tokens, as there is probably good use for that.
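
To illustrate the "skipped by default in text serialisation" behaviour, here is a minimal sketch. It is not how any FoLiA library implements it: it assumes a namespace-free sentence whose tokens are direct children of <s>, that space="no" suppresses the following space, and the function name sentence_text and its include_hidden parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def sentence_text(sentence_xml, include_hidden=False):
    """Join the <t> content of the tokens in a sentence.

    <hiddenw> tokens are skipped unless include_hidden is True.
    """
    root = ET.fromstring(sentence_xml)
    parts = []
    for token in root:
        if token.tag not in ("w", "hiddenw"):
            continue  # ignore annotation layers such as <syntax>
        if token.tag == "hiddenw" and not include_hidden:
            continue  # hidden tokens are not part of the original text
        t = token.find("t")
        if t is None or t.text is None:
            continue  # a hidden token may carry no text at all
        parts.append(t.text)
        if token.get("space") != "no":
            parts.append(" ")
    return "".join(parts).strip()


# Hypothetical, shortened sentence with one hidden token that has text.
SAMPLE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><t>*exp*</t></hiddenw>
    <w xml:id="s.1.w.1" space="no"><t>Is</t></w>
    <w xml:id="s.1.w.2"><t>n't</t></w>
    <w xml:id="s.1.w.6" space="no"><t>left</t></w>
    <w xml:id="s.1.w.7"><t>.</t></w>
</s>
"""

print(sentence_text(SAMPLE))
print(sentence_text(SAMPLE, include_hidden=True))
```

Making the hidden text opt-in keeps existing callers unchanged while still allowing access to it when needed.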

Thoughts and comments welcome!

luutuntin commented 5 years ago

I'm happy with your proposal. I also agree that we shouldn't forbid text content (<t>) in the hidden tokens. For instance, we can use it in our example as below:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
</s>

When you mentioned that the hidden tokens would have their own annotation type, do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity), and will also have additional (hidden) token annotations such as annotation layer?

proycon commented 5 years ago

do you mean that hidden tokens can have the same token annotations as word tokens (<w>) do (such as part-of-speech, lemma, language identification, lexical semantic sense, domain, subjectivity)

Yes

and will also have additional (hidden) token annotations such as annotation layer?

I rather meant that hidden tokens would themselves be a new, specific annotation type that needs to be declared. There's no annotation layer associated with this type, as it's a structural element rather than a span element.

luutuntin commented 5 years ago

Thank you. What I mean by "additional (hidden) token annotations such as annotation layer" is, for example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer='syntax'>
    </hiddenw>
    ...
</s>
proycon commented 5 years ago

The syntax annotation layer would be embedded in <s> and refers back to <hiddenw> using the normal <wref> mechanism. If you want to make explicit that the hidden token is a syntactic one, just invent a class for it in some set, something like:

<hiddenw class="syntactic">
luutuntin commented 5 years ago

I see. So the example above should be like this, right?:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
    ...
</s>
proycon commented 5 years ago

no, like this:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class="syntax">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <annotation_layer class='syntax'>
    </hiddenw>
</s>

"syntax" is then just a custom-defined class in a set which you use for hidden tokens (you could also opt for something more specific like exp as a class, it's up to you). The set just needs to be declared in the document metadata:

 <hiddentoken-annotation set="http://wherever/the/set/definition/is/if/it/exists/at/all" />
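
A declaration check along these lines could look as follows. This is only a sketch under assumptions: the document skeleton (declarations inside <metadata><annotations>), the absence of XML namespaces, the set URL, and the function name hidden_tokens_declared are all hypothetical and chosen for illustration.

```python
import xml.etree.ElementTree as ET


def hidden_tokens_declared(doc_xml):
    """Return True unless <hiddenw> is used without a declaration."""
    root = ET.fromstring(doc_xml)
    uses_hidden = next(root.iter("hiddenw"), None) is not None
    declared = next(root.iter("hiddentoken-annotation"), None) is not None
    return declared or not uses_hidden


# Hypothetical document skeleton with the declaration present.
DECLARED = """
<FoLiA>
  <metadata>
    <annotations>
      <hiddentoken-annotation set="https://example.org/hidden-tokens" />
    </annotations>
  </metadata>
  <text>
    <s xml:id="s.1"><hiddenw xml:id="s.1.w.0" class="syntax" /></s>
  </text>
</FoLiA>
"""

# Same skeleton, but the declaration is missing.
UNDECLARED = """
<FoLiA>
  <metadata><annotations /></metadata>
  <text>
    <s xml:id="s.1"><hiddenw xml:id="s.1.w.0" class="syntax" /></s>
  </text>
</FoLiA>
"""

print(hidden_tokens_declared(DECLARED))
print(hidden_tokens_declared(UNDECLARED))
```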
luutuntin commented 5 years ago

But then we don't need <annotation_layer class='syntax'>, I assume:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0" class='syntax'>
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
        <!--annotation_layer='syntax'-->
    </hiddenw>
    ...
</s>

My next question (which is not critical now) is how we deal with the case:

proycon commented 5 years ago

Ah yes, sorry, I accidentally left in <annotation_layer class='syntax'> from copy-pasting your example. But that should indeed go away, as it is not even FoLiA. :)

when the same hidden token is used in different annotation layers (as I don't think we can have )?

Multiple classes are not allowed indeed. But you can refer to the same token from multiple span annotation layers using <wref>, that's no problem. Alternatively, you can use multiple hidden token layers in different sets (but I wouldn't really recommend that as it makes things needlessly complex).

The fact you refer to a hidden token from a syntax layer should already make clear that it has a role in syntax, so putting classes on the <hiddenw> itself may already be overkill, see the following example:

<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0">
        <t>*exp*</t>
        <desc>empty expletive subject</desc>
        <pos class="EX" />
    </hiddenw>
    ...
    <syntax>        
        <su xml:id="s.1.su.1" class="IP-MAT">
            <su xml:id="s.1.su.2" class="NP-SBJ">
                <wref id="s.1.w.0" /> <!-- reference to the hiddenw -->
            </su>
            ...
     </syntax>
      ..
     <semroles>
        <predicate xml:id="s.1.pr.1">
          <semrole xml:id="s.1.pr.1.r.1" class="AGENT">
             <wref id="s.1.w.0" /> <!-- another reference to the hiddenw -->
          </semrole>
          ...
        </predicate>
     </semroles>
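
The point that one hidden token can be referenced from several span layers can be made concrete with a small lookup. This is a sketch only, assuming a namespace-free sentence in which each annotation layer is a direct child of <s>; the function name layers_referencing is hypothetical.

```python
import xml.etree.ElementTree as ET


def layers_referencing(sentence_xml, token_id):
    """List the annotation layers whose <wref> elements point at token_id."""
    root = ET.fromstring(sentence_xml)
    layers = []
    for child in root:
        if child.tag in ("w", "hiddenw"):
            continue  # tokens themselves are not layers
        if any(ref.get("id") == token_id for ref in child.iter("wref")):
            layers.append(child.tag)
    return layers


# Hypothetical, shortened version of the sentence in the example above.
SAMPLE = """
<s xml:id="s.1">
    <hiddenw xml:id="s.1.w.0"><t>*exp*</t></hiddenw>
    <w xml:id="s.1.w.1"><t>Is</t></w>
    <syntax>
        <su xml:id="s.1.su.2" class="NP-SBJ"><wref id="s.1.w.0" /></su>
    </syntax>
    <semroles>
        <predicate xml:id="s.1.pr.1">
            <semrole xml:id="s.1.pr.1.r.1" class="AGENT">
                <wref id="s.1.w.0" />
            </semrole>
        </predicate>
    </semroles>
</s>
"""

print(layers_referencing(SAMPLE, "s.1.w.0"))
```

Deriving the role of a hidden token from the layers that reference it is what makes a class on <hiddenw> itself optional.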
luutuntin commented 5 years ago

Thank you.

luutuntin commented 5 years ago

Today I reviewed the FoLiA documentation (PDF), section 2.10.8 Corrections. I can see that, for example, an insertion could be a way to introduce a hidden word/token. Is there any reason why this would not be a good solution?

proycon commented 5 years ago

That would not be an appropriate solution for hidden words, because words inserted through a correction are not explicitly hidden words. Corrections really express "the old situation was wrong, this is how it was and this is how it should be instead", which is semantically different from what you want with hidden tokens.

luutuntin commented 5 years ago

Thank you.

proycon commented 5 years ago

This is now released as proposed with FoLiA v2.0.0 and documented here as part of the new FoLiA documentation: https://folia.readthedocs.io/en/latest/hiddentoken_annotation.html

luutuntin commented 5 years ago

Great.