Ability to annotate elliptical elements (empty entities)

rodrigallardo commented 6 months ago

Is your feature request related to a problem? Please describe.

I'm trying to use this awesome tool for a coreference annotation project on Spanish text. Spanish is a pro-drop language, which means pronominal subjects can be dropped. Here's an example:

Pablo es un experto en artes marciales. De hecho, ø venció a Batman.

English: Pablo is an expert in martial arts. In fact, (he) beat Batman.

Here, the ø symbol marks an empty space where the pronoun "he" was dropped.

For the coreference task, I need to be able to annotate the spans "Pablo" and ø as coreferent.

Describe the solution you'd like

One possible solution would be to add the ability to highlight the space between tokens in the annotation tool and to select that space as a span or entity that can be linked to other spans or entities.

Describe alternatives you've considered

I have considered inserting special blank characters (I tried U+2000, U+2004 and U+200C) between the tokens of the text, to see whether I could then highlight them as a span or entity, but this didn't work.

I have also considered inserting non-blank placeholder characters, but that seems like a very noisy option: since an elliptical element could be anywhere, the text would end up fully covered with those tokens.

reckart commented 6 months ago

Have you tried this?

To create a zero-length annotation, hold Shift and click on the position where you wish to create the annotation. To avoid accidental creations of zero-length annotations, a simple single-click triggers no action by default. The lock to token behavior cancels the ability to create zero-length annotations.

Source: https://inception-project.github.io/releases/31.4/docs/user-guide.html#_spans

rodrigallardo commented 6 months ago

Awesome! I didn't know of that functionality.

I'm trying to use the pre-defined Coreference layer. However, these types of zero-length spans don't seem to be exported correctly into the CoNLL formats or the inline XML format. Do you know if there's a solution for this?

Also, the pre-defined Coreference layer, which I find very useful, is not being exported into the CoNLL-U format, which is now the massively accepted format for Coreference Annotation.

reckart commented 6 months ago

> I'm trying to use the pre-defined Coreference layer. However, these types of zero-length spans don't seem to be exported correctly into the CoNLL formats or the inline XML format. Do you know if there's a solution for this?

The CoNLL-U reader and writer currently do not support empty nodes. For a workaround, see below.

> Also, the pre-defined Coreference layer, which I find very useful, is not being exported into the CoNLL-U format, which is now the massively accepted format for Coreference Annotation.

The CoNLL-U reader and writer would need to be extended by a developer to support the CoNLL-U coreference information.

As a user (with some Python experience) you could probably export the data as UIMA CAS JSON, use dkpro-cassis to load the data and write a script to transform it.

However, note that certain features in INCEpTION are presently not available for chain layers (like coreference), e.g. recommenders, agreement calculation and curation. If you need any of these, you are probably better off modelling your coreference task using a combination of a custom span and relation layer. And when you do that, the built-in CoNLL-U writer won't work anyway, so you would be back again at writing a Python script.
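To give an idea of what such a script has to do, here is a minimal sketch of the transformation step only. It assumes you have already pulled the token offsets and the mention spans (including the zero-length ones, where begin equals end) out of the exported CAS, e.g. with dkpro-cassis; that loading part is left out. `Mention`, `row` and `sentence_to_conllu` are hypothetical names, and the `Entity=(...)` value in the MISC column is a simplified bracket notation loosely modelled on CorefUD, not the full format.

```python
import re
from dataclasses import dataclass


@dataclass
class Mention:
    begin: int  # character offset where the mention starts
    end: int    # character offset where it ends; begin == end for a zero-length span
    chain: str  # hypothetical coreference chain id such as "e1"


def row(node_id, form, misc="_"):
    # ID and FORM, then LEMMA..DEPS left unspecified, then MISC (10 columns).
    return "\t".join([node_id, form] + ["_"] * 7 + [misc])


def sentence_to_conllu(text, tokens, mentions):
    """Render one sentence; tokens is an ordered list of (begin, end) offsets."""
    zero = sorted((m for m in mentions if m.begin == m.end), key=lambda m: m.begin)
    spans = [m for m in mentions if m.begin < m.end]
    rows, z = [], 0
    for i, (tb, te) in enumerate(tokens, start=1):
        # Zero-length mentions at or before this token's start become empty
        # nodes numbered after the previous token: (i-1).1, (i-1).2, ...
        k = 0
        while z < len(zero) and zero[z].begin <= tb:
            k += 1
            rows.append(row(f"{i - 1}.{k}", "_", f"Entity=({zero[z].chain})"))
            z += 1
        tags = []
        for m in spans:
            # Assumption: mentions are aligned to token boundaries.
            covered = [j for j, (b, e) in enumerate(tokens, start=1)
                       if m.begin <= b and e <= m.end]
            if not covered:
                continue
            opens, closes = i == covered[0], i == covered[-1]
            if opens and closes:
                tags.append(f"({m.chain})")
            elif opens:
                tags.append(f"({m.chain}")
            elif closes:
                tags.append(f"{m.chain})")
        rows.append(row(str(i), text[tb:te], "Entity=" + "".join(tags) if tags else "_"))
    # Zero-length mentions located after the start of the last token.
    for k, m in enumerate(zero[z:], start=1):
        rows.append(row(f"{len(tokens)}.{k}", "_", f"Entity=({m.chain})"))
    return rows


text = "De hecho, venció a Batman."
# Naive tokenization for the demo only; real offsets would come from the CAS.
tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]
at = text.index("venció")
batman = text.index("Batman")
mentions = [Mention(at, at, "e1"), Mention(batman, batman + len("Batman"), "e2")]
print("\n".join(sentence_to_conllu(text, tokens, mentions)))
```

In this example the dropped subject ends up as an empty node with ID `3.1`, i.e. directly after the comma and before "venció". Sentence splitting, mentions that do not align with token boundaries, and the remaining CoNLL-U columns are not handled here.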

> I have also considered inserting non-blank placeholder characters, but that seems like a very noisy option: since an elliptical element could be anywhere, the text would end up fully covered with those tokens.

If I remember correctly, a CoNLL-U empty node has the same expressiveness as a token, i.e. you can assign it a POS tag and a lemma, and it can participate in dependency trees (or entities), etc. In INCEpTION, tokens are currently not editable. They must either be present at the time of the import (i.e. the format imported from must already define the tokens) or be generated automatically by INCEpTION. That means that even if INCEpTION did support CoNLL-U empty nodes, you would have to import a CoNLL-U file that already contains these empty nodes. Unless INCEpTION adds support for editing tokens, empty nodes could not be added during annotation.
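For illustration, this is roughly what an empty node looks like in a CoNLL-U file and how a script could pick it out. The sample rows are hand-written for this thread (not exported from INCEpTION), and `id_kind` and `empty_nodes` are hypothetical helper names. In CoNLL-U, an empty node has a decimal ID such as `3.1`, leaves HEAD and DEPREL as `_`, and is attached through the DEPS column.

```python
import re

# Hand-written fragment: the dropped subject is inserted after token 3.
SAMPLE = "\n".join([
    "3\t,\t,\tPUNCT\t_\t_\t4\tpunct\t_\t_",
    "3.1\t_\tél\tPRON\t_\t_\t_\t_\t4:nsubj\t_",
    "4\tvenció\tvencer\tVERB\t_\t_\t0\troot\t_\t_",
])


def id_kind(node_id):
    """Classify a CoNLL-U ID: plain word, multiword token range, or empty node."""
    if re.fullmatch(r"\d+", node_id):
        return "word"
    if re.fullmatch(r"\d+-\d+", node_id):
        return "multiword"
    if re.fullmatch(r"\d+\.\d+", node_id):
        return "empty"
    raise ValueError(f"unexpected CoNLL-U ID: {node_id!r}")


def empty_nodes(conllu):
    """Collect the empty nodes of a CoNLL-U string, skipping comments and blank lines."""
    nodes = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if id_kind(cols[0]) == "empty":
            nodes.append({"id": cols[0], "lemma": cols[2], "upos": cols[3], "deps": cols[8]})
    return nodes


print(empty_nodes(SAMPLE))
```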

rodrigallardo commented 6 months ago

Thanks for all the suggestions and details. This is very helpful!

I agree that the best thing might be to build a custom script to transform the exported data into CoNLL-U.

Thank you very much!

inception-project / inception

Ability to annotate elliptical elements (empty entities) #4714