udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Multiple identical CorefMentions in `node.coref_mentions`. #106

Closed: foxik closed this issue 2 years ago

foxik commented 2 years ago

Hi,

Is it expected that the first word of a mention lists the same CorefMention twice in node.coref_mentions? To show what I mean, take the first two words of English-GUM:

1       Aesthetic       aesthetic       ADJ     JJ      Degree=Pos      2       amod    2:amod  Discourse=organization-heading:1->57:8|Entity=(e1-abstract-2--new-2-sgl
2       Appreciation    appreciation    NOUN    NN      Number=Sing     0       root    0:root  Entity=e1)

Processing them with the code

    import udapi.block.read.conllu

    data = udapi.block.read.conllu.Conllu(files=..., split_docs=True).read_documents()
    for doc in data:
        for node in doc.nodes_and_empty:
            # Print the mentions attached to each node, empty nodes included.
            print(node.coref_mentions)

returns, for example:

[<udapi.core.coref.CorefMention object at 0x14dea7b8a590>, <udapi.core.coref.CorefMention object at 0x14dea7b8a590>]
[<udapi.core.coref.CorefMention object at 0x14dea7b8a590>]

It is caused by the CorefMention being added both during construction and via an explicit node._mentions.append in these lines: https://github.com/udapi/udapi-python/blob/f3b8689bffdccd0cf608423b8f50deaee0419207/udapi/core/coref.py#L645-L650

Maybe this is expected, but I was surprised by it and could not find any mention of it in the docs.
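In the meantime, code that needs each mention only once could filter the list by object identity. This is only a caller-side sketch: `unique_by_identity` is a hypothetical helper, not part of udapi, and the stand-in objects below take the place of the list returned by node.coref_mentions.

```python
def unique_by_identity(items):
    """Return items in their original order, dropping repeats of the same object."""
    seen = set()
    result = []
    for item in items:
        # id() distinguishes objects, so two equal-looking but distinct
        # mentions would both be kept; only true repeats are dropped.
        if id(item) not in seen:
            seen.add(id(item))
            result.append(item)
    return result


# Stand-in objects (assumption); real code would pass node.coref_mentions.
first = object()
second = object()
deduplicated = unique_by_identity([first, first, second])
print(len(deduplicated))
```

Filtering by identity rather than equality matches the symptom in the output above, where both list entries are the same object at the same address.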

If it is not expected, an obvious fix is to pass add_word_backlinks=False to the mentioned CorefMention constructor call.
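To make the cause and the proposed fix concrete, here is a toy model of the pattern. It is not udapi's code: `Node`, `Mention` and `open_mention` are hypothetical stand-ins, and only the parameter name add_word_backlinks is taken from the constructor call mentioned above.

```python
class Node:
    """Toy stand-in for a word node (hypothetical, not udapi's class)."""

    def __init__(self, form):
        self.form = form
        self.mentions = []  # backlinks to the mentions containing this word


class Mention:
    """Toy stand-in for a mention (hypothetical, not udapi's class)."""

    def __init__(self, words, add_word_backlinks=True):
        self.words = list(words)
        if add_word_backlinks:
            # The constructor registers the mention on each of its words.
            for word in self.words:
                word.mentions.append(self)


def open_mention(first_word, add_word_backlinks):
    """Mimic a loader that builds a mention and then appends it explicitly."""
    mention = Mention([first_word], add_word_backlinks=add_word_backlinks)
    first_word.mentions.append(mention)  # explicit backlink
    return mention


# Default behaviour: constructor backlink plus explicit append.
buggy = Node("Aesthetic")
open_mention(buggy, add_word_backlinks=True)
print(len(buggy.mentions))  # 2

# Proposed fix: let only the explicit append register the mention.
fixed = Node("Aesthetic")
open_mention(fixed, add_word_backlinks=False)
print(len(fixed.mentions))  # 1
```

Under these assumptions, the word ends up with the same mention object twice whenever both registration paths run, and disabling the constructor's backlinks leaves exactly one entry.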

Thanks & cheers!

martinpopel commented 2 years ago

Yes, this was a bug. Thanks for spotting it and exactly identifying the cause.