Closed (dan-zeman closed this 2 years ago)
In the problematic run (with the conversion blocks), when iterating over coreference mentions in corefud.MergeSameSpan, only two mentions are found. These are multi-word mentions; single-word mentions are invisible.
2022-02-10 11:37:07,084 [ INFO] process_tree - mA: CESSCASTP20020501120s5.sn.47 ['24:un', '25:25%', '26:nuestro']
2022-02-10 11:37:07,084 [ INFO] process_tree - mB: CESSCASTP20020501120s5.sn.34 ['17:un', '18:75%', '19:a', '20:favor', '21:de', '22:ellos']
I'll investigate it. Thanks for reporting.
The block corefud.MergeSameSpan should be relatively independent of what the other blocks do. It collects the mentions in a sentence (asking for node.coref_mentions at every node), then extracts the set of words for each mention m (set(m.words)), and looks for pairs of mentions that span the same set of words. Nevertheless, there is a mysterious bug that is triggered when this block is combined with the conversion from the old CorefUD format. Consider this sentence, and especially line 26, nuestro:

There are two single-word mention annotations on that line, both coming from the original data. One of them (c3) is coreferential with the empty subject 10.1. It is also coreferential with 6 mentions in other sentences. The other (CESS-CAST-P-20020501-120-s5.p.54) is a singleton, and it probably appeared there because of some named entity annotation at another (higher or lower) constituent.
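For reference, the span comparison that MergeSameSpan is described as doing can be sketched in isolation, without Udapi. This is only an illustration of the idea, not the block's actual code: the Mention class and the same_span_pairs function are hypothetical stand-ins, the entity ids are shortened, and the word numbers are taken from the log lines and from the two annotations on line 26.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mention:
    """Hypothetical stand-in for a coreference mention (not the Udapi class)."""
    entity_id: str
    words: tuple  # word numbers within the sentence, e.g. (24, 25, 26)


def same_span_pairs(mentions):
    """Return all pairs of mentions that cover exactly the same set of words."""
    by_span = defaultdict(list)
    for m in mentions:
        # frozenset is hashable, so it can serve as the grouping key
        by_span[frozenset(m.words)].append(m)
    pairs = []
    for group in by_span.values():
        pairs.extend(combinations(group, 2))
    return pairs


# Illustrative data: the two multi-word mentions from the log,
# plus the two single-word mentions on line 26 (ids shortened).
mentions = [
    Mention("sn.47", (24, 25, 26)),
    Mention("sn.34", (17, 18, 19, 20, 21, 22)),
    Mention("c3", (26,)),
    Mention("p.54", (26,)),
]
pairs = same_span_pairs(mentions)
print([(a.entity_id, b.entity_id) for a, b in pairs])  # [('c3', 'p.54')]
```

If all four mentions are visible, the c3/p.54 pair is found. If only the two multi-word mentions are visible, as in the problematic run, the function returns an empty list, which matches the symptom that nothing gets merged.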
Now when I run this scenario:
the two mentions on line 26 are not merged:
However, when I save the file and re-read it with another Udapi process, the block succeeds in merging the spans (unfortunately it picks the second mention as the survivor and thus breaks coreference with 10.1 and with the antecedents in other sentences, but that's another issue):