udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

FixPunct with non-projectively embedded clauses and its delimiters #87

Closed Stormur closed 3 years ago

Stormur commented 3 years ago

It seems that the FixPunct block does not handle some punctuation marks correctly, in particular the guillemets « and ». Could it be a matter of sentence structure? Starting from the following sentence, which has a non-projective structure and all punctuation attached to the root, I get the output shown further below, with a non-projectively attached punctuation mark at token 10.

# sent_id = Mon-142
# text = Unde Phylosophus ad Nicomacum: «De hiis enim» inquit «que in passionibus et actionibus, sermones minus sunt credibiles operibus».
# reference = Liber_Primus,xiii,Paragraphus_4
1   Unde    unde    PRON    r   _   11  obl _   _
2   Phylosophus philosophus NOUN    Sms2    Case=Nom|Gender=Masc|InflClass=IndEurO|Number=Sing|Proper=Yes   11  nsubj   _   _
3   ad  ad  ADP e   AdpType=Prep    4   case    _   _
4   Nicomacum   nicomacus   PROPN   Sms2a   Case=Acc|Gender=Masc|InflClass=IndEurO|NameType=Giv|Number=Sing|Proper=Yes  11  obl _   SpaceAfter=No
5   :   :   PUNCT   Pu  _   0   punct   _   _
6   «   «   PUNCT   Pu  _   0   punct   _   SpaceAfter=No
7   De  de  ADP e   AdpType=Prep    8   case    _   _
8   hiis    hic DET ddipnb  Case=Abl|Gender=Neut|InflClass=LatPron|Number=Plur|PronType=Dem 22  obl _   _
9   enim    enim    PART    c   Emphatic=Yes    22  discourse   _   SpaceAfter=No
10  »   »   PUNCT   Pu  _   0   punct   _   _
11  inquit  inquam  VERB    va5-irs3    Aspect=Perf|InflClass=LatI2|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0   root    _   _
12  «   «   PUNCT   Pu  _   0   punct   _   SpaceAfter=No
13  que qui PRON    prepnn  Case=Nom|Gender=Neut|InflClass=LatPron|Number=Plur|PronType=Rel 8   acl:relcl   _   _
14  in  in  ADP e   AdpType=Prep    15  case    _   _
15  passionibus passio  NOUN    sfp3b   Case=Abl|Gender=Fem|InflClass=IndEurX|Number=Plur   13  obl _   _
16  et  et  CCONJ   co  _   17  cc  _   _
17  actionibus  actio   NOUN    sfp3b   Case=Abl|Gender=Fem|InflClass=IndEurX|Number=Plur   15  conj    _   SpaceAfter=No
18  ,   ,   PUNCT   Pu  _   0   punct   _   _
19  sermones    sermo   NOUN    smp3n   Case=Nom|Gender=Masc|InflClass=IndEurX|Number=Plur  22  nsubj   _   _
20  minus   parum   ADV r+  Degree=Cmp  22  advmod  _   _
21  sunt    sum AUX va5ipp3 Aspect=Imp|InflClass=LatAnom|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin  22  cop _   _
22  credibiles  credibilis  ADJ amp2nf  Case=Nom|Degree=Pos|Gender=Masc|InflClass=IndEurI|Number=Plur   11  parataxis:rep   _   _
23  operibus    opus    NOUN    snp3b   Case=Abl|Gender=Neut|InflClass=IndEurX|Number=Plur  22  obl:cmpr    _   SpaceAfter=No
24  »   »   PUNCT   Pu  _   0   punct   _   SpaceAfter=No
25  .   .   PUNCT   Pu  _   0   punct   _   _

After FixPunct:

# sent_id = Mon-142
# text = Unde Phylosophus ad Nicomacum: «De hiis enim» inquit «que in passionibus et actionibus, sermones minus sunt credibiles operibus».
# reference = Liber_Primus,xiii,Paragraphus_4
1   Unde    unde    PRON    r   _   11  obl _   _
2   Phylosophus philosophus NOUN    Sms2    Case=Nom|Gender=Masc|InflClass=IndEurO|Number=Sing|Proper=Yes   11  nsubj   _   _
3   ad  ad  ADP e   AdpType=Prep    4   case    _   _
4   Nicomacum   nicomacus   PROPN   Sms2a   Case=Acc|Gender=Masc|InflClass=IndEurO|NameType=Giv|Number=Sing|Proper=Yes  11  obl _   SpaceAfter=No
5   :   :   PUNCT   Pu  _   4   punct   _   _
6   «   «   PUNCT   Pu  _   8   punct   _   SpaceAfter=No
7   De  de  ADP e   AdpType=Prep    8   case    _   _
8   hiis    hic DET ddipnb  Case=Abl|Gender=Neut|InflClass=LatPron|Number=Plur|PronType=Dem 22  obl _   _
9   enim    enim    PART    c   Emphatic=Yes    22  discourse   _   SpaceAfter=No
10  »   »   PUNCT   Pu  _   8   punct   _   _
11  inquit  inquam  VERB    va5-irs3    Aspect=Perf|InflClass=LatI2|Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin|Voice=Act 0   root    _   _
12  «   «   PUNCT   Pu  _   13  punct   _   SpaceAfter=No
13  que qui PRON    prepnn  Case=Nom|Gender=Neut|InflClass=LatPron|Number=Plur|PronType=Rel 8   acl:relcl   _   _
14  in  in  ADP e   AdpType=Prep    15  case    _   _
15  passionibus passio  NOUN    sfp3b   Case=Abl|Gender=Fem|InflClass=IndEurX|Number=Plur   13  obl _   _
16  et  et  CCONJ   co  _   17  cc  _   _
17  actionibus  actio   NOUN    sfp3b   Case=Abl|Gender=Fem|InflClass=IndEurX|Number=Plur   15  conj    _   SpaceAfter=No
18  ,   ,   PUNCT   Pu  _   13  punct   _   _
19  sermones    sermo   NOUN    smp3n   Case=Nom|Gender=Masc|InflClass=IndEurX|Number=Plur  22  nsubj   _   _
20  minus   parum   ADV r+  Degree=Cmp  22  advmod  _   _
21  sunt    sum AUX va5ipp3 Aspect=Imp|InflClass=LatAnom|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin  22  cop _   _
22  credibiles  credibilis  ADJ amp2nf  Case=Nom|Degree=Pos|Gender=Masc|InflClass=IndEurI|Number=Plur   11  parataxis:rep   _   _
23  operibus    opus    NOUN    snp3b   Case=Abl|Gender=Neut|InflClass=IndEurX|Number=Plur  22  obl:cmpr    _   SpaceAfter=No
24  »   »   PUNCT   Pu  _   22  punct   _   SpaceAfter=No
25  .   .   PUNCT   Pu  _   11  punct   _   _

If the offending guillemet at token 10 is attached to token 9 instead, the non-projectivity disappears. It would be an unorthodox attachment, but I suspect it is the only possible one that avoids this situation.
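The claim about token 10 can be checked mechanically. The following is a minimal sketch in plain Python, not udapi's or the UD validator's actual code: it copies the HEAD column from the FixPunct output above and treats a punctuation attachment as non-projective when some token between the mark and its head is not dominated by that head. The names `HEADS`, `PUNCT`, `is_descendant`, `gap` and `nonprojective_punct` are made up for this illustration.

```python
# HEAD column of the FixPunct output above, keyed by token ID (0 = artificial root).
HEADS = {
    1: 11, 2: 11, 3: 4, 4: 11, 5: 4, 6: 8, 7: 8, 8: 22, 9: 22, 10: 8,
    11: 0, 12: 13, 13: 8, 14: 15, 15: 13, 16: 17, 17: 15, 18: 13,
    19: 22, 20: 22, 21: 22, 22: 11, 23: 22, 24: 22, 25: 11,
}
# IDs of the tokens tagged PUNCT in the same sentence.
PUNCT = [5, 6, 10, 12, 18, 24, 25]


def is_descendant(heads, node, ancestor):
    """Return True if `ancestor` lies on the path from `node` up to the root."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def gap(heads, dep):
    """Tokens strictly between `dep` and its head that the head does not dominate."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return [n for n in range(lo + 1, hi) if not is_descendant(heads, n, head)]


def nonprojective_punct(heads):
    """Map each non-projectively attached punctuation token to its gap tokens."""
    return {p: gap(heads, p) for p in PUNCT if gap(heads, p)}


after_fixpunct = nonprojective_punct(HEADS)
reattached = nonprojective_punct({**HEADS, 10: 9})  # guillemet 10 moved to token 9

print(after_fixpunct)  # {10: [9]}
print(reattached)      # {}
```

Token 9 (enim) depends on token 22, not on token 8, so it sits in the gap of the arc from 8 to 10. Once the guillemet is attached to token 9 there is nothing between it and its head, and the gap is empty.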

martinpopel commented 3 years ago

Thanks for reporting. I will look at that. It is not related to guillemets (it would be the same with other paired quotes).

Stormur commented 3 years ago

> Thanks for reporting. I will look at that. It is not related to guillemets (it would be the same with other paired quotes).

Yes, the guillemets are definitely not the cause; I don't know why I considered them so important at first. By the way, I have two other structurally identical sentences where FixPunct fails, and they involve other punctuation marks.

The problem in these sentences is that a (paratactic) clause is split into several pieces and interrupted by the verb of speech, which makes the structure non-projective (the annotation itself is correct).
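To make this concrete, here is a small sketch in plain Python (again an illustration under simplifying assumptions, not udapi code) that ignores punctuation entirely and lists the non-projective arcs among the remaining tokens of Mon-142. The head values are copied from the input sentence; `SKELETON`, `dominates` and `gap` are hypothetical names.

```python
# Heads of the non-punctuation tokens of Mon-142 (0 = artificial root).
SKELETON = {
    1: 11, 2: 11, 3: 4, 4: 11, 7: 8, 8: 22, 9: 22, 11: 0, 13: 8,
    14: 15, 15: 13, 16: 17, 17: 15, 19: 22, 20: 22, 21: 22, 22: 11, 23: 22,
}


def dominates(heads, ancestor, node):
    """Return True if `ancestor` is reached when walking from `node` to the root."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def gap(heads, dep):
    """Tokens between `dep` and its head that are not dominated by the head."""
    head = heads[dep]
    lo, hi = sorted((head, dep))
    return [n for n in heads if lo < n < hi and not dominates(heads, head, n)]


# Dependents whose incoming arc is non-projective, with the tokens in the gap.
nonprojective = {d: gap(SKELETON, d) for d in SKELETON if gap(SKELETON, d)}
print(nonprojective)  # {8: [11], 9: [11], 13: [9, 11]}
```

The verb of speech inquit (token 11) falls inside the span of the quoted clause headed by token 22 without belonging to it, and token 9 in turn interrupts the phrase headed by token 8. Punctuation attached inside such a split clause therefore has to be placed with care to stay projective.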

martinpopel commented 3 years ago

Sorry it took me so long. I applied the new version to several big treebanks and studied the differences. My original version seemed intuitively "more correct" to me because it usually attached the opening and closing punctuation to the same node, which was indeed the head of the quoted or parenthesized phrase. However, it resulted in non-projectivities, which are forbidden when strictly following the guidelines for punct. So I adapted the code.

Stormur commented 3 years ago

Thanks! I tested it again on my conllu files and now it passes all validations!