udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Corefud.MoveHead seems to work non-deterministically #100

Closed: dan-zeman closed this issue 2 years ago

dan-zeman commented 2 years ago

I routinely use git diff to check whether the most recent change in processing had the intended impact on the data. Recently I noticed a number of spurious mention head changes after every change I made, and they had nothing obvious to do with what I had changed in the conversion code. So I added corefud.MoveHead to the scenario, but the problem is still there.

To test it, I ran the same scenario on the same input (cs_pcedt-ud-dev.conllu) three times in a row. The output of the first two runs was identical (only the second one is shown below), but the output of the third one was different.

[20:09:20]sol5:/net/work/people/zeman/hamledt/normalize/cs-pcedt(master)> udapy -s read.OldCorefUD corefud.FixInterleaved corefud.MergeSameSpan corefud.MoveHead < cs_pcedt-ud-dev.conllu > /net/work/people/zeman/unidep/UD_Czech-PCEDT/cs_pcedt-ud-dev.conllu
2022-02-08 20:10:22,165 [   INFO] execute -  ---- ROUND ----
2022-02-08 20:10:22,165 [   INFO] execute - Executing block read.OldCorefUD
2022-02-08 20:10:26,406 [   INFO] execute - Executing block corefud.FixInterleaved
2022-02-08 20:10:26,656 [   INFO] execute - Executing block corefud.MergeSameSpan
2022-02-08 20:10:26,872 [   INFO] execute - Executing block corefud.MoveHead
2022-02-08 20:10:26,994 [   INFO] execute - Executing block write.Conllu
2022-02-08 20:10:30,075 [   INFO] process_end - corefud.MoveHead overview of mentions:
2022-02-08 20:10:30,076 [   INFO] process_end -            total =  24968 (100.0%)
2022-02-08 20:10:30,076 [   INFO] process_end -      single-word =  12451 ( 49.9%)
2022-02-08 20:10:30,076 [   INFO] process_end -          treelet =  10119 ( 40.5%)
2022-02-08 20:10:30,076 [   INFO] process_end -     treelet-kept =   9916 ( 39.7%)
2022-02-08 20:10:30,076 [   INFO] process_end -       nontreelet =   2059 (  8.2%)
2022-02-08 20:10:30,076 [   INFO] process_end -  nontreelet-kept =   1697 (  6.8%)
2022-02-08 20:10:30,076 [   INFO] process_end - nontreelet-moved =    362 (  1.4%)
2022-02-08 20:10:30,076 [   INFO] process_end -            gappy =    339 (  1.4%)
2022-02-08 20:10:30,076 [   INFO] process_end -      gappy-moved =    267 (  1.1%)
2022-02-08 20:10:30,076 [   INFO] process_end -    treelet-moved =    203 (  0.8%)
2022-02-08 20:10:30,076 [   INFO] process_end -       gappy-kept =     72 (  0.3%)
[20:10:30]sol5:/net/work/people/zeman/hamledt/normalize/cs-pcedt(master)> udapy -s read.OldCorefUD corefud.FixInterleaved corefud.MergeSameSpan corefud.MoveHead < cs_pcedt-ud-dev.conllu > /net/work/people/zeman/unidep/UD_Czech-PCEDT/cs_pcedt-ud-dev.conllu
2022-02-08 20:11:38,993 [   INFO] execute -  ---- ROUND ----
2022-02-08 20:11:38,993 [   INFO] execute - Executing block read.OldCorefUD
2022-02-08 20:11:43,347 [   INFO] execute - Executing block corefud.FixInterleaved
2022-02-08 20:11:43,557 [   INFO] execute - Executing block corefud.MergeSameSpan
2022-02-08 20:11:43,769 [   INFO] execute - Executing block corefud.MoveHead
2022-02-08 20:11:43,914 [   INFO] execute - Executing block write.Conllu
2022-02-08 20:11:47,032 [   INFO] process_end - corefud.MoveHead overview of mentions:
2022-02-08 20:11:47,032 [   INFO] process_end -            total =  24968 (100.0%)
2022-02-08 20:11:47,032 [   INFO] process_end -      single-word =  12451 ( 49.9%)
2022-02-08 20:11:47,032 [   INFO] process_end -          treelet =  10119 ( 40.5%)
2022-02-08 20:11:47,032 [   INFO] process_end -     treelet-kept =   9916 ( 39.7%)
2022-02-08 20:11:47,032 [   INFO] process_end -       nontreelet =   2059 (  8.2%)
2022-02-08 20:11:47,032 [   INFO] process_end -  nontreelet-kept =   1697 (  6.8%)
2022-02-08 20:11:47,032 [   INFO] process_end - nontreelet-moved =    362 (  1.4%)
2022-02-08 20:11:47,032 [   INFO] process_end -            gappy =    339 (  1.4%)
2022-02-08 20:11:47,033 [   INFO] process_end -      gappy-moved =    268 (  1.1%)
2022-02-08 20:11:47,033 [   INFO] process_end -    treelet-moved =    203 (  0.8%)
2022-02-08 20:11:47,033 [   INFO] process_end -       gappy-kept =     71 (  0.3%)

Here are the git diffs of the results (there was no commit in the meantime, so both diffs are against the same base):

[20:09:34]zen:/net/work/people/zeman/unidep/UD_Czech-PCEDT(dev *)> git diff cs_pcedt-ud-dev.conllu
diff --git a/cs_pcedt-ud-dev.conllu b/cs_pcedt-ud-dev.conllu
index 01e7a91..7607889 100644
--- a/cs_pcedt-ud-dev.conllu
+++ b/cs_pcedt-ud-dev.conllu
@@ -75070,9 +75070,9 @@
 # orig_file_sentence wsj0118#131
 1      To      ten     DET     PDNS1---------- Case=Nom|Gender=Neut|Number=Sing|PronType=Dem   2       nsubj   2:nsubj Entity=(wsj0118001c173--1-gstype:spec)|MentionHead=1|MentionText=To
 1.1    on      #PersPron       PRON    _       Case=Nom|Number=Sing|Person=3|PronType=Prs      _       _       2:nsubj Entity=(wsj0118001c173--1-gstype:spec)|Functor=ACT|MentionHead=1.1|MentionText=on
-1.2    někoho  #PersPron       PRON    _       Case=Acc|PronType=Prs   _       _       2:obj   Entity=(wsj0118001c174[1/3]--4-gstype:spec)|Functor=PAT|MentionHead=1.2,12,17|MentionText=někoho mnohem menší, než o jaký usiluje většina tradičních sběračů akcií jako hlavní cíl své práce
+1.2    někoho  #PersPron       PRON    _       Case=Acc|PronType=Prs   _       _       2:obj   Entity=(wsj0118001c174[1/3]--1-gstype:spec)|Functor=PAT|MentionHead=1.2,12,17|MentionText=někoho mnohem menší, než o jaký usiluje většina tradičních sběračů akcií jako hlavní cíl své práce
 2      znamená znamenat        VERB    VB-S---3P-AAI-- Mood=Ind|Number=Sing|Person=3|Polarity=Pos|Tense=Pres|VerbForm=Fin|Voice=Act    0       root    0:root  _
-3      velmi   velmi   ADV     Db------------- _       4       advmod  4:advmod        Entity=(wsj0118001c174[2/3]--4-gstype:spec
+3      velmi   velmi   ADV     Db------------- _       4       advmod  4:advmod        Entity=(wsj0118001c174[2/3]--1-gstype:spec
 4      malý    malý    ADJ     AAIS4----1A---- Animacy=Inan|Case=Acc|Degree=Pos|Gender=Masc|Number=Sing|Polarity=Pos   5       amod    5:amod  _
 5      zisk    zisk    NOUN    NNIS4-----A---- Animacy=Inan|Case=Acc|Gender=Masc|Number=Sing|Polarity=Pos      2       obj     2:obj   MentionHead=5|MentionText=velmi malý zisk "navíc
 6      "       "       PUNCT   Z:------------- _       5       punct   5:punct SpaceAfter=No
@@ -75080,7 +75080,7 @@
 8      "       "       PUNCT   Z:------------- _       5       punct   5:punct SpaceAfter=No
 9      ,       ,       PUNCT   Z:------------- _       2       punct   2:punct _
 10     bezesporu       bezesporu       PART    TT------------- _       12      advmod  12:advmod       _
-11     mnohem  mnohem  ADV     Db------------- _       12      advmod  12:advmod       Entity=(wsj0118001c174[3/3]--4-gstype:spec
+11     mnohem  mnohem  ADV     Db------------- _       12      advmod  12:advmod       Entity=(wsj0118001c174[3/3]--1-gstype:spec
 12     menší   malý    ADJ     AAIS4----2A---- Animacy=Inan|Case=Acc|Degree=Cmp|Gender=Masc|Number=Sing|Polarity=Pos   2       dep     2:dep   SpaceAfter=No
 13     ,       ,       PUNCT   Z:------------- _       17      punct   17:punct        _
 14     než     než     SCONJ   J,------------- _       17      mark    17:mark LId=než-2
[20:11:04]zen:/net/work/people/zeman/unidep/UD_Czech-PCEDT(dev *)> git diff cs_pcedt-ud-dev.conllu
diff --git a/cs_pcedt-ud-dev.conllu b/cs_pcedt-ud-dev.conllu
index 01e7a91..c1ad519 100644
--- a/cs_pcedt-ud-dev.conllu
+++ b/cs_pcedt-ud-dev.conllu
@@ -36181,7 +36181,7 @@
 # sent_id = wsj0071-001-p1s30
 # text = Některá mladší vína, dokonce i ta za 90 až 100 dolarů za láhev, jsou téměř zadarmo."
 # orig_file_sentence wsj0071#31
-1      Některá některý DET     PZNP1---------- Case=Nom|Gender=Neut|Number=Plur|PronType=Ind   3       det     3:det   Entity=(wsj0071001c30--7-gstype:spec
+1      Některá některý DET     PZNP1---------- Case=Nom|Gender=Neut|Number=Plur|PronType=Ind   3       det     3:det   Entity=(wsj0071001c30--3-gstype:spec
 2      mladší  mladý   ADJ     AANP1----2A---- Case=Nom|Degree=Cmp|Gender=Neut|Number=Plur|Polarity=Pos        3       amod    3:amod  _
 3      vína    víno    NOUN    NNNP1-----A---- Case=Nom|Gender=Neut|Number=Plur|Polarity=Pos   16      nsubj   16:nsubj        MentionHead=3,5,6|MentionText=Některá mladší vína, dokonce i|SpaceAfter=No
 4      ,       ,       PUNCT   Z:------------- _       3       punct   3:punct _
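
A comparison like the one above can also be scripted without going through git. The following is only a minimal sketch, not part of udapi: the helper names output_digests and is_deterministic are made up for illustration, and the sketch assumes the scenario reads from stdin and writes to stdout, as the udapy commands above do. It runs the same command several times on the same input and compares SHA-256 digests of the outputs.

```python
import hashlib
import subprocess


def output_digests(cmd, input_bytes, runs=3):
    """Run cmd `runs` times on the same stdin; return the SHA-256 of each stdout.

    Hypothetical helper for illustration, not a udapi function.
    """
    digests = []
    for _ in range(runs):
        result = subprocess.run(
            cmd, input=input_bytes, stdout=subprocess.PIPE, check=True
        )
        digests.append(hashlib.sha256(result.stdout).hexdigest())
    return digests


def is_deterministic(cmd, input_bytes, runs=3):
    """True if every run produced byte-identical output."""
    return len(set(output_digests(cmd, input_bytes, runs))) == 1


# Possible usage with the scenario from this issue (requires udapy on PATH):
#
#   scenario = ["udapy", "-s", "read.OldCorefUD", "corefud.FixInterleaved",
#               "corefud.MergeSameSpan", "corefud.MoveHead"]
#   with open("cs_pcedt-ud-dev.conllu", "rb") as f:
#       print(is_deterministic(scenario, f.read()))
```

Dropping blocks from the scenario one at a time and repeating the check would be one way to narrow down which block introduces the difference.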

dan-zeman commented 2 years ago

Actually, I now see that the non-determinism may be caused by the other blocks I have in the scenario. Still investigating.

martinpopel commented 2 years ago

I don't think corefud.MoveHead is non-deterministic. It just conservatively keeps the original head if it is one of the acceptable heads in the enhanced graph (and there was no head in the basic dependencies).

dan-zeman commented 2 years ago

> I don't think corefud.MoveHead is non-deterministic. It just conservatively keeps the original head if it is one of the acceptable heads in the enhanced graph (and there was no head in the basic dependencies).

So is there a way to say that the original head should not play any role? Or should I always insert this before corefud.MoveHead?

util.Eval node='for m in node.coref_mentions: m.head = m.words[0]'

martinpopel commented 2 years ago

So now both options are possible: either the util.Eval code or corefud.MoveHead keep_head_if_possible=0.
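
The effect of the two settings can be modelled roughly as follows. This is a hypothetical simplification and not the actual corefud.MoveHead code: the function choose_head and the word labels are invented for illustration, and the list of acceptable heads is simply given, whereas the real block derives it from the dependency structure. Falling back to the first candidate is also an assumption.

```python
def choose_head(acceptable_heads, original_head, keep_head_if_possible=True):
    """Pick a mention head from a non-empty list of acceptable candidates.

    Hypothetical model for illustration only; it does not reproduce
    how udapi computes or ranks the candidates.
    """
    if not acceptable_heads:
        raise ValueError("need at least one acceptable head")
    if keep_head_if_possible and original_head in acceptable_heads:
        # Conservative behaviour: the head stored in the input survives.
        return original_head
    # Otherwise the result depends only on the candidates (assumed: first one).
    return acceptable_heads[0]


candidates = ["w3", "w5", "w6"]  # made-up word labels
kept = choose_head(candidates, original_head="w5")
moved = choose_head(candidates, original_head="w5", keep_head_if_possible=False)
print(kept, moved)
```

In this model, with the default setting the output depends on whatever head the input already carries, so two inputs that differ only in their stored heads can give different results. With keep_head_if_possible switched off, the stored head plays no role.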

dan-zeman commented 2 years ago

Great, thanks!