udapi / udapi-python

Python framework for processing Universal Dependencies data
GNU General Public License v3.0

Possible wrong #92

Open wellington36 opened 3 years ago

wellington36 commented 3 years ago

When running the following command

cat *.conllu | udapy -q util.Eval node='if (node.upos == "ADJ" and node.deprel == "amod" and node.parent.upos == "NOUN" and (node.feats["Gender"] != node.parent.feats["Gender"] or node.feats["Number"] != node.parent.feats["Number"])): node.parent.parent.draw(attributes="form,upos,feats,deprel")'

From the output we get

# sent_id = CP458-6#7
# text = de normas sociais einversa e complementarmente, práticas sociais que avaliam do grau de integração de cada um
 ╭─╼ de ADP _ case
─┾ normas NOUN Gender=Fem|Number=Plur obl
 ┡─╼ sociais ADJ Gender=Fem|Number=Plur amod
 │ ╭─╼ e CCONJ _ cc
 │ ┢─┮ inversa ADJ Gender=Fem|Number=Sing amod
 │ │ │ ╭─╼ e CCONJ _ cc
 │ │ ╰─┶ complementarmente ADV _ conj
 │ ┢─╼ , PUNCT _ punct
 ╰─┾ práticas NOUN Gender=Fem|Number=Plur conj
   ┡─╼ sociais ADJ Gender=Fem|Number=Plur amod
   │ ╭─╼ que PRON Gender=Fem|Number=Plur|PronType=Rel nsubj
   ╰─┾ avaliam VERB Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin acl:relcl
     │ ╭─╼ de ADP _ case
     │ ┢─╼ o DET Definite=Def|Gender=Masc|Number=Sing|PronType=Art det
     ╰─┾ grau NOUN Gender=Masc|Number=Sing obj
       │ ╭─╼ de ADP _ case
       ╰─┾ integração NOUN Gender=Fem|Number=Sing nmod
         │ ╭─╼ de ADP _ case
         ╰─┾ cada DET Gender=Masc|Number=Sing nmod
           ╰─╼ um NUM NumType=Card fixed

However, this "text" value does not match the one in the corresponding CoNLL-U file:

text = Uma verdade subjectiva incorporada através de normas sociais e, inversa e complementarmente, práticas sociais que avaliam do grau de integração de cada um.

Udapy replaced "e, inversa" with "einversa".

martinpopel commented 3 years ago

node.draw() prints the subtree rooted in node. With the default setting print_text=True, the # text value likewise contains only the word forms of that subtree, not of the whole tree. In your case, you printed the subtree rooted in "normas", but the comma between "e" and "inversa" is not part of that subtree. The node "e" has SpaceAfter=No in the MISC column, so it is printed without a space after it. I admit it may be better to ignore SpaceAfter=No when printing a subtree with a "gap" at that position. A PR is welcome.
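
The effect described above, and the suggested fix, can be sketched without Udapi. This is only an illustration: the Token class, the two functions, and the token positions are hypothetical and are not Udapi's actual implementation. It assumes each token knows its position in the full sentence, so a "gap" is any pair of printed tokens that are not neighbours in the sentence.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a node: not Udapi's Node class."""
    ord: int                  # position of the token in the full sentence
    form: str
    space_after: bool = True  # False models SpaceAfter=No in MISC


def naive_text(tokens):
    """Join forms, honouring SpaceAfter=No even across gaps."""
    parts = []
    for tok in tokens:
        parts.append(tok.form)
        if tok.space_after:
            parts.append(" ")
    return "".join(parts).rstrip()


def gap_aware_text(tokens):
    """Join forms, ignoring SpaceAfter=No when the next printed token
    is not the next token of the sentence (i.e. there is a gap)."""
    parts = []
    for tok, nxt in zip(tokens, tokens[1:] + [None]):
        parts.append(tok.form)
        adjacent = nxt is not None and nxt.ord == tok.ord + 1
        if tok.space_after or not adjacent:
            parts.append(" ")
    return "".join(parts).rstrip()


# Part of the subtree from the issue; the ord values are illustrative.
# The comma at position 10 is missing because it is outside the subtree.
subtree = [
    Token(8, "sociais"),
    Token(9, "e", space_after=False),
    Token(11, "inversa"),
    Token(12, "e"),
    Token(13, "complementarmente", space_after=False),
    Token(14, ","),
    Token(15, "práticas"),
]

print(naive_text(subtree))      # fuses "e" and "inversa"
print(gap_aware_text(subtree))  # keeps them apart
```

The second comma still attaches to "complementarmente" in both versions, because those two tokens are adjacent in the sentence and both belong to the subtree.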

For your purposes, it may be better to print the whole tree with the given node (i.e. the ADJ node, not its grandparent) highlighted:

cat *.conllu | udapy -TMA util.Mark node='node.upos == "ADJ" and node.deprel == "amod" and node.parent.upos == "NOUN" and (node.feats["Gender"] != node.parent.feats["Gender"] or node.feats["Number"] != node.parent.feats["Number"])' | less -R

Another solution would be to keep using util.Eval, but print the whole tree with node.root.draw(). You can additionally highlight any node with node.misc["Mark"]=1 (which is what util.Mark does internally), but util.Eval uses Python eval() which takes a single Python expression, so you would need to convert the util.Eval one-liner into a full Udapi block.
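
A minimal sketch of what such a block might look like, assuming the usual Udapi pattern of subclassing udapi.core.block.Block and overriding process_tree. The names MarkAgreementMismatch, agreement_mismatch and fake_node are hypothetical, and the import fallback exists only so that the predicate can be tried without Udapi installed. How to make the block loadable by udapy is not covered here.

```python
from collections import defaultdict
from types import SimpleNamespace

try:
    from udapi.core.block import Block
except ImportError:  # fallback so the predicate can be tried without Udapi
    Block = object


def agreement_mismatch(node):
    """Same condition as in the one-liner from the issue."""
    return (
        node.upos == "ADJ"
        and node.deprel == "amod"
        and node.parent.upos == "NOUN"
        and (
            node.feats["Gender"] != node.parent.feats["Gender"]
            or node.feats["Number"] != node.parent.feats["Number"]
        )
    )


class MarkAgreementMismatch(Block):
    """Hypothetical block: mark mismatching ADJ nodes, draw the whole tree once."""

    def process_tree(self, tree):
        marked = False
        for node in tree.descendants:
            if agreement_mismatch(node):
                node.misc["Mark"] = 1  # the same mark that util.Mark sets
                marked = True
        if marked:
            tree.draw(attributes="form,upos,feats,deprel")


def fake_node(upos, deprel, parent=None, **feats):
    """Hypothetical stand-in for a Udapi node; missing features read as ''."""
    return SimpleNamespace(upos=upos, deprel=deprel, parent=parent,
                           feats=defaultdict(str, feats), misc={})


# Stand-ins for "práticas" and two adjectives attached to it.
noun = fake_node("NOUN", "conj", Gender="Fem", Number="Plur")
sociais = fake_node("ADJ", "amod", noun, Gender="Fem", Number="Plur")
inversa = fake_node("ADJ", "amod", noun, Gender="Fem", Number="Sing")

print(agreement_mismatch(sociais), agreement_mismatch(inversa))
```

Drawing from process_tree rather than per node means a tree with several mismatches is printed only once, with every offending adjective marked.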

arademaker commented 3 years ago

Thank you, @martinpopel, for your detailed explanation. @wellington36 is working with me, and I hope he can eventually contribute to udapi.