goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License

Tree.reset_variables: order of incrementation #112

Closed BramVanroy closed 1 year ago

BramVanroy commented 1 year ago

I am working on serialization of AMR trees. Part of this process is sanity checking, i.e. going from an AMR string to a penman tree (tree1), to a serialization, back to a penman tree (tree2), and verifying that tree1 and tree2 are identical. This seems to work rather well for now (work in progress), but I encountered one thing that I cannot figure out.

As a test set, I am using the AMR 3.0 corpus. One of the sentences giving me issues is the following:

# ::id bolt12_64545_0526.1 ::date 2012-12-23T18:47:13 ::annotator SDL-AMR-09 ::preferred
# ::snt There are many who have a sense of urgency, quietly watching how things develop,you are dragons coiling, you are tigers crouching, I admire noble-minded patriots.
# ::save-date Fri Nov 3, 2017 ::file bolt12_64545_0526_1.txt
(m / multi-sentence
      :snt1 (m2 / many
            :ARG0-of (s / sense-01
                  :ARG1 (u / urgency)
                  :time (w / watch-01
                        :ARG0 m2
                        :ARG1 (t3 / thing
                              :manner-of (d / develop-02
                                    :ARG0 (t / thing)))
                        :manner (q / quiet-04
                              :ARG1 m2))))
      :snt2 (d2 / dragon
            :domain (y / you)
            :ARG0-of (c / coil-01))
      :snt3 (t2 / tiger
            :domain (y2 / you)
            :ARG0-of (c2 / crouch-01))
      :snt4 (a / admire-01
            :ARG0 (i / i)
            :ARG1 (p / patriot
                  :poss-of (m3 / mind
                        :mod (n / noble)))))

The issue I am encountering concerns the variable names and how they are incremented, in this case t, t2, and t3. In my deserialization code, I take a depth-first approach. So in my case, these variables would be assigned differently than in the example above: first the :ARG1 thing is encountered (t), then the deepest thing (t2), and finally tiger (t3). In the example above, however, I do not understand how the order was decided during annotation. I cannot find a sensible pattern here.

To test, I tried to use penman's Tree.reset_variables, but there I get the same result as my intuition. Here is a minimal test example:

from copy import deepcopy

import penman

penman_str = """
(m / multi-sentence
   :snt1 (m2 / many
             :ARG0-of (s / sense-01
                         :ARG1 (u / urgency)
                         :time (w / watch-01
                                  :ARG0 m2
                                  :ARG1 (t3 / thing
                                            :manner-of (d / develop-02
                                                          :ARG0 (t / thing)))
                                  :manner (q / quiet-04
                                             :ARG1 m2))))
   :snt2 (d2 / dragon
             :domain (y / you)
             :ARG0-of (c / coil-01))
   :snt3 (t2 / tiger
             :domain (y2 / you)
             :ARG0-of (c2 / crouch-01))
   :snt4 (a / admire-01
            :ARG0 (i / i)
            :ARG1 (p / patriot
                     :poss-of (m3 / mind
                                  :mod (n / noble)))))

"""

if __name__ == '__main__':
    tree = penman.parse(penman_str)
    reset_tree = deepcopy(tree)
    reset_tree.reset_variables()

    print(penman.format(tree))
    print(penman.format(reset_tree))
    assert tree == reset_tree

So my question is, really, is this a faulty annotation in the corpus, or is there no specified order for incrementing variables, or are both our implementations incorrect? I know that this is quite a general question, but the answer does have implications for the reset_variables method I believe.

goodmami commented 1 year ago

@BramVanroy apologies that this issue fell off my radar and I never gave a reply.

> is this a faulty annotation in the corpus, or is there no specified order for incrementing variables, or are both our implementations incorrect?

First, the AMRs in the LDC corpora were annotated by humans using the AMR editor. I believe the variables are created as the humans enter the nodes. So in this case, it looks like the annotator first added thing as the :ARG0 of d, thus creating t, then added tiger (t2), and finally added thing as the :ARG1 of w. That is, I don't think there is any algorithm for recreating the same variable names.

Second, (de)serialization between strings and trees is straightforward and would not result in any meaningful changes or non-meaningful reorderings. The only differences you might see, assuming the AMR is in fact well-formed to begin with, are in whitespace. It may be good to test this, e.g., as a unit test for this library, or as a simple check for well-formedness, but otherwise I'm not sure there's much utility in round-tripping through the tree structure.

The reset_variables() function is just a depth-first traversal that recreates the variable names as it goes. The AMR corpus compilers could have made their representations more predictable by using a function like reset_variables(), but, alas, they did not.
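To illustrate the depth-first renaming described above, here is a minimal sketch in plain Python. It is not penman's actual implementation: `reset_variables_sketch` and `_prefix` are hypothetical names, the nested-tuple layout is assumed from the `Tree(...)` repr that appears later in this thread, and the naming scheme (first letter of the concept, then `2`, `3`, ... for repeats) is an assumption modeled on the variables in the example AMR.

```python
def _prefix(concept):
    # Assumption: use the concept's first letter, lowercased; 'x' as a fallback.
    if isinstance(concept, str) and concept[:1].isalpha():
        return concept[0].lower()
    return 'x'


def reset_variables_sketch(node):
    """Return a copy of `node` with variables renamed in depth-first order."""
    varmap = {}  # old variable -> new variable
    counts = {}  # prefix -> number of variables already using that prefix

    def assign(node):
        # Pre-order: name this node before descending into its children.
        var, branches = node
        concept = next((tgt for role, tgt in branches if role == '/'), None)
        prefix = _prefix(concept)
        counts[prefix] = counts.get(prefix, 0) + 1
        n = counts[prefix]
        varmap[var] = prefix if n == 1 else f'{prefix}{n}'
        for _, target in branches:
            if isinstance(target, tuple):
                assign(target)

    def rename(node):
        var, branches = node
        new_branches = []
        for role, target in branches:
            if isinstance(target, tuple):
                target = rename(target)
            elif role != '/' and target in varmap:
                # Re-entrant reference to another node, e.g. ":ARG0 m2".
                target = varmap[target]
            new_branches.append((role, target))
        return (varmap[var], new_branches)

    assign(node)  # first pass: decide every new name
    return rename(node)  # second pass: apply the names everywhere


# A cut-down version of the AMR from this issue, keeping the corpus variables.
tree = ('m', [
    ('/', 'multi-sentence'),
    (':snt1', ('w', [
        ('/', 'watch-01'),
        (':ARG1', ('t3', [
            ('/', 'thing'),
            (':manner-of', ('d', [
                ('/', 'develop-02'),
                (':ARG0', ('t', [('/', 'thing')])),
            ])),
        ])),
    ])),
    (':snt3', ('t2', [('/', 'tiger')])),
])

renamed = reset_variables_sketch(tree)
print(renamed)
```

Under these assumptions, the :ARG1 thing (t3 in the corpus) becomes t, the nested thing (t) becomes t2, and tiger (t2) becomes t3, so the renamed tree no longer equals the original even though the graph is the same. The two passes are there so that a re-entrant reference is still renamed if it appears before the node it points to.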

Hope this info is useful to you despite being 4 months late!

BramVanroy commented 1 year ago

Thank you, this is indeed very useful! I settled on using reset_variables() at the start of my serialization process to make sure the order of variables is what I expect it to be, so thank you for that method as well.

To your second point: from what I have tried so far, penman is not very strict about the whitespace it accepts (which is good, I think). So I am not sure what you would like to test with a unit test there?

goodmami commented 1 year ago

> So I am not sure what you would like to test with a unit test there?

Yeah, I don't think I need more tests here, although I don't actually have one that checks for minimal or no spacing:

>>> penman.parse('(a/alpha:ARG0(b/beta))')
Tree(('a', [('/', 'alpha'), (':ARG0', ('b', [('/', 'beta')]))]))

My point was that outside of something like unit tests there doesn't seem to be much benefit in checking strings before and after parsing to a tree and formatting again.

BramVanroy commented 1 year ago

Ah, that is true. But in my case I go from penman -> tree -> serialized form -> back to tree -> penman. To assign variable names correctly, my deserializer assumes depth-first assignment. As a result, its deserialized penman can differ slightly from the input penman (as in the OP above) because t, t2, and t3 are assigned differently. But this is now solved by using reset_variables.

My case is a niche case, I am sure, but I am glad that you clarified the annotation process - it's good to keep in mind that there is no specific annotation "order" that annotators have to follow.