Closed BramVanroy closed 1 year ago
@BramVanroy apologies that this issue fell off my radar and I never gave a reply.
is this a faulty annotation in the corpus, or is there no specified order for incrementing variables, or are both our implementations incorrect?
First, the AMRs in the LDC corpora were annotated by humans using the AMR editor. I believe the variables are created as the humans enter the nodes. So in this case, it looks like the annotator first added thing as the :ARG0
of d
, thus creating t
, then added tiger (t2
), and finally added thing as the :ARG1
of w
. That is, I don't think there is any algorithm for recreating the same variable names.
Second, (de)serialization between strings and trees and trees is straightforward and would not result in any meaningful changes or non-meaningful reorderings. The only differences you might see, assuming the AMR is in fact well-formed to begin with, is with whitespace. It may be good to test this, e.g., as a unit test for this library, or as a simple check for well-formedness, but otherwise I'm not sure there's much utility for round-tripping through the tree structure.
The reset_variables()
function is just a depth-first traversal that recreates the variable names as it goes. The AMR corpus compilers could have made their representations more predictable by using a function like reset_variables()
, but, alas, they did not.
Hope this info is useful to you despite being 4 months late!
Thank you, this is indeed very useful! I settled on using reset_variables()
at the start of my serialization process just to make sure the order of variables is what I expect it to be, so thank you for that method it is very useful!
To your second point, from what I have tried so far, penman
is not very strict in which whitespaces it allows (which is good, I think). So I am not sure what you would like to test with a unit test there?
So I am not sure what you would like to test with a unit test there?
Yeah I don't think I need more tests here, although I don't actually have one to check for minimal or no spacing:
>>> penman.parse('(a/alpha:ARG0(b/beta))')
Tree(('a', [('/', 'alpha'), (':ARG0', ('b', [('/', 'beta')]))]))
My point was that outside of something like unit tests there doesn't seem to be much benefit in checking strings before and after parsing to a tree and formatting again.
Ah, that is true. But in my case I go from penman -> tree -> serialized form -> back to tree -> penman. And to assign variable names correctly, my deserializer assumes depth first assignment. As such, it will find that (as in the OP above) its deserialized penman can be slightly different than the input penman because the t, t2, t3 are differently assigned. But this is solved by using reset_variables
now.
My case is a niche case, I am sure but I am glad that you clarified the annotation process - it's good to keep in mind that there is no specific annotation "order" that annotators have to follow.
I am working on serialization of AMR trees. Part of this process is sanity checking, i.e. going from an AMR string to a penman tree1, to a serialization, back to a penman tree2 and verifying that the tree1 and tree2 are identical. This seems to work rather well for now (wip) but I encountered one thing that I cannot seem to figure out.
As a test set, I am using the AMR 3.0 corpus. One of the sentences that is giving me issues, is the following one:
The issue that I am encountering is the variable names and how they are incremented, in this case the
t0...3
. In my deserialization code, I take a depth-first approach. So in my case, the t3 and t2 in the example above would be switched: first the ARG1thing
is encountered, then the deepestthing
, and finallytiger
. In the example above, however, I do not understand how this order has been decided in annotation. I cannot seem to find a sensible pattern here.To test, I tried to use penman's
Tree.reset_variables
, but there I get the same result as my intuition. Here is a minimal test example:So my question is, really, is this a faulty annotation in the corpus, or is there no specified order for incrementing variables, or are both our implementations incorrect? I know that this is quite a general question, but the answer does have implications for the
reset_variables
method I believe.