goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License
139 stars 27 forks source link

Position of a node #74

Closed rafaelanchieta closed 4 years ago

rafaelanchieta commented 4 years ago

Hi Goodman.

Does the penman library have a method to return the position of each node in the graph? Ex. (w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 : ARG0 b)) want-01 - 0 boy - 0.0 go-01 - 0.1

goodmami commented 4 years ago

There isn't currently a function to do this, but it wouldn't be hard to do, and it would probably make a good addition to the library so it could work with JAMR-style alignment metadata. But you would want to target the tree and not the graph (e.g., from PENMANCodec.parse(), also see the penman.tree module), as those are tree positions. The trouble is that if the tree gets restructured from the pure graph, those alignments no longer make sense.

Depending on what you want, it would probably be as simple as the following (not tested):

def positions(node):
    variable, branches = node
    pos = []
    for i, (role, target) in enumerate(branches):
        pos.append((i, target))
        if not tree.is_atomic(target):
            pos.extend((f'{i}.{j}', tgt) for j, tgt in positions(target))
    return pos

I would welcome a PR with a Tree method that does something similar to the above, or maybe I'll have a chance to do it for the next release.

rafaelanchieta commented 4 years ago

Thank you for the quick reply. I'm trying to fix the code.

goodmami commented 4 years ago

Sounds good. There's some useful documentation of the numbering here, and #19 has further investigation and discussion of the numbering and alignment formats. I'm on the fence as to whether we should have the position start at 0 or 1 or make it a parameter. I kinda like 0-based and in depth-first order so they can be zipped with the output of Tree.nodes(). Users who want 1-based can always add 1 when creating alignment metadata, I suppose?

rafaelanchieta commented 4 years ago

I believe that the position should start at 0 to keep the compatibility with the JARM-alignment and with parsers based on it. I'll see the documentation.

goodmami commented 4 years ago

Also see this document from the 3.0 release of the AMR data. It uses 1-based indices.

https://catalog.ldc.upenn.edu/docs/LDC2020T02/AMR-alignment-format.txt

rafaelanchieta commented 4 years ago

Interesting. Are there more examples of this alignment format?

goodmami commented 4 years ago

I'm not sure. In this comment of #19 I investigated some sources and found 3 different in-situ formats (~N, ~eN, ~e.N) and two metadata formats (M-N|i+i.j and M-i.j). Penman already supports all the in-situ formats (shallowly, it just stores whatever comes between ~ and the first index as a prefix, and there's no interpretation of the indices yet, so 0-based and 1-based are both ok).

As mentioned, I'm happy if as a first pass the Tree.positions() method (or whatever it's called) only returns 0-based indices because we can add 1 when creating metadata. If we start interpreting the ::alignment metadata lines or in-situ alignments (e.g., for extracting token sequences from the ::tok line) we'll need to make up a story for interpreting the values.

goodmami commented 4 years ago

@rafaelanchieta I don't see a fork of Penman on your GitHub page so am I correct in assuming you're not working on this feature (at least not for submitting a PR)?

goodmami commented 4 years ago

I went ahead and pushed a new branch with the Tree.positions() method. I went with the starting number of 1 to follow the AMR annotation guidelines, but it takes an argument to adjust this. Here's how it works:

>>> from penman.codec import PENMANCodec
>>> c = PENMANCodec()
>>> t = c.parse('(w / want-01 :polarity - :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))')
>>> print(c.format(t))
(w / want-01
   :polarity -
   :ARG0 (b / boy)
   :ARG1 (g / go-01
            :ARG0 b))
>>> t.positions()
['1', '1.2', '1.3']
>>> for node, pos in zip(t.nodes(), t.positions()):
...   print(pos, node[0])
... 
1 w
1.2 b
1.3 g
>>> t.positions(0)
['0', '0.1', '0.2']

Note that the positions are only returned for actual nodes, so the (w :polarity -) triple increments the counter but does not end up in the list of positions. This is so the result of Tree.positions() can be zipped with Tree.nodes().

@rafaelanchieta, does this suit your needs?

rafaelanchieta commented 4 years ago

Hi @goodmami .

I downloaded the Penman project and was working on it. For that example, the (w :polarity -) triple should be counted. The method should returns ['0', '0.0, '0.1', '0.2']'

goodmami commented 4 years ago

Thanks for the feedback. If you want to submit a PR I can still merge it in (even if it gets some further adjustments) so your contribution is recognized by GitHub.

Regarding the (w :polarity -) triple, I chose to count but not return a position for those triples because if the list of positions could not zip with the nodes they would be rather useless. You would have to do something complicated to figure out which node each position went with, and that defeats the purpose of computing them in the first place.

But I just now realized that tree nodes and positions won't help with the main use case of computing the positions, which is to create or interpret alignment metadata, because when a tree is configured the alignment data is already embedded in the tree and thus not easily inspected:

>>> from penman.codec import PENMANCodec
>>> from penman import layout
>>> from penman.surface import alignments, role_alignments
>>> c = PENMANCodec()
>>> g = c.decode('''
... # ::tok I swam in the lake .
... (s / swim~e.1
...    :ARG0 (i / i~e.0)
...    :location~e.2 (l / lake~e.4))''')
>>> alignments(g)
{('s', ':instance', 'swim'): Alignment((1,), prefix='e.'), ('i', ':instance', 'i'): Alignment((0,), prefix='e.'), ('l', ':instance', 'lake'): Alignment((4,), prefix='e.')}
>>> role_alignments(g)
{('s', ':location', 'l'): RoleAlignment((2,), prefix='e.')}
>>> layout.configure(g)
Tree(('s', [('/', 'swim~e.1'), (':ARG0', ('i', [('/', 'i~e.0')])), (':location~e.2', ('l', [('/', 'lake~e.4')]))]))

So perhaps what is needed instead is a function that maps triples to tree positions. I'm not sure if this would be a Tree method or a function of penman.layout, since it is strongly related to penman.layout.configure().

rafaelanchieta commented 4 years ago

I'm not sure but, checking if the attribute-is in the attribute list could solve this issue.

goodmami commented 4 years ago

Thanks, but I don't think that's the issue. It's that alignments are associated with graph triples, but the positions are based on the tree structure. The graph, tree, and PENMAN string are all separate representations of an AMR (see here if you find that confusing). Associating alignments with tree positions thus requires some function that straddles both the graph and tree.

goodmami commented 4 years ago

Ok, I've reconsidered this and changed Tree.positions() to Tree.walk() which yields paths to branches. That is, it should yield one pair for every branch in the tree structure. This is still not exactly the same as the alignment annotations, but this is more general and should be able to be applied to alignment annotation conversion. Here's an example:

>>> import penman
>>> t = penman.parse('''
... # ::tok I swam in the lake .
... (s / swim-01~e.1
...    :ARG0 (i / i~e.0)
...    :location~e.2 (l / lake~e.4))''')
>>> for path, branch in t.walk():
...     print(path, branch)
... 
(0,) ('/', 'swim-01~e.1')
(1,) (':ARG0', ('i', [('/', 'i~e.0')]))
(1, 0) ('/', 'i~e.0')
(2,) (':location~e.2', ('l', [('/', 'lake~e.4')]))
(2, 0) ('/', 'lake~e.4')

Note that the top node doesn't have an index, because these are indices of branches. They are 0-based but the concept is included as a branch, so it looks like 1-based indices. Here's an example of how you could get most of the way to the alignment metadata format:

>>> from penman.tree import is_atomic
>>> for path, branch in t.walk():
...     if is_atomic(branch[1]):
...         if path[-1] == 0:
...             path = path[:-1]
...         path = '.'.join(map(str, (1,) + path))
...         print(path, branch[1])
... 
1 swim-01~e.1
1.1 i~e.0
1.2 lake~e.4

To get the rest of the way would require parsing the alignment info from the roles or concepts. The AlignmentMarker.from_string() function would help here. Also one would need to consider role alignments on branches whose target is not atomic.

@rafaelanchieta does this look better?

rafaelanchieta commented 4 years ago

It's great. I'll close the issue.