goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License
141 stars 26 forks source link

Gorn Address of a Graph Triple #91

Closed bjascob closed 4 years ago

bjascob commented 4 years ago

I have some alignment code that iterates through the triples in a graph to find the alignments between nodes (or edges) and words. This generally works and I can easily add the surface alignments (ie ~e.3) to the epidata since I already have the triple in the iteration loop. The issue is in creating the metadata alignment string (using the ISI/LDC2020T02 format). To form the string I need the Gorn address that represents the triple. I've seen #74 and I adapted the code at the end to create a lookup for instances, edges and attributes, which then could be used to find triples. Something like...

def get_addrress_dicts(tree):
    node_dict = {}
    attr_dict = {}
    edge_dict = {}
    for path, branch in tree.walk():
        # Get the node and attribute addresses
        if is_atomic(branch[1]):
            address = '.'.join(map(str, path))
            concept = branch[1]
            if concept.startswith('"'):
                attr_dict[address] = concept
            else:
                node_dict[address] = concept
        # Get the edge addresses
        if is_atomic(branch[0]) and branch[0] != '/':
            address   = '.'.join(map(str, path))
            edge_name = branch[0]
            edge_dict[address] = edge_name
    return node_dict, attr_dict, edge_dict

BTW.. the ISI convention for edges is to append a .r onto the address (ie.. 7-1.2.1.r) and use the address of the node it's attached to.

The issue is that edges and attributes names aren't unique so a lookup here by name alone, isn't sufficient. In addition, it seems like a lot of messy coding to take a graph, re-parse it into a tree, then walk it to create a bunch of dictionaries to get the addresses to match-up to the triple. Any suggestions on a better way to do this? Ideally I'd create two functions, triple_to_address() and address_to_triple() since eventually for test/eval I'll probably need to take an alignment and get the triple it represents.

In a semi-related question... are the triples ordered and specific way? Right now I'm just looping through graph.triples when searching for alignments. However, the ordering will potentially impact my alignment code for instances when there's more than one instance of a word. I think I need an ordering that follows (loosely) a depth first search.

goodmami commented 4 years ago

The issue here is that there are many ways a graph (of triples) can be arranged in a tree, so we can't really talk about tree positions until after it has been configured as such. This is why walk() is a tree method. If your alignments are to triples, I suggest adding the epidata to the graph and configuring a tree, then walking the tree to detect the in-situ alignments. You don't have to fully encode the graph to a string then reparse to a tree. The normal encoding workflow is graph -> tree -> string, so you just inspect the tree along the way. Unfortunately the functions to detect alignments on roles and atomic nodes are not part of the public API (see layout.py's _process_role() and _process_atomic()). These could be made public (in some form).

Do you need the AlignmentMarker objects, or just the alignment string (e.g., ~1, or maybe just 1)? Either way, we could probably add some function (to penman.surface?) that splits off the part after the ~, and you can use AlignmentMarker.from_string() if you need the objects. An alternative to this is to make layout.configure() take a callback function that returns the tree location of every triple as they are inserted as branches. They would not necessarily appear in order, though. This sounds like a nice, general solution, but it may also be complex and slow down processing. I don't have time at the moment to implement either of these solutions, but I'm happy to review a PR.

are the triples ordered [in a] specific way?

Yes, the order of the triples matters (and the associated epidata) only if you want to get a specific tree from a graph. In graph terms (e.g., for comparison), the order does not matter, though. If the graph has been decoded from a PENMAN string and has not been transformed, the order should be the same as a depth-first search of the tree:

>>> penman.decode('(a / alpha :ARG0 (b / beta) :polarity -)').triples
[('a', ':instance', 'alpha'), ('a', ':ARG0', 'b'), ('b', ':instance', 'beta'), ('a', ':polarity', '-')]
>>> penman.decode('(a / alpha :polarity - :ARG0 (b / beta))').triples
[('a', ':instance', 'alpha'), ('a', ':polarity', '-'), ('a', ':ARG0', 'b'), ('b', ':instance', 'beta')]
bjascob commented 4 years ago

Thanks for the great info. Getting the surface alignments by stripping the ~e.X should be fairly trivial so I don't think we need to add code to the lib for that alone. I'll try a few things and see where I end up and let you know. This part of some other work so it may be a little bit before I have something fully functional.

goodmami commented 4 years ago

Yes it's fairly trivial, but there's at least one wrinkle: the atoms can be quoted and ~ can appear inside the quotes. In this case you should look first to the last occurrence of the quote character, then look for the ~ after that.

bjascob commented 4 years ago

Just FYI in regards to the tree.walk() method, I had to make a small modification to this in order to get the addressing consistent with the LDC data. The revised function looks like

for path, branch in walk_tree(tree.node, (1,)):    #  <-- optional: change here to start at (1,)
    <code to extact nodes, attribs and edge names, etc..>

# Modified method
def walk_tree(node, path):
    var, branches = node
    for i, branch in enumerate(branches):
        if i == 0:                 # <--  change here to not add to path on initial branch
            curpath = path
        else:
            curpath = path + (i,)
        yield (curpath, branch)
        _, target = branch
        if not penman.tree.is_atomic(target):
            yield from walk_tree(target, curpath)

The decision to start at (1,) simply determines if the initial node is empty or (1,) so this is nothing significant, but I think the if i==0 statement is needed inside the walk function. I'm not positive that the original function is "wrong" but it's not consistent with how LDC did it and it gives slightly different values than what I get if I count addresses visually from the graph.

I can certainly do a quick PR for this if you want to integrate the change, however for my purposes, it's easy enough to just call the revised function in my code so I'm not sure we need to change the library if you don't think this is a bug.

goodmami commented 4 years ago

I found a wide range of conventions for these kind of alignments, with some being 1-based and others 0-based. The function is designed to be generally applicable instead of tuned to one particular convention. E.g., in Gorn addresses (according to Wikipedia) the root is not given an address and is instead indicated with *, but in AMR I often see the root node having its own address. The Tree.walk() method is meant to produce output that could be adapted to these use cases by the caller. Therefore I don't immediately see this as a bug. Perhaps Tree.walk() could take an optional parameter, but I was trying to keep things simple.

bjascob commented 4 years ago

Thanks for the help on this. I think we can close the issue since I have the alignment addresses working with the modified tree.walk() method.

In the future, if anyone is looking for the ability to turn surface alignments into LDC style alignment strings, it'll be part of the rule based aligner in amrlib.

goodmami commented 4 years ago

Ok sounds good. And thanks for the link to amrlib. It looks nice!