goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License

Linking node back to input sentence. #98

Closed: JosephGatto closed this issue 3 years ago

JosephGatto commented 3 years ago

Is there a way to identify which word in an input sentence a node/variable refers to?

goodmami commented 3 years ago

If the source data has in-situ surface alignments, then yes, although I admit it's not the most straightforward. Here is an example graph with the tokenized sentence in the metadata:

>>> import penman
>>> g = penman.decode('''
... # ::snt The cat slept .
... (s / sleep-01~3
...    :ARG0 (c / cat~2))
... ''')

The in-situ alignments are things like ~3 at the end of the concept. The penman.surface module has some functions to help with this, and the sentence is available in the metadata attribute of the graph:

>>> from penman import surface
>>> surface.alignments(g)
{('s', ':instance', 'sleep-01'): Alignment((3,)), ('c', ':instance', 'cat'): Alignment((2,))}
>>> g.metadata['snt']
'The cat slept .'

I don't (yet?) have a function to get the tokens automatically, but you can use the API to do it manually:

>>> tokens = g.metadata['snt'].split()
>>> alignments = surface.alignments(g)
>>> for triple in g.instances():
...   if triple in alignments:
...     indices = alignments[triple].indices
...   else:
...     indices = []
...   print(triple.source, '--', [tokens[i-1] for i in indices])
... 
s -- ['slept']
c -- ['cat']
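
The loop above could be wrapped in a small helper. The following is only a sketch, not part of penman: the name `aligned_tokens` and its `base` parameter are hypothetical, and it works on plain tuples so that it runs without the library. It assumes the sentence is whitespace-tokenized and that alignment indices are 1-based, as in the example above (pass `base=0` if your data counts from zero).

```python
from typing import Dict, Iterable, List, Mapping, Sequence, Tuple

Triple = Tuple[str, str, str]


def aligned_tokens(
    sentence: str,
    instances: Iterable[Sequence[str]],
    alignments: Mapping[Triple, Sequence[int]],
    base: int = 1,
) -> Dict[str, List[str]]:
    """Map each variable to the tokens its concept is aligned to.

    Hypothetical helper (not a penman function). `base` is the index of
    the first token in the alignment scheme; 1 matches the example above.
    """
    tokens = sentence.split()  # assumes a whitespace-tokenized sentence
    result: Dict[str, List[str]] = {}
    for triple in instances:
        variable = triple[0]
        # Instance triples without an alignment get an empty token list.
        indices = alignments.get(tuple(triple), ())
        result[variable] = [
            tokens[i - base]
            for i in indices
            if 0 <= i - base < len(tokens)  # skip out-of-range indices
        ]
    return result


# Plain-data stand-ins for the example graph above.
instances = [("s", ":instance", "sleep-01"), ("c", ":instance", "cat")]
alignments = {
    ("s", ":instance", "sleep-01"): (3,),
    ("c", ":instance", "cat"): (2,),
}
result = aligned_tokens("The cat slept .", instances, alignments)
print(result)
```

With a decoded graph `g`, the three inputs would come from `g.metadata['snt']`, `g.instances()`, and `{t: a.indices for t, a in surface.alignments(g).items()}`.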

Some notes:

JosephGatto commented 3 years ago

Wow, thank you for the amazing response. This is exactly what I needed. I appreciate your time!

goodmami commented 3 years ago

Glad it helped!

(Ideally this kind of information would make it into the documentation, but for now these issue comments will have to do.)