goodmami / penman

PENMAN notation (e.g. AMR) in Python
https://penman.readthedocs.io/
MIT License
139 stars 27 forks source link

Is there a way to restore disconnteced subgraphs? #140

Closed jkang1640 closed 5 months ago

jkang1640 commented 5 months ago

Before diving into my question, I'd like to express my sincere appreciation to the contributors of this project.

Here's my question : Is there an existing method available to extract a set of subgraphs when dealing with a disconnected graph?

import penman 
triples = [('m', ':instance', 'multi-sentence'), ('f', ':instance', 'fleet'), ('s', ':instance', 'struggle-01'), ('b', ':instance', 'boat'), ('f2', ':instance', 'fish-01'), ('f3', ':instance', 'fantôm'), ('l', ':instance', 'little'), ('c', ':instance', 'country'), ('n', ':instance', 'name'), ('e', ':instance', 'evil'), ('d', ':instance', 'dwell-01'), ('a', ':instance', 'and'), ('u', ':instance', 'unrest'), ('a2', ':instance', 'agitate-01'), ('s2', ':instance', 'show-01'), ('t', ':instance', 'they'), ('a3', ':instance', 'appear-02'), ('w', ':instance', 'wolf'), ('a4', ':instance', 'again'), ('a5', ':instance', 'and'), ('h', ':instance', 'heart'), ('p', ':instance', 'person'), ('r', ':instance', 'rob-01'), ('f4', ':instance', 'form'), ('a6', ':instance', 'arrogant'), ('m', ':snt1', 'f'), ('s', ':ARG0', 'f'), ('s', ':ARG1', 'b'), ('f2', ':instrument', 'b'), ('m', ':snt2', 'f3'), ('f3', ':mod', 'l'), ('f3', ':mod', 'c')]
g = penman.graph.Graph(triples) 
penman.encode(g) 

It will create graph successfully (#line 3) but then when I try to encode it (#line 4), it will throw an 'Disconnected graph'. Is there any way to extract a list of subgraphs, which are disconnected from each other?

The objective is to include the subgraphs for evaluation instead of discarding the whole graphs when they're disconnected.

Before embarking on coding this myself, I'm curious if such functionality already exists within the project.

Thank you very much.

flipz357 commented 5 months ago

@jkang1640

The objective is to include the subgraphs for evaluation

Correct me if I understand your problem wrongly, but evaluation does not require a connected graph.

goodmami commented 5 months ago

@jkang1640 Thanks for your question and I'm glad to hear you're finding the library useful!

It will create graph successfully (#line 3) but then when I try to encode it (#line 4), it will throw an 'Disconnected graph'.

I know this is not your question, but to be clear: There is no way in the current PENMAN notation (see the docs on Allowed Graphs) to encode a disconnected graph, so the library throws an error when trying to encode it, but there is no problem with representing a disconnected graph in the Graph data structure, so you don't get an error sooner.

Is there any way to extract a list of subgraphs, which are disconnected from each other?

Certainly, but not directly with the Penman library. Penman is really only concerned with decoding strings in PENMAN notation to graphs (through the intermediate tree form) and encoding graphs to strings. As such, it only implements a small number of graph operations to support the encoding/decoding as well as some kinds of model-specific (e.g., AMR) inspections and transformations. Finding the connected components of the graph would be better done with a dedicated graphing library function like NetworkX.connected_components. For example:

>>> import networkx as nx
>>> import penman
>>> triples = [('m', ':instance', 'multi-sentence'), ('f', ':instance', 'fleet'), ('s', ':instance', 'struggle-01'), ('b', ':instance', 'boat'), ('f2', ':instance', 'fish-01'), ('f3', ':instance', 'fantôm'), ('l', ':instance', 'little'), ('c', ':instance', 'country'), ('n', ':instance', 'name'), ('e', ':instance', 'evil'), ('d', ':instance', 'dwell-01'), ('a', ':instance', 'and'), ('u', ':instance', 'unrest'), ('a2', ':instance', 'agitate-01'), ('s2', ':instance', 'show-01'), ('t', ':instance', 'they'), ('a3', ':instance', 'appear-02'), ('w', ':instance', 'wolf'), ('a4', ':instance', 'again'), ('a5', ':instance', 'and'), ('h', ':instance', 'heart'), ('p', ':instance', 'person'), ('r', ':instance', 'rob-01'), ('f4', ':instance', 'form'), ('a6', ':instance', 'arrogant'), ('m', ':snt1', 'f'), ('s', ':ARG0', 'f'), ('s', ':ARG1', 'b'), ('f2', ':instrument', 'b'), ('m', ':snt2', 'f3'), ('f3', ':mod', 'l'), ('f3', ':mod', 'c')]
>>> g = penman.Graph(triples)
>>> # now populate a networkx graph for inspection
>>> g2 = nx.Graph()
>>> g2.add_nodes_from(g.variables())
>>> g2.add_edges_from((src, tgt) for src, _, tgt in g.edges())
>>> for cc in nx.connected_components(g2):
...     print(cc)
... 
{'s', 'f2', 'f', 'f3', 'm', 'c', 'l', 'b'}
{'u'}
{'a6'}
{'s2'}
{'e'}
{'n'}
{'p'}
{'a5'}
{'d'}
{'a'}
{'r'}
{'a4'}
{'a2'}
{'f4'}
{'a3'}
{'t'}
{'w'}
{'h'}

NetworkX's connected components operation essentially does a breadth-first search on the graph. Penman does in fact do a depth-first search in penman.model.Model.errors(), which can tell you which nodes are unreachable (i.e., disconnected) from the top node, but it won't tell you what the connected components are. In the above example it probably wouldn't matter since each non-primary component is only a single node.

The objective is to include the subgraphs for evaluation instead of discarding the whole graphs when they're disconnected.

@flipz357 is correct (depending on your evaluation methodology, I suppose) that evaluation doesn't strictly require connectivity as it generally operates on triples, but I suspect you are trying to encode to a PENMAN string so you can load it into some external evaluator, like Smatch?

jkang1640 commented 5 months ago

Thank you very much for your quick answers @flipz357 @goodmami

suspect you are trying to encode to a PENMAN string so you can load it into some external evaluator, like Smatch?

Yes, exactly! I will look more into NetworkX as your suggestions, @goodmami. Thank you very much for the code snippet. It gives me a good starting point :)

flipz357 commented 5 months ago

@jkang1640

suspect you are trying to encode to a PENMAN string so you can load it into some external evaluator, like Smatch?

Yes exactly!

Note that this is not necessary. Both Smatch and Smatch++ can be used from within python.

In my view it is very questionworthy why a disconnected graph should be possibly greatly penalized, while other (I suppose) AMR violations (e.g., an edge that doesn't exist in AMR) should be allowed.

After all, a disconnected AMR can contribute the same value to downstream application than some other "ill-formatted" AMR.

If you have disconnected graph and want to serialize, just use a tsv format.

jkang1640 commented 5 months ago

Thank you for sharing your insightful thoughts on AMR evaluation.

I was not clear in my first posting but once I have the list of subgraphs, I wanted to choose one subgraph that spans the most so that I can still evaluate that part of the graph. My current system just discards the entire graph when encoding fails and gives 0 smatch score even though a subgraph is almost complete except for small disconnected fragments.

I think I can start somewhere now with your help. Thank you!