marrink-lab / vermouth-martinize

Describe and apply transformation on molecular structures and topologies
Apache License 2.0
86 stars 38 forks source link

slow subgraph #446

Open fgrunewald opened 2 years ago

fgrunewald commented 2 years ago

it appears nx.subgraph is rather slow, which directly affects a bunch of processes in martinize2. Mostly it is used when making the residue graph. Depending on the complexity of the residues a dumb implementation like shown below is about 5-10 times faster I assume because it scales much better than whatever nx does.

Why do you then not simply implement that you may ask: Something something ISMAGS now dies.

def _custom_subgraph(graph, nodes):
    """
    Make a custom subgraph copy of nodes from graph.
    """
    subgraph = nx.Graph()
    for node in nodes:
        subgraph.add_node(node, **graph.nodes[node])
        for neighbor in graph.edges(node):
            if neighbor in nodes:
                subgraph.add_edge(node, neighbor)
    return subgraph
pckroon commented 1 year ago

This simple implementation doesn't work, because add_edge adds the node neighbour without attributes. My guess is that this is faster since you don't iterate over all the edges in graph, but only over edges connected to nodes. You could try removing the subgraph.add_node, and add another loop over nodes at the end to set all the node attributes.

fgrunewald commented 1 year ago

@pckroon not sure I understand. If node becomes neighbour doesn't this subgraph.add_node(node, **graph.nodes[node]) add the attributes needed? For polyply it seems to work and most martinis processors as well except the ISMAGS

pckroon commented 1 year ago

I'm not completely sure that will do what you want. Would require some experimentation. Oh, and you lose all edge attributes, that could also be it.

fgrunewald commented 1 year ago

If I find the time I'll play around with it even though I still don't understand why the molecule subgraph is so slow. A regular networks graph even with a ton of attributes is much faster than the equivalent molecule/block. Any idea if we killed some performance enhancing functions?

pckroon commented 1 year ago

A regular networks graph even with a ton of attributes is much faster

For the same number of nodes/edges?

Molecule.subgraph also deals with interactions, which probably also slows things down. I think the main reason is that the default nx subgraph actually creates a view with shared node/edge attributes, while Molecule.subgraph creates an actual Graph with its own, copied, node attributes. See https://github.com/networkx/networkx/blob/main/networkx/classes/graph.py#L1694

fgrunewald commented 1 year ago

@pckroon I think you're right here. Our subgraph copies interactions by looping over all of them plus it does an actual copy of the nodes whereas the nx function returns a view. Now the question is do we need this thorough subgraph really? I will investigate a little bit by stripping out some parts and see where it breaks. But my feeling is that interaction copying could be optional with a keyword arg and/or becomes more efficient with the hashable interactions data structure.

pckroon commented 1 year ago

Now the question is do we need this thorough subgraph really?

Good question. My guess is we did it this way so we wouldn't have to hunt it down as an obscure bug sometime in the future. My concern with adding a keyword for the interactions is that Molecule.subgraph will have a different signature than Graph.subgraph. I'll probably survive though ;) Note that we generally call graph.subgraph (or molecule, whatever), but the residue graph function calls nx.subgraph https://github.com/marrink-lab/vermouth-martinize/blob/master/vermouth/graph_utils.py#L208, which actually calls graph.subgraph (https://github.com/networkx/networkx/blob/bc29f29698b4683286954dddaea64cd108d60f93/networkx/classes/function.py#L347). So potato, potato.