openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License

Rework are_isomorphic method of Molecule class for polymers #1734

Open pbuslaev opened 9 months ago

pbuslaev commented 9 months ago

**Is your feature request related to a problem? Please describe.** The are_isomorphic method of the Molecule class is slow for large polymers. This can be a problem when importing a Topology from OpenMM. To check this, I used tleap to generate capped ALA peptides with one randomly positioned ASP. For example:

sequence 1: ACE ALA ASP ALA ALA ALA NME
sequence 2: ACE ALA ALA ASP ALA ALA NME

Then I loaded the generated PDB files into Molecule objects and called

```python
Molecule.are_isomorphic(
    x.to_networkx(),
    y.to_networkx(),
    aromatic_matching=False,
    formal_charge_matching=False,
    bond_order_matching=False,
    atom_stereochemistry_matching=False,
    bond_stereochemistry_matching=False,
    strip_pyrimidal_n_atom_stereo=False,
)
```

as in the from_openmm method, and measured how long each call took. The results are as follows:

| # amino acids | time (s) |
| ------------- | -------- |
| 5             | 0.03     |
| 9             | 0.21     |
| 13            | 1.52     |
| 17            | 11.3     |
| 21            | 79.9     |
| 25            | 552      |

For even larger protein systems the method can take even longer, which can stop users from using the from_openmm method. And the scaling is worse than $O(n^2)$.
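To see this from the table: each step adds four residues and multiplies the time by roughly seven, a constant ratio that points to exponential rather than polynomial growth. A quick check (plain arithmetic on the table above):

```python
# Timings from the table above: (number of amino acids, seconds)
timings = [(5, 0.03), (9, 0.21), (13, 1.52), (17, 11.3), (21, 79.9), (25, 552)]

# Each step adds 4 residues; a polynomial would show a shrinking ratio as n
# grows, but the ratio stays roughly constant (~7x per 4 residues).
for (n0, t0), (n1, t1) in zip(timings, timings[1:]):
    print(f"{n0} -> {n1} residues: {t1 / t0:.1f}x slower")
```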

**Describe the solution you'd like** I think that, in the case of polymers, the solution could be to test whether molecules are isomorphic on a per-residue basis. We can use perceive_residues to split the polymer into residues and then generate and compare graphs for each residue. This should be faster, since amino acids are small and we would need at most $n$ comparisons, where $n$ is the length of the protein. This would require passing Molecule objects to are_isomorphic, but this should be a minor issue.

If we agree on the suggested solution, I can start working on it.

mattwthompson commented 9 months ago

Any idea what causes the poor scaling? i.e. graph conversion vs. the actual comparisons

Note the use of networkx is already known to be horrible for performance - see some related discussion #617 #1143 https://github.com/openforcefield/openff-toolkit/issues/353. There might be ways to speed it up while keeping the base structure the same. (For example, the notion of residues is weak in non-biological polymers, so for some use cases it'd be nice to not need to specify the subunits.)

pbuslaev commented 9 months ago

Graph conversion is fast. The problem is at the GraphMatcher stage.
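For reference, the two stages can be timed separately with a sketch like this (assuming x and y are the Molecule objects from the example above):

```python
import time

from networkx.algorithms.isomorphism import GraphMatcher

# Stage 1: conversion of the OpenFF molecules to NetworkX graphs.
start = time.time()
g1, g2 = x.to_networkx(), y.to_networkx()
print(f"to_networkx: {time.time() - start:.3f} s")  # fast

# Stage 2: the actual isomorphism search, which dominates the runtime.
start = time.time()
GraphMatcher(g1, g2).is_isomorphic()
print(f"GraphMatcher: {time.time() - start:.3f} s")
```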

mattwthompson commented 9 months ago

Last time I came across a similar problem (years back, on a different project, but still dealing with a sort of graph-matching issue) we wanted to use graph-tool which was fast but hard to install - I think that's changed.

I worry that continuing to put performance shortcuts on top of a slow algorithm will only get us so far and leave us with something that's brittle (already we have a ton of shortcuts that make it hard to change). But refactoring it might also be more work.

pbuslaev commented 9 months ago

If I see it correctly, the solution that I propose won't change much in the code. Inside are_isomorphic, we can add something like this:

```python
mol1.perceive_residues()
mol2.perceive_residues()
if len(mol1.residues) > 1 and len(mol2.residues) > 1:
    return Molecule._compare_polymers(mol1, mol2)
# otherwise, fall back to the existing implementation
```

_compare_polymers would simply iterate over residues, comparing residue pairs, again based on the networkx approach.
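A rough sketch of what such a helper could look like (hypothetical, not existing toolkit API; it compares residues pairwise in sequence order and does not yet verify the bonds between residues):

```python
from networkx.algorithms.isomorphism import GraphMatcher


def _compare_polymers(mol1, mol2):
    # Hypothetical sketch: compare two polymers residue by residue.
    res1_list, res2_list = list(mol1.residues), list(mol2.residues)
    if len(res1_list) != len(res2_list):
        return False
    g1, g2 = mol1.to_networkx(), mol2.to_networkx()
    for res1, res2 in zip(res1_list, res2_list):
        idx1 = [atom.molecule_atom_index for atom in res1.atoms]
        idx2 = [atom.molecule_atom_index for atom in res2.atoms]
        matcher = GraphMatcher(
            g1.subgraph(idx1),
            g2.subgraph(idx2),
            node_match=lambda a, b: a["atomic_number"] == b["atomic_number"],
        )
        # Each residue is small, so each of these checks is cheap.
        if not matcher.is_isomorphic():
            return False
    # NB: bonds between residues are not verified in this sketch.
    return True
```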

From the global perspective, the only thing that changes is that unique_molecules are stored as Molecule objects, and all calls to are_isomorphic should then use Molecule objects.

mattwthompson commented 9 months ago

I don't want to get in the way of you adding functionality - if you can get a new use case from glacial to viable without breaking other behavior that's a pure win - I just wanted to provide some context for why it's slow. The vast majority of the Molecule API has never been used with molecules larger than about 100 atoms until this year, so things like GraphMatcher being unworkably slow for polymers is an expected growing pain, and the sort of thing that's tempting to refactor without a small molecule assumption.

j-wags commented 9 months ago

I'm quite open to ideas about speeding this up, and coarsening the whole graph into repeating units seems like a promising approach. My biggest concern here is the reliance on residues specifically, for a few reasons:

- (a big problem) Like Matt said, dividing on residues is a bit arbitrary, and we can't assume that perceive_residues will always be suitable for the user's polymer.
- Unconditionally adding perceive_residues to every are_isomorphic call will make all non-protein comparisons slower.

I agree with the idea of simply swapping in a faster library like graph-tool (especially now that it's deployable from conda-forge). And if we want to try optimizations on top of that, the two general routes I see would be:

- Some sort of "coarsening" approach, like preprocessing the structure to find repeating units (a more general form of "residues" from @pbuslaev's original post), and then comparing the coarsened graph.

pbuslaev commented 9 months ago

Actually, quick testing with networkx, igraph, and graph-tool gave the following results:

```python
import time

import igraph
from graph_tool import Graph
from graph_tool.topology import isomorphism
from networkx.algorithms.isomorphism import GraphMatcher

k = 3
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")

# networkx with a node_match function
def node_match_func(x, y):
    # always match by at least atomic number
    return x["atomic_number"] == y["atomic_number"]

s = time.time()
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx(), node_match=node_match_func)
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx took {time.time() - s}")

# networkx, no node_match function
s = time.time()
GM = GraphMatcher(p1[k].to_networkx(), p2[k].to_networkx())
print(f"Graphs are isomorphic: {GM.is_isomorphic()}")
print(f"Networkx without node_match function took {time.time() - s}")

# igraph, bliss algorithm
s = time.time()
nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = igraph.Graph(nx1.number_of_nodes(), list(nx1.edges))
h = igraph.Graph(nx2.number_of_nodes(), list(nx2.edges))
print(f"Graphs are isomorphic: {g.isomorphic(h)}")
print(f"Igraph took {time.time() - s}")

# igraph, vf2 algorithm
s = time.time()
nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = igraph.Graph(nx1.number_of_nodes(), list(nx1.edges))
h = igraph.Graph(nx2.number_of_nodes(), list(nx2.edges))
print(f"Graphs are isomorphic: {g.isomorphic_vf2(h)}")
print(f"Igraph took {time.time() - s}")

# graph_tool
s = time.time()
g = Graph(list(p1[k].to_networkx().edges))
h = Graph(list(p2[k].to_networkx().edges))
print(f"Graphs are isomorphic: {isomorphism(g, h)}")
print(f"Graph_tool took {time.time() - s}")
```

```
Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Networkx took 11.790411710739136
Graphs are isomorphic: False
Igraph took 0.005591154098510742
Graphs are isomorphic: False
Igraph took 95.27721905708313
Graphs are isomorphic: False
Graph_tool took 104.98597836494446
```

So the fastest method for now is igraph.isomorphic, which uses the bliss algorithm. The problem, if I understand correctly, is that we cannot pass any vertex attributes to this method. We can pass attributes to igraph.isomorphic_vf2; networkx also uses the VF2 algorithm. Even for smaller systems, without a node_match function passed to networkx, igraph was several-fold faster:

```
Test for a system with 64 atoms and 63 bonds
Graphs are isomorphic: False
Networkx without node_match function took 0.3168954849243164
Graphs are isomorphic: False
Igraph took 0.0030617713928222656
Test for a system with 104 atoms and 103 bonds
Graphs are isomorphic: False
Networkx without node_match function took 15.669962644577026
Graphs are isomorphic: False
Igraph took 0.05946803092956543
```

I tried passing attributes to igraph and graph_tool. With igraph I did not get any performance gain, while graph_tool worked faster than without property passing, but still slower than networkx:

```python
# graph_tool with a vertex property map
import numpy as np

print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = Graph(list(nx1.edges))
prop1 = g.new_vertex_property("int")
prop1.a = np.array([nx1.nodes[x]["atomic_number"] for x in range(nx1.number_of_nodes())])
h = Graph(list(nx2.edges))
prop2 = h.new_vertex_property("int")
prop2.a = np.array([nx2.nodes[x]["atomic_number"] for x in range(nx2.number_of_nodes())])
print(f"Graphs are isomorphic: {isomorphism(g, h, prop1, prop2)}")
print(f"Graph_tool with vertex invariants took {time.time() - s}")

# igraph with node matching
print(f"Test for a system with {p1[k].n_atoms} atoms and {p1[k].n_bonds} bonds")
s = time.time()
nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = igraph.Graph(nx1.number_of_nodes(), list(nx1.edges))
g["a"] = [nx1.nodes[x]["atomic_number"] for x in range(nx1.number_of_nodes())]
h = igraph.Graph(nx2.number_of_nodes(), list(nx2.edges))
h["a"] = [nx2.nodes[x]["atomic_number"] for x in range(nx2.number_of_nodes())]

def node_match_func(g1, g2, i1, i2):
    # always match by at least atomic number
    return g1["a"][i1] == g2["a"][i2]

print(f"Graphs are isomorphic: {g.isomorphic_vf2(h, node_compat_fn=node_match_func)}")
print(f"Igraph with attributes took {time.time() - s}")
```

```
Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Graph_tool with vertex invariants took 27.565279006958008
Test for a system with 184 atoms and 183 bonds
Graphs are isomorphic: False
Igraph with attributes took 88.5980315208435
```

So my current opinion is that using other libraries is the proper solution.
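One caveat to the bliss limitation above: depending on the python-igraph version, isomorphic_bliss may accept integer color1/color2 vertex-color vectors, which would allow matching on atomic numbers without falling back to VF2. A sketch, worth verifying against the installed igraph version:

```python
import igraph

nx1, nx2 = p1[k].to_networkx(), p2[k].to_networkx()
g = igraph.Graph(nx1.number_of_nodes(), list(nx1.edges))
h = igraph.Graph(nx2.number_of_nodes(), list(nx2.edges))

# Atomic numbers as integer vertex colors for the bliss backend
# (assumes this python-igraph build exposes color1/color2 here).
colors1 = [nx1.nodes[i]["atomic_number"] for i in range(nx1.number_of_nodes())]
colors2 = [nx2.nodes[i]["atomic_number"] for i in range(nx2.number_of_nodes())]
print(g.isomorphic_bliss(h, color1=colors1, color2=colors2))
```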

> (a big problem) Like Matt said, dividing on residues is a bit arbitrary, and we can't assume that perceive_residues will always be suitable for the user's polymer. Unconditionally adding perceive_residues to every are_isomorphic call will make all non-protein comparisons slower.
>
> Some sort of "coarsening" approach, like preprocessing the structure to find repeating units (a more general form of "residues" from @pbuslaev's original post), and then comparing the coarsened graph.

I think that splitting into residues can be done when the molecule is created (at least, residues are already there if the molecule is created with Molecule.from_polymer_pdb and some other methods). Then splitting into residues with perceive_residues is not needed, and all we need to do is check whether residue information is present for a molecule and, if yes, run are_isomorphic on a per-residue basis.
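A minimal sketch of that presence check (assuming the toolkit's hierarchy-scheme API; attribute names worth double-checking against the current toolkit):

```python
def has_residue_info(mol):
    # Molecules built via Molecule.from_polymer_pdb (and some other paths)
    # already carry a "residues" hierarchy scheme with populated elements,
    # so perceive_residues would not need to be re-run for them.
    return "residues" in mol.hierarchy_schemes and len(list(mol.residues)) > 1
```

If both molecules pass this check, the residue-wise comparison can be used; otherwise the existing whole-graph matching stays in place.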

jchodera commented 9 months ago

If the isomorphism check for residue-coarsened molecules can be made to work correctly (that is, it is always correct and never an approximation if the template matching succeeded in matching the entire molecule to some residue templates), this sounds like a great approach. If we provide templates for the majority of the common use cases, we speed up nearly all the common use cases without losing correctness. We could still fall back to slow graph matching when there are things that don't match templates.
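A minimal sketch of that control flow (hypothetical helper names, just to make the fallback explicit):

```python
from openff.toolkit import Molecule


def are_isomorphic_with_templates(mol1, mol2):
    # Hypothetical: try the fast residue-template route first. The result is
    # only trusted when every atom of both molecules matched some template.
    coarse1, fully_matched1 = match_residue_templates(mol1)  # hypothetical helper
    coarse2, fully_matched2 = match_residue_templates(mol2)  # hypothetical helper
    if fully_matched1 and fully_matched2:
        return coarse_graphs_are_isomorphic(coarse1, coarse2)  # hypothetical helper
    # Otherwise fall back to the slow whole-molecule graph matching;
    # are_isomorphic returns (bool, atom_map), hence the [0].
    return Molecule.are_isomorphic(mol1, mol2)[0]
```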

pbuslaev commented 9 months ago

A slightly different approach could be to:

  1. Match residues first (do graph matching on the residue basis). This should be fast.
  2. If residues can be mapped, run additional graph matching for the atoms within the mapped residues (a rough sketch below).
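A rough sketch of the two stages (hypothetical helpers; residue attribute names assumed from the toolkit's residue hierarchy scheme):

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher


def residue_graph(mol):
    # Coarse graph: one node per residue, labelled by residue name, with an
    # edge wherever a bond crosses two residues (assumes residue info exists).
    atom_to_res = {}
    coarse = nx.Graph()
    for i, res in enumerate(mol.residues):
        coarse.add_node(i, residue_name=res.residue_name)
        for atom in res.atoms:
            atom_to_res[atom.molecule_atom_index] = i
    for bond in mol.bonds:
        r1, r2 = atom_to_res[bond.atom1_index], atom_to_res[bond.atom2_index]
        if r1 != r2:
            coarse.add_edge(r1, r2)
    return coarse


def polymers_are_isomorphic(mol1, mol2):
    # Stage 1: cheap matching on the residue-level graphs; this already
    # fails fast for cases like the ACE ALA ASP ... example above.
    matcher = GraphMatcher(
        residue_graph(mol1),
        residue_graph(mol2),
        node_match=lambda a, b: a["residue_name"] == b["residue_name"],
    )
    if not matcher.is_isomorphic():
        return False
    # Stage 2: atom-level matching within the mapped residues
    # (see the _compare_polymers sketch in the earlier comment).
    return _compare_polymers(mol1, mol2)
```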