mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License
192 stars 28 forks source link

Performance of common substructure/maximalcliques #50

Open timholy opened 3 years ago

timholy commented 3 years ago

I have some not-huge molecules (a little bigger than the ones in this demo), and mcismol/mcesmol are taking more runtime than I care to allow to finish. Here's a demo that illustrates the problem:

julia> mol1 = smilestomol("C1CCCC1");   # hexane

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");   # hexahydroindan, i.e., https://pubchem.ncbi.nlm.nih.gov/compound/10325

julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3"); # https://pubchem.ncbi.nlm.nih.gov/compound/12729210

julia> @time mcismol(mol2, mol3)
  0.604197 seconds (2.20 M allocations: 274.098 MiB, 16.81% gc time)
(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

even after first warmup. Profiling shows that essentially all the runtime is in https://github.com/mojaie/MolecularGraph.jl/blob/be43e651c841a920b865cf8aa3f870eb7a963cf1/src/graph/clique.jl#L83-L111

In contrast, if I convert this to a LightGraphs SimpleGraph:

function simplegraph(mol::UndirectedGraph)
    g = LightGraphs.SimpleGraph(nodecount(mol))
    for e in edgesiter(mol)
        LightGraphs.add_edge!(g, e...)
    end
    return g
end

then

julia> g1, g2, g3 = simplegraph(mol1), simplegraph(mol2), simplegraph(mol3);

julia> @time maximal_cliques(tensor_product(g2, g3));
  0.000712 seconds (7.20 k allocations: 838.570 KiB)

I'm far from certain that these are doing the same thing, but if they're even close it's a pretty big difference.

mojaie commented 3 years ago

I'm not sure but ModularProduct may be expensive... https://github.com/mojaie/MolecularGraph.jl/blob/be43e651c841a920b865cf8aa3f870eb7a963cf1/src/graph/product.jl#L24-L30

timholy commented 3 years ago

It's not the modularproduct call itself that's expensive in and of itself (though of course it creates the expensive thing), it's the maximalcliques/maximalconncliques that costs the time. You can do a "poor-person's profiling" just by running mcismol and hitting Ctrl-C and see where it is in the code when it breaks. If you do this several times, you'll see it's always in the clique-analysis.

mojaie commented 3 years ago

I'm sorry, I didn't get your point.

  1. maximalcliques and maximalconncliques are not optimized yet. If there is known implementation of maximal clique enumeration (like NetworkX but I'm not sure Julialang also have such one), we can do benchmarking and optimization. Or using well-established external graph mining library is an option.

  2. Uniform molecule like your carbon ring example can't benefit from descriptor matching functions that can significantly reduce the modular product size and the execution time of maximalcliques.

Here is a minimum sample code that returns modular product graph for mcesmol

using MolecularGraph
using MolecularGraph.Graph

function modprod(G, H)
    lg = linegraph(G)
    lh = linegraph(H)
    am = MolecularGraph.atommatch(G, H)
    bm = MolecularGraph.bondmatch(G, H)
    ematch = MolecularGraph.Graph.lgedgematcher(lg, lh, am)
    nmatch = MolecularGraph.Graph.lgnodematcher(lg, lh, am, bm)
    eflt = MolecularGraph.Graph.modprodedgefilter(lg, lh, ematch)
    return modularproduct(lg, lh, nodematcher=nmatch, edgefilter=eflt)
end
# Tutorial example
ldopa = smilestomol("O=C(O)[C@@H](N)Cc1cc(O)c(O)cc1")
ac = smilestomol("C1=CC=C2C(=C1)C=C(C(=O)O2)N")
prod = modprod(ldopa, ac)
println(nodecount(prod))
println(edgecount(prod))
195
861
# Your example
mol2 = smilestomol("C1C2C(CCC1)CCC2")
mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3")
prod = modprod(mol2, mol3)
println(nodecount(prod))
println(edgecount(prod))
150
5922

Descriptor matching functions MolecularGraph.atommatch recognize difference in atom symbols and hybridization (important in bond angle and 3D structure) by default.

Dealing with uniform molecule is a difficult problem. You can use connected=true and topological=true to reduce the cost if it is acceptable in your application. Even wIth these tricks, fullerene (C60) calculation may never finishes.

mojaie commented 1 year ago

At the version 0.14, MolecularGraph.maximal_cliques is now implemented with Graphs.SimpleGraph.

julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");
julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3");

julia> @time Graphs.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000526 seconds (7.14 k allocations: 703.336 KiB)
julia> @time MolecularGraph.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
  0.000507 seconds (8.68 k allocations: 851.594 KiB)
julia> @time Graphs.maximal_cliques(smallgraph(:karate))
  0.000110 seconds (1.31 k allocations: 131.438 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]
julia> @time MolecularGraph.maximal_cliques(smallgraph(:karate))
  0.000138 seconds (1.96 k allocations: 206.531 KiB)
36-element Vector{Vector{Int64}}:
 [5, 1, 7]
 [5, 1, 11]
 [12, 1]
 [8, 4, 2, 3, 1]
 [17, 6, 7]
 [1, 32]
 [1, 6, 7]
 [1, 6, 11]
 [1, 9, 3]
 [1, 13, 4]
 ⋮
 [34, 33, 21]
 [34, 33, 9, 31]
 [34, 33, 24, 30]
 [34, 33, 23]
 [34, 33, 19]
 [34, 10]
 [34, 27, 30]
 [2, 31]
 [26, 24]

I have tried several times with some graphs and it seems that Graphs version performs slightly better. On the other hand, MolecularGraphs.maximal_cliques is based on a lazy Iterator and has several features that are useful in practical molecular mining (e.g. abortion by time out or reaching target size) , so I will continue to use this implementation for now.

julia> @time disconnected_mcis(mol2, mol3)
  0.387954 seconds (2.06 M allocations: 219.295 MiB, 12.14% gc time)
MCSResult{Int64}(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)

The computation time for MCS may have slightly improved. This mainly depends on the computational performance of the modular product generation mentioned above, and still there may be a room for optimization.