Open timholy opened 3 years ago
I'm not sure but ModularProduct
may be expensive...
https://github.com/mojaie/MolecularGraph.jl/blob/be43e651c841a920b865cf8aa3f870eb7a963cf1/src/graph/product.jl#L24-L30
It's not the modularproduct
call itself that's expensive in and of itself (though of course it creates the expensive thing), it's the maximalcliques
/maximalconncliques
that costs the time. You can do a "poor-person's profiling" just by running mcismol
and hitting Ctrl-C and see where it is in the code when it breaks. If you do this several times, you'll see it's always in the clique-analysis.
I'm sorry, I didn't get your point.
maximalcliques
and maximalconncliques
are not optimized yet. If there is known implementation of maximal clique enumeration (like NetworkX but I'm not sure Julialang also have such one), we can do benchmarking and optimization. Or using well-established external graph mining library is an option.
Uniform molecule like your carbon ring example can't benefit from descriptor matching functions that can significantly reduce the modular product size and the execution time of maximalcliques
.
Here is a minimum sample code that returns modular product graph for mcesmol
using MolecularGraph
using MolecularGraph.Graph
function modprod(G, H)
lg = linegraph(G)
lh = linegraph(H)
am = MolecularGraph.atommatch(G, H)
bm = MolecularGraph.bondmatch(G, H)
ematch = MolecularGraph.Graph.lgedgematcher(lg, lh, am)
nmatch = MolecularGraph.Graph.lgnodematcher(lg, lh, am, bm)
eflt = MolecularGraph.Graph.modprodedgefilter(lg, lh, ematch)
return modularproduct(lg, lh, nodematcher=nmatch, edgefilter=eflt)
end
# Tutorial example
ldopa = smilestomol("O=C(O)[C@@H](N)Cc1cc(O)c(O)cc1")
ac = smilestomol("C1=CC=C2C(=C1)C=C(C(=O)O2)N")
prod = modprod(ldopa, ac)
println(nodecount(prod))
println(edgecount(prod))
195
861
# Your example
mol2 = smilestomol("C1C2C(CCC1)CCC2")
mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3")
prod = modprod(mol2, mol3)
println(nodecount(prod))
println(edgecount(prod))
150
5922
Descriptor matching functions MolecularGraph.atommatch
recognize difference in atom symbols and hybridization (important in bond angle and 3D structure) by default.
Dealing with uniform molecule is a difficult problem. You can use connected=true
and topological=true
to reduce the cost if it is acceptable in your application. Even wIth these tricks, fullerene (C60) calculation may never finishes.
At the version 0.14, MolecularGraph.maximal_cliques
is now implemented with Graphs.SimpleGraph
.
julia> mol2 = smilestomol("C1C2C(CCC1)CCC2");
julia> mol3 = smilestomol("C1C3C(C2C(C1)CCCC2)CCC3");
julia> @time Graphs.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
0.000526 seconds (7.14 k allocations: 703.336 KiB)
julia> @time MolecularGraph.maximal_cliques(tensor_product(mol2.graph, mol3.graph));
0.000507 seconds (8.68 k allocations: 851.594 KiB)
julia> @time Graphs.maximal_cliques(smallgraph(:karate))
0.000110 seconds (1.31 k allocations: 131.438 KiB)
36-element Vector{Vector{Int64}}:
[5, 1, 7]
[5, 1, 11]
[12, 1]
[8, 4, 2, 3, 1]
[17, 6, 7]
[1, 32]
[1, 6, 7]
[1, 6, 11]
[1, 9, 3]
[1, 13, 4]
⋮
[34, 33, 21]
[34, 33, 9, 31]
[34, 33, 24, 30]
[34, 33, 23]
[34, 33, 19]
[34, 10]
[34, 27, 30]
[2, 31]
[26, 24]
julia> @time MolecularGraph.maximal_cliques(smallgraph(:karate))
0.000138 seconds (1.96 k allocations: 206.531 KiB)
36-element Vector{Vector{Int64}}:
[5, 1, 7]
[5, 1, 11]
[12, 1]
[8, 4, 2, 3, 1]
[17, 6, 7]
[1, 32]
[1, 6, 7]
[1, 6, 11]
[1, 9, 3]
[1, 13, 4]
⋮
[34, 33, 21]
[34, 33, 9, 31]
[34, 33, 24, 30]
[34, 33, 23]
[34, 33, 19]
[34, 10]
[34, 27, 30]
[2, 31]
[26, 24]
I have tried several times with some graphs and it seems that Graphs
version performs slightly better.
On the other hand, MolecularGraphs.maximal_cliques
is based on a lazy Iterator and has several features that are useful in practical molecular mining (e.g. abortion by time out or reaching target size) , so I will continue to use this implementation for now.
julia> @time disconnected_mcis(mol2, mol3)
0.387954 seconds (2.06 M allocations: 219.295 MiB, 12.14% gc time)
MCSResult{Int64}(Dict(5 => 5, 4 => 4, 6 => 6, 7 => 11, 2 => 2, 9 => 13, 8 => 12, 3 => 3, 1 => 1), :done)
The computation time for MCS may have slightly improved. This mainly depends on the computational performance of the modular product generation mentioned above, and still there may be a room for optimization.
I have some not-huge molecules (a little bigger than the ones in this demo), and
mcismol
/mcesmol
are taking more runtime than I care to allow to finish. Here's a demo that illustrates the problem:even after first warmup. Profiling shows that essentially all the runtime is in https://github.com/mojaie/MolecularGraph.jl/blob/be43e651c841a920b865cf8aa3f870eb7a963cf1/src/graph/clique.jl#L83-L111
In contrast, if I convert this to a LightGraphs
SimpleGraph
:then
I'm far from certain that these are doing the same thing, but if they're even close it's a pretty big difference.