Closed mojaie closed 3 years ago
Here is a simple benchmark script.
using DotEnv
DotEnv.config()
using Pkg
Pkg.add(PackageSpec(name="MolecularGraph", rev="master"))
using MolecularGraph
using MolecularGraph.Graph
using BenchmarkTools
using DataFrames
using CSV
srcpath = "DrugBank 5.1.8/open structures.sdf"
fpath = joinpath(ENV["CHEMICAL_LIBRARY_PATH"], srcpath)
mols = collect(Iterators.take(sdfilereader(fpath), 1000))
for mol in mols
precalculate!(mol)
end
queries = [
"C=1C=CC=CC1" => smilestomol("C=1C=CC=CC1"),
"taxol" => smilestomol("CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C"),
"[#6][OD2][#6]" => smartstomol("[#6][OD2][#6]"),
"O.O.OS(O)(=O)=O" => smartstomol("O.O.OS(O)(=O)=O")
]
data = DataFrame(query=[])
for (name, query) in queries
suite = BenchmarkGroup()
suite["exact"] = @benchmarkable [!isempty(structmatches(m, $query, :exact)) for m in $mols]
suite["substruct"] = @benchmarkable [!isempty(structmatches(m, $query, :substruct)) for m in $mols]
suite["nodeinduced"] = @benchmarkable [!isempty(structmatches(m, $query, :nodeinduced)) for m in $mols]
suite["edgeinduced"] = @benchmarkable [!isempty(structmatches(m, $query, :edgeinduced)) for m in $mols]
tune!(suite)
res = run(suite, verbose = true, seconds = 1)
row = time(median(res)).data
row["query"] = name
push!(data, row, cols=:union)
end
display(data)
outpath = joinpath(ENV["OUTPUT_DIR"], "bstructsearch_master.csv")
CSV.write(outpath, data)
Note that substruct
is based on edge-induced subgraph isomorphism
Row │ query exact substruct
│ Any Float64 Float64
─────┼──────────────────────────────────────
1 │ C=1C=CC=CC1 96702.0 1.40413e9
2 │ taxol 66235.0 2.48469e7
3 │ [#6][OD2][#6] 270486.0 4.22994e8
4 │ O.O.OS(O)(=O)=O 62971.0 3.65845e8
Substructure search methods were redesigned. Monomorphism-based substructure search is a little bit faster than the above, but a disconnected query took longer.
Row │ query exact substruct nodeinduced edgeinduced
│ Any Float64 Float64 Float64 Float64
─────┼──────────────────────────────────────────────────────────────────────
1 │ C=1C=CC=CC1 67730.0 1.33409e8 2.18154e8 2.64464e8
2 │ taxol 33996.0 9.56841e6 3.23519e7 2.31144e7
3 │ [#6][OD2][#6] 2.34968e5 2.88606e8 4.12895e8 3.36828e8
4 │ O.O.OS(O)(=O)=O 33983.0 8.75208e10 3.51557e11 3.17156e8
This improved the performance of some kind of disconnected query.
Row │ query exact substruct nodeinduced edgeinduced
│ Any Float64 Float64 Float64 Float64
─────┼────────────────────────────────────────────────────────────────
1 │ C=1C=CC=CC1 69439.0 1.39935e8 2.26502e8 2.73493e8
2 │ taxol 34375.0 4.23358e7 2.29409e7 2.26587e7
3 │ [#6][OD2][#6] 230717.0 2.19852e8 2.79611e8 3.47654e8
4 │ O.O.OS(O)(=O)=O 34045.0 2.0455e8 2.77854e8 3.30751e8
I was working on #33 in
refactorsubstruct
branch. It is almost tested and ready to merge.Monomorphism mapping based on VF2 algorithm is newly implemented as a substructure match function. This is equivalent to typical substructure search, and returns node mapping.
This may be useful in the case like #45. It is different from edge-induced mapping in the case of isolated node queries (edge-induced mapping does not care single atom so the query like
[#16;X2;!R]
never matches).Actually there are two monomorphism mapping methods in this library.
emaptonmap
(#45)I'm not sure about
VF2 monomorphism is known to be inappropriate in some cases https://github.com/rdkit/rdkit/issues/880 Edge induced approach could be a more stable alternative.
Anyway, I will try benchmarking.