mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License
197 stars 29 forks source link

Monomorphism mapping algorithm and its performance #48

Closed mojaie closed 3 years ago

mojaie commented 3 years ago

I was working on #33 in refactorsubstruct branch. It is almost tested and ready to merge.

Monomorphism mapping based on VF2 algorithm is newly implemented as a substructure match function. This is equivalent to typical substructure search, and returns node mapping.

julia> mol = smilestomol("C1C2=CC=CC3=C2C(=CC=C3)C1=O");
julia> query = smartstomol("[#6]~1~[#6]~[#6]~[#6]~2~[#6](~[#6]~1)~[#6]~[#6]~[#6]~2~[#8]");
julia> matches = structmatches(mol, query, :substruct);
julia> iterate(matches)[1]
Dict{Int64,Int64} with 10 entries:
  7  => 5
  9  => 3
  12 => 9
  10 => 2
  13 => 10
  2  => 7
  11 => 1
  8  => 4
  6  => 6
  1  => 8

This may be useful in the case like #45. It is different from edge-induced mapping in the case of isolated node queries (edge-induced mapping does not care single atom so the query like [#16;X2;!R] never matches).

Actually there are two monomorphism mapping methods in this library.

  1. Monomorphism mapping with VF2 algorithm
  2. Edge-induced subgraph isomorphism mapping -> maximum cardinality matching of isolated nodes -> emaptonmap(#45)

I'm not sure about

VF2 monomorphism is known to be inappropriate in some cases https://github.com/rdkit/rdkit/issues/880 Edge induced approach could be a more stable alternative.

Anyway, I will try benchmarking.

mojaie commented 3 years ago

Here is a simple benchmark script.

using DotEnv
DotEnv.config()

using Pkg
Pkg.add(PackageSpec(name="MolecularGraph", rev="master"))

using MolecularGraph
using MolecularGraph.Graph
using BenchmarkTools
using DataFrames
using CSV

srcpath = "DrugBank 5.1.8/open structures.sdf"
fpath = joinpath(ENV["CHEMICAL_LIBRARY_PATH"], srcpath)

mols = collect(Iterators.take(sdfilereader(fpath), 1000))
for mol in mols
    precalculate!(mol)
end

queries = [
    "C=1C=CC=CC1" => smilestomol("C=1C=CC=CC1"),
    "taxol" => smilestomol("CC1=C2[C@@]([C@]([C@H]([C@@H]3[C@]4([C@H](OC4)C[C@@H]([C@]3(C(=O)[C@@H]2OC(=O)C)C)O)OC(=O)C)OC(=O)c5ccccc5)(C[C@@H]1OC(=O)[C@H](O)[C@@H](NC(=O)c6ccccc6)c7ccccc7)O)(C)C"),
    "[#6][OD2][#6]" => smartstomol("[#6][OD2][#6]"),
    "O.O.OS(O)(=O)=O" => smartstomol("O.O.OS(O)(=O)=O")
]

data = DataFrame(query=[])
for (name, query) in queries
    suite = BenchmarkGroup()
    suite["exact"] = @benchmarkable [!isempty(structmatches(m, $query, :exact)) for m in $mols]
    suite["substruct"] = @benchmarkable [!isempty(structmatches(m, $query, :substruct)) for m in $mols]
    suite["nodeinduced"] = @benchmarkable [!isempty(structmatches(m, $query, :nodeinduced)) for m in $mols]
    suite["edgeinduced"] = @benchmarkable [!isempty(structmatches(m, $query, :edgeinduced)) for m in $mols]
    tune!(suite)
    res = run(suite, verbose = true, seconds = 1)
    row = time(median(res)).data
    row["query"] = name
    push!(data, row, cols=:union)
end

display(data)
outpath = joinpath(ENV["OUTPUT_DIR"], "bstructsearch_master.csv")
CSV.write(outpath, data)

Version 0.8.0

Note that substruct is based on edge-induced subgraph isomorphism

 Row │ query            exact     substruct 
     │ Any              Float64   Float64   
─────┼──────────────────────────────────────
   1 │ C=1C=CC=CC1       96702.0  1.40413e9
   2 │ taxol             66235.0  2.48469e7
   3 │ [#6][OD2][#6]    270486.0  4.22994e8
   4 │ O.O.OS(O)(=O)=O   62971.0  3.65845e8

Head d551ee5

Substructure search methods were redesigned. Monomorphism-based substructure search is a little bit faster than the above, but a disconnected query took longer.

 Row │ query            exact          substruct   nodeinduced  edgeinduced 
     │ Any              Float64        Float64     Float64      Float64     
─────┼──────────────────────────────────────────────────────────────────────
   1 │ C=1C=CC=CC1      67730.0        1.33409e8    2.18154e8     2.64464e8
   2 │ taxol            33996.0        9.56841e6    3.23519e7     2.31144e7
   3 │ [#6][OD2][#6]        2.34968e5  2.88606e8    4.12895e8     3.36828e8
   4 │ O.O.OS(O)(=O)=O  33983.0        8.75208e10   3.51557e11    3.17156e8

Improved pivot for VF2

https://github.com/mojaie/MolecularGraph.jl/blob/dbd2c78b5aba762fc7fcfbcc330f6cb75ec7f08c/src/graph/isomorphism/vf2.jl#L89-L91

This improved the performance of some kind of disconnected query.

 Row │ query            exact     substruct  nodeinduced  edgeinduced 
     │ Any              Float64   Float64    Float64      Float64     
─────┼────────────────────────────────────────────────────────────────
   1 │ C=1C=CC=CC1       69439.0  1.39935e8    2.26502e8    2.73493e8
   2 │ taxol             34375.0  4.23358e7    2.29409e7    2.26587e7
   3 │ [#6][OD2][#6]    230717.0  2.19852e8    2.79611e8    3.47654e8
   4 │ O.O.OS(O)(=O)=O   34045.0  2.0455e8     2.77854e8    3.30751e8