mojaie / MolecularGraph.jl

Graph-based molecule modeling toolkit for cheminformatics
MIT License

Bug in SMARTS queries #76

Open eahenle opened 1 year ago

eahenle commented 1 year ago

Given an input SMILES string and a SMARTS query from the MACCS fingerprinting scheme (we are trying to implement this fingerprinting for the package), we found the following issue:

using MolecularGraph
mol = smilestomol("CCOP(=S)(OCC)Oc1cc(C)nc(C(C)C)n1")
query = smartstomol("[#6]=[#6](~[!#6;!#1])~[!#6;!#1]")
hassubstructmatch(mol, query) # returns true, but should return false!

Looking at the substructure match in Pluto, we see this:

begin
    matched1 = Set(Iterators.flatten(keys(m) for m in substructmatches(mol, query)))
    subg1 = MolecularGraph.nodesubgraph(mol, matched1)
    svg1 = MolecularGraph.drawsvg(mol, 300, 300, highlight=subg1)
    HTML(svg1)
end

[image: SVG rendering of mol with the matched atoms highlighted]

This shows, I think, two problems:

This is only one example; for this single structure, many MACCS keys return false positives.

mojaie commented 1 year ago

Thank you for the catch. The SMARTS implementation may not yet handle some advanced queries. I'm working on SMARTS in the dev branch and will check the current state later.

mojaie commented 1 year ago

> MACCS fingerprinting scheme (we are trying to implement this fingerprinting for the package)

I'm very happy to hear that!

eahenle commented 1 year ago

Here is the complete list of MACCS rules that return false positives for the molecule shown above.

Each rule is a tuple that gives the SMARTS query and the count of matches that must be exceeded to turn the bit "on".

Tuple{String, Int64}[
("[#6]=[#6](~[!#6;!#1])~[!#6;!#1]", 0),
("[!#6;!#1]~[CH2]~[!#6;!#1]", 0),
("[!#6;!#1;!H0]~*~[!#6;!#1;!H0]", 0),
("[!#1;!#6;!#7;!#8;!#9;!#14;!#15;!#16;!#17;!#35;!#53]", 0),
("[#6]=[#6]~[#7]", 0),
("[!#6;!#1;!H0]~*~*~*~[!#6;!#1;!H0]", 0),
("[!#6;!#1;!H0]~*~*~[!#6;!#1;!H0]", 0),
("[!#6;!#1;!H0]~[!#6;!#1;!H0]", 0),
("[!#6;!#1]~[!#6;!#1;!H0]", 0),
("[!#6;!#1]~[#7]~[!#6;!#1]", 0),
("[#6]=[#6](~*)~*", 0),
("[#6]=[#7]", 0),
("*~[CH2]~[!#6;!#1;!H0]", 0),
("[C;H2,H3][!#6;!#1][C;H2,H3]", 0),
("[\$([!#6;!#1;!H0]~*~*~[CH2]~*),\$([!#6;!#1;!H0;R]1@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~[R]1@[R]@[CH2;R]1)]", 0),
("[\$([!#6;!#1;!H0]~*~*~*~[CH2]~*),\$([!#6;!#1;!H0;R]1@[R]@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~[R]1@[R]@[R]@[CH2;R]1),\$([!#6;!#1;!H0]~*~[R]1@[R]@[CH2;R]1)]", 0),
("[!#6;!#1]~[CH3]", 0),
("[!#6;!#1]~[#7]", 0),
("[#6]=[#6]", 0),
("[!#6;!#1;!H0]~*~[CH2]~*", 0),
("[#7]=*", 0),
("[!#6;!#1;!H0]", 1),
("*1~*~*~*~*~*~1", 1),
("[#6]-[#7]", 0)
]
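To make the rule format concrete, here is a minimal sketch of how a bit could be derived from a (SMARTS, count) tuple: the bit is "on" when the number of matches exceeds the count. The helper name `rulebits`, the `countmatches` argument, and the stand-in counts are hypothetical and exist only for illustration; they are not part of MolecularGraph.jl and are not real results for the molecule above.

```julia
# Two rules copied from the list above: (SMARTS query, count to exceed).
rules = Tuple{String, Int64}[
    ("[#6]=[#7]", 0),
    ("[!#6;!#1;!H0]", 1),
]

# Hypothetical helper: `countmatches` is any function that maps a SMARTS
# string to the number of matches found in the molecule of interest.
# A bit is on only when the match count is strictly greater than the threshold.
function rulebits(countmatches, rules)
    return [countmatches(smarts) > threshold for (smarts, threshold) in rules]
end

# Stand-in match counts for illustration only (not computed from a molecule).
fakecounts = Dict("[#6]=[#7]" => 0, "[!#6;!#1;!H0]" => 2)

bits = rulebits(q -> fakecounts[q], rules)
println(bits)
```

With MolecularGraph loaded, the counting function might be something like `q -> length(Set(Set(keys(m)) for m in substructmatches(mol, smartstomol(q))))`, assuming each match behaves like a dictionary keyed by atom index, as in the Pluto snippet earlier in this thread. Counting unique atom sets rather than raw matches is also an assumption about how the MACCS counts should be interpreted.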

mojaie commented 1 year ago

@eahenle The queries you listed seem to return false in the new version (I checked with v0.14.2).