rdkit / rdkit

The official sources for the RDKit library
BSD 3-Clause "New" or "Revised" License
2.69k stars 881 forks source link

Substructure matching with general groups #943

Closed kovasap closed 8 years ago

kovasap commented 8 years ago

Is it possible for rdkit to do substructure matching with * atoms (general R groups)? I'm trying to see if two molecules are effectively identical (one molecule's * atoms match the other's fully specified groups) through substructure matching. I have been doing this by getting each molecule's substructure matches with the other and then comparing the length of the longest match (the number of atoms that matched) with the number of atoms in the molecule. If the lengths are the same, I would say the molecules are identical. Currently this isn't working because of the way substructure matching handles * atoms. Is there a better way to do this?

Here are two smiles strings that my method should say are identical:

[H:12][O:4][C:2]([H:10])([C:1]([H:9])([*:6])[*:8])[C:3]([H:11])([*:5])[*:7]
[H:12][O:4][C:2]([H:10])([C:1]([H:9])([H:6])[H:8])[C:3]([H:11])([C:5]([H:15])([H:16])([H:17]))[C:7]([H:12])([H:13])[*:14]

If there's a better place to post questions like this please let me know!

greglandrum commented 8 years ago

I just edited your comment so that formatting wasn't applied and the SMILES wasn't corrupted. If you take a look at what I did you'll see the trick.

It's fine to ask questions here, but the rdkit-discuss mailing list is probably better for questions since more people are likely to see them: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/ Note that you will need to subscribe to the list first.

greglandrum commented 8 years ago

Now to your question. I don't think I understand what you're trying to do. It really sounds like you just want to do a substructure match. Can you provide a simple example showing why a normal substructure match doesn't work?

kovasap commented 8 years ago

Thanks for the formatting. I did mean for the initial asterisks to be bullet points; I changed the comment again to show the correct strings.

My ultimate goal is to take two molecules, either or both of which may contain * groups, and determine if they are 'equal' or not. By equal here i mean that the more general molecule represents all molecules that could be represented by the other, less general molecule.

For example, the two molecules in the original comment should be considered equal since the first molecule could be made into the second one by changing its * groups. I don't think a substructure match will work because, for example, if all the * groups in the first molecule above were replaced with hydrogens a substructure match (using my method) would still say the two are equal, without considering the fact that nothing can be added to the new 'hydrogen-ized' first molecule.

greglandrum commented 8 years ago

Apologies, but I think I'm still not getting it. Let me try and construct my own simplified examples to see if I've got it right. I'm going to leave the Hs and atom mapping info off because they aren't necessary and make the example more confusing. I will also use the word "matches" instead of "equal".

query molecule matches? Note
OCCC* OCCCC yes
OCCC* OCCCCC yes
OCCC* OC(C)CCCC no substitution at unlabelled position
OCCC* COCCCCC no substitution at unlabelled position
OCCC* OCC no too small
OCCC* OCCC(C)C no extra substitution
OCCC* OCCC ? should hydrogens count as a substituent?

If I've got that much right, and the answer to the last one should be "no", then you can use the function Chem.AdjustQueryProperties() to do what you want. There's a blog post describing the thinking behind this function, along with a Python implementation, here: http://rdkit.blogspot.ch/2015/08/tuning-substructure-queries.html . I need to update that (or do a new post) to reflect the C++ implementation that's now available in the RDKit.

Here's how to use that:

qp = Chem.AdjustQueryParameters()
qp.makeDummiesQueries=True
qp.adjustDegree=True
qp.adjustDegreeFlags=Chem.ADJUST_IGNOREDUMMIES
m = Chem.MolFromSmiles('OCCC*')
qm = Chem.AdjustQueryProperties(m,qp)

And then you can see that this works by testing the entries from my table above:

In [27]: Chem.MolFromSmiles('OCCCC').HasSubstructMatch(qm)
Out[27]: True
In [28]: Chem.MolFromSmiles('OCCCCC').HasSubstructMatch(qm)
Out[28]: True
In [29]: Chem.MolFromSmiles('OC(C)CCCC').HasSubstructMatch(qm)
Out[29]: False
In [30]: Chem.MolFromSmiles('COCCCCC').HasSubstructMatch(qm)
Out[30]: False
In [31]: Chem.MolFromSmiles('OCC').HasSubstructMatch(qm)
Out[31]: False
In [32]: Chem.MolFromSmiles('OCCC').HasSubstructMatch(qm)
Out[32]: False
In [33]: Chem.MolFromSmiles('OCCC(C)C').HasSubstructMatch(qm)
Out[33]: False

You can always look at the SMARTS for your query to see what has been done:

In [35]: Chem.MolToSmarts(qm)
Out[35]: '[#8&D1]-[#6&D2]-[#6&D2]-[#6&D2]-*'

Is this what you're looking for?

kovasap commented 8 years ago

Yes! I think this will work perfectly! Thanks for the explanation.

kovasap commented 8 years ago

I also would like hydrogens to count as substituents, will that work if I just add hydrogens to all the mols? I'll try it myself too.

greglandrum commented 8 years ago

Adding Hs to the molecules and the queries will probably solve this, but it may complicate things and it will certainly make the searches take longer. It's worth trying though.

kovasap commented 8 years ago

Looks like that works. Thanks for the help!