Closed kovasap closed 8 years ago
I just edited your comment so that formatting wasn't applied and the SMILES wasn't corrupted. If you take a look at what I did you'll see the trick.
It's fine to ask questions here, but the rdkit-discuss mailing list is probably better for questions since more people are likely to see them: https://sourceforge.net/p/rdkit/mailman/rdkit-discuss/ Note that you will need to subscribe to the list first.
Now to your question. I don't think I understand what you're trying to do. It really sounds like you just want to do a substructure match. Can you provide a simple example showing why a normal substructure match doesn't work?
Thanks for the formatting. I did mean for the initial asterisks to be bullet points; I changed the comment again to show the correct strings.
My ultimate goal is to take two molecules, either or both of which may contain * groups, and determine if they are 'equal' or not. By equal here i mean that the more general molecule represents all molecules that could be represented by the other, less general molecule.
For example, the two molecules in the original comment should be considered equal since the first molecule could be made into the second one by changing its * groups. I don't think a substructure match will work because, for example, if all the * groups in the first molecule above were replaced with hydrogens a substructure match (using my method) would still say the two are equal, without considering the fact that nothing can be added to the new 'hydrogen-ized' first molecule.
Apologies, but I think I'm still not getting it. Let me try and construct my own simplified examples to see if I've got it right. I'm going to leave the Hs and atom mapping info off because they aren't necessary and make the example more confusing. I will also use the word "matches" instead of "equal".
query | molecule | matches? | Note |
---|---|---|---|
OCCC* |
OCCCC |
yes | |
OCCC* |
OCCCCC |
yes | |
OCCC* |
OC(C)CCCC |
no | substitution at unlabelled position |
OCCC* |
COCCCCC |
no | substitution at unlabelled position |
OCCC* |
OCC |
no | too small |
OCCC* |
OCCC(C)C |
no | extra substitution |
OCCC* |
OCCC |
? | should hydrogens count as a substituent? |
If I've got that much right, and the answer to the last one should be "no", then you can use the function Chem.AdjustQueryProperties()
to do what you want. There's a blog post describing the thinking behind this function, along with a Python implementation, here: http://rdkit.blogspot.ch/2015/08/tuning-substructure-queries.html . I need to update that (or do a new post) to reflect the C++ implementation that's now available in the RDKit.
Here's how to use that:
qp = Chem.AdjustQueryParameters()
qp.makeDummiesQueries=True
qp.adjustDegree=True
qp.adjustDegreeFlags=Chem.ADJUST_IGNOREDUMMIES
m = Chem.MolFromSmiles('OCCC*')
qm = Chem.AdjustQueryProperties(m,qp)
And then you can see that this works by testing the entries from my table above:
In [27]: Chem.MolFromSmiles('OCCCC').HasSubstructMatch(qm)
Out[27]: True
In [28]: Chem.MolFromSmiles('OCCCCC').HasSubstructMatch(qm)
Out[28]: True
In [29]: Chem.MolFromSmiles('OC(C)CCCC').HasSubstructMatch(qm)
Out[29]: False
In [30]: Chem.MolFromSmiles('COCCCCC').HasSubstructMatch(qm)
Out[30]: False
In [31]: Chem.MolFromSmiles('OCC').HasSubstructMatch(qm)
Out[31]: False
In [32]: Chem.MolFromSmiles('OCCC').HasSubstructMatch(qm)
Out[32]: False
In [33]: Chem.MolFromSmiles('OCCC(C)C').HasSubstructMatch(qm)
Out[33]: False
You can always look at the SMARTS for your query to see what has been done:
In [35]: Chem.MolToSmarts(qm)
Out[35]: '[#8&D1]-[#6&D2]-[#6&D2]-[#6&D2]-*'
Is this what you're looking for?
Yes! I think this will work perfectly! Thanks for the explanation.
I also would like hydrogens to count as substituents, will that work if I just add hydrogens to all the mols? I'll try it myself too.
Adding Hs to the molecules and the queries will probably solve this, but it may complicate things and it will certainly make the searches take longer. It's worth trying though.
Looks like that works. Thanks for the help!
Is it possible for rdkit to do substructure matching with * atoms (general R groups)? I'm trying to see if two molecules are effectively identical (one molecule's * atoms match the other's fully specified groups) through substructure matching. I have been doing this by getting each molecule's substructure matches with the other and then comparing the length of the longest match (the number of atoms that matched) with the number of atoms in the molecule. If the lengths are the same, I would say the molecules are identical. Currently this isn't working because of the way substructure matching handles * atoms. Is there a better way to do this?
Here are two smiles strings that my method should say are identical:
If there's a better place to post questions like this please let me know!