Decide good defaults for RDKitToolkitWrapper.generate_conformers

j-wags commented 5 years ago

We don't want to add lots of knobs to our generate_conformers calls, but it would probably be good if they have similar behavior between toolkits. I tried to implement comparable versions between OpenEye and RDKit, but end up getting the following results:

toolkit_wrapper = OpenEyeToolkitWrapper()
smiles = 'CCCCCCC'
molecule = toolkit_wrapper.from_smiles(smiles)
molecule.generate_conformers(toolkit_registry=toolkit_wrapper)
print(len(molecule._conformers))

3

toolkit_wrapper = RDKitToolkitWrapper()
smiles = 'CCCCCCC'
molecule = toolkit_wrapper.from_smiles(smiles)
molecule.generate_conformers(toolkit_registry=toolkit_wrapper)
print(len(molecule._conformers))

463

I'm concerned that there's such a big difference in the number of output conformers.

The implementations in the toolkits are as follows:

Openeye:

        omega = oeomega.OEOmega()
        omega.SetMaxConfs(800)
        omega.SetCanonOrder(False)
        omega.SetSampleHydrogens(True)
        omega.SetEnergyWindow(15.0)
        omega.SetRMSThreshold(1.0)
        #Don't generate random stereoisomer if not specified
        omega.SetStrictStereo(True) 
        status = omega(oemol)

RDKit:

        AllChem.EmbedMultipleConfs(rdmol,
                                   numConfs=800,
                                   pruneRmsThresh=1.0,
                                   randomSeed=1,
                                  )

I suspect the difference is due to the lack of an energy cutoff in the RDKit version, but I couldn't find a way to set it. Admittedly, I've have limited experience with this part of RDKit.

@hjuinj, do you have any advice on how to make these give comparable results?

j-wags commented 5 years ago

Update: Not adding hydrogens to the rdmol keeps the number of generated conformers down. However, I don't want to remove H's before conformer generation and add them back after, as the geometry probably won't be good, and we intend to use these conformers for partial charge calculation.

Example:

from rdkit import Chem
from rdkit.Chem import AllChem
rdmol = Chem.MolFromSmiles('CCCCCCC')
rdmol = Chem.AddHs(rdmol)
AllChem.EmbedMultipleConfs(rdmol,
                            numConfs=800,
                            pruneRmsThresh=1.0,
                            randomSeed=1,
                          )
print(rdmol.GetNumConformers())

463

from rdkit import Chem
from rdkit.Chem import AllChem
rdmol = Chem.MolFromSmiles('CCCCCCC')
#rdmol = Chem.AddHs(rdmol)
AllChem.EmbedMultipleConfs(rdmol,
                            numConfs=800,
                            pruneRmsThresh=1.0,
                            randomSeed=1,
                          )
print(rdmol.GetNumConformers())

5

davidlmobley commented 5 years ago

I would guess that the top few conformers are similar.

Presumably we aren't actually going to use all of the conformers for anything, so I would guess that this would be an unimportant distinction, and would plunge ahead and punt this to a future PR or and follow up on it in testing later. This is also an issue it would be easy to involve someone OTHER than you in, presumably, since the main things to do would be: a) check that the top conformers are similar (safety/sanity check), and b) see if there's a way to turn down how many RDKit returns (based on energy) and if not, talk to the RDKit devs about potentially implementing one

j-wags commented 5 years ago

I agree, David. You had mentioned making a good first issue tag the other day, so I've gone ahead and done that.

hjuinj commented 5 years ago

my understanding of the RDKit distance geometry conformer generation is that the conformers generated are not ranked by their conformer indices. There is no energy threshold or energy ranking, but as Jeff showed in the code above, a pruneRmsThresh to increase the dissimilarity between the generated conformers exists. But the generator will always try to give you back the number of conformers you specified by numConfs. And all hydrogens should be explicit.

j-wags commented 5 years ago

Thanks for the feedback, Shuzhe.

I suspect the problem might actually be with how pruneRmsThresh classifies "heavy atoms". In the documentation[1], it says that the RMSD is computed only for heavy atoms, however the large increase in output confs when H's are added implies that maybe the RMSD algorithm is considering them as heavy.

As David said, I'm not going to try to solve this right now. It could be a good issue for new people on the project to delve into, and we can keep adding any notes/thoughts that come up here.

[1] https://www.rdkit.org/docs/source/rdkit.Chem.rdDistGeom.html?highlight=embedmultipleconfs#rdkit.Chem.rdDistGeom.EmbedMultipleConfs

openforcefield / openff-toolkit

Decide good defaults for RDKitToolkitWrapper.generate_conformers #123

The implementations in the toolkits are as follows: