openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

How to generate conformations #2

Closed: peastman closed this issue 2 years ago

peastman commented 3 years ago

We need to decide how to generate conformations for each molecule. We want a diverse set for each one, and we want to include both low energy and high energy conformations (keeping in mind that low energy for an MD force field might not be low energy for QM). We also want this to be fast. If selecting the conformations takes as long as doing the final computations, that will waste much of our computing budget.

Here is an initial proposal.

  1. Run high temperature MD and keep some fixed number (say 25) of conformations drawn from the trajectory. For the very small molecules we'll be working with, 1 ns at 500K should be plenty, possibly excessive.
  2. For each of those conformations, do a few minimization steps with a fast QM method to create a nearby lower energy conformation.

What method should we use for the minimization? Preferably it should be something very fast, e.g. semiempirical.

jchodera commented 3 years ago

@pavankum : Do QCFractal JSON specs enable one level of theory to be specified for minimization and another for energy/gradient computation at the end of minimization? Or does the same level of theory need to be used throughout in an OptimizationDataset?

jchodera commented 3 years ago

Run high temperature MD and keep some fixed number (say 25) of conformations drawn from the trajectory. For the very small molecules we'll be working with, 1 ns at 500K should be plenty, possibly excessive.

@peastman: If we're only generating 25 conformations, wouldn't the standard conformer generation from RDKit/OpenEye toolkits be sufficient?
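For concreteness, the toolkit route being suggested might look something like this RDKit sketch (the SMILES and settings are illustrative, not a prescription):

```python
# Sketch of standard conformer generation with RDKit's ETKDG embedder.
# The molecule and settings are illustrative only.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.AddHs(Chem.MolFromSmiles('CC(=O)N[C@@H](C)C(=O)NC'))  # alanine dipeptide
params = AllChem.ETKDGv3()
params.randomSeed = 42
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=25, params=params)

# Optionally relax each embedded conformer with MMFF94.
AllChem.MMFFOptimizeMoleculeConfs(mol)
```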

pavankum commented 3 years ago

@pavankum : Do QCFractal JSON specs enable one level of theory to be specified for minimization and another for energy/gradient computation at the end of minimization? Or does the same level of theory need to be used throughout in an OptimizationDataset?

Right now we have to use the same level of theory throughout an optimization, and if we want to compute the gradients at the end with a different level of theory then we have to create another dataset from these final geometries.

jchodera commented 3 years ago

@peastman: QCFractal uses geomeTRIC for optimization, which is highly efficient in the first few steps, even (especially?) for big, complex molecules, e.g.:

[figure: energy vs. optimization step with geomeTRIC]

Using the desired level of theory and fixing the number of steps to 5-10 would certainly make the calculations more expensive than optimizing with a very low level of theory, but would give us a wealth of data both at high energies and more reasonable energies to build models from. It would also be incredibly easy since no extra engineering is necessary, and much quicker than our normal OptimizationDataset calculations because we won't be optimizing to convergence. So it might be a decent compromise.

peastman commented 3 years ago

I was proposing something quite different from that. The full generation of conformations would be run in a local script unrelated to QCFractal. Using a fast semiempirical method, the whole process should take only a few seconds per conformation in total and can be done on a local workstation.

Using the desired level of theory and fixing the number of steps to 5-10 would certainly make the calculations more expensive than optimizing with a very low level of theory, but would give us a wealth of data both at high energies and more reasonable energies to build models from.

That would give you 5-10 conformations that are all very close to each other, which is a highly inefficient use of computational resources. We need wide sampling of the whole conformation space of each molecule. I'm proposing doing the expensive computation for 50 conformations total for each molecule. If they're in groups of 10 that are all very close to each other, we would need many times more conformations.

If we're only generating 25 conformations, wouldn't the standard conformer generation from RDKit/OpenEye toolkits be sufficient?

It isn't suitable for that purpose. They mostly just sample dihedrals and minimize based on an MM force field. It won't give you high energy conformations, and it won't sample bond lengths, angles, interatomic distances, etc. We need a realistic sampling of the sort of conformations that actually come up in MD simulations.

jchodera commented 3 years ago

That would give you 5-10 conformations that are all very close to each other

[figure: energies along an optimization trajectory]

Close together in configuration space, but spanning high to low energy.

To be clear: I don't doubt your scheme would be efficient if your cheap QM surrogate on your workstation works well. I doubt you will be able to find a suitably cheap QM surrogate that does a good job of reproducing low-energy geometries of ωB97M-V/def2-TZVPPD to the point where it is actually useful and fast. (If not, why aren't we just using that level of theory?)

jthorton commented 3 years ago

Using the desired level of theory and fixing the number of steps to 5-10 would certainly make the calculations more expensive than optimizing with a very low level of theory, but would give us a wealth of data

Unfortunately, unless the geometry converges within the 10 steps, this would be recorded as a failed optimization by QCFractal and no data would be saved. If we go this route, the pre-optimisation would have to be done outside of QCFractal and the final geometries could be loaded in and a dataset made from them.

For fast semiempirical methods, I would recommend something like xTB which has a lot of useful features like geometry optimisation and conformation search workflows.
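As a concrete illustration, a local pre-optimisation with xTB can be as simple as shelling out to the binary (a sketch; the file name is illustrative and xtb must be on the PATH):

```python
# Sketch: GFN2-xTB geometry optimisation by invoking the xtb binary.
import subprocess

# Optimise the geometry in input.xyz with the GFN2-xTB Hamiltonian.
subprocess.run(['xtb', 'input.xyz', '--opt', '--gfn', '2'], check=True)
# xtb writes the optimised structure to xtbopt.xyz in the working directory.
```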

peastman commented 3 years ago

Since a picture is worth 1000 words, here are a few thousand words' worth of pictures to illustrate why I keep saying a minimization trajectory is a bad way to sample a function. First I created a random 2D function with a global minimum in the center.

[figure: random 2D function with a global minimum in the center]

Here is a minimization trajectory consisting of ten points.

[figure: ten points along a minimization trajectory]

I evaluated the function at those points and fit a new function to them. It looks nothing like the original one.

[figure: function fit to the minimization-trajectory points]

Next I sampled it at a different set of ten points, this time scattering them randomly over the area around the minimum.

[figure: ten points scattered randomly around the minimum]

Here is the function fit to those points. It does a much better job of reconstructing the original one.

[figure: function fit to the randomly scattered points]

It would work even better if I distributed them evenly rather than randomly. For a molecule, we could generate more conformations than we actually want, then select the subset whose pairwise RMSDs are as large as possible.
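A greedy max-min selection along those lines might look like this sketch, assuming the pairwise RMSD matrix has already been computed (names are illustrative):

```python
# Sketch: greedily pick k conformations whose pairwise RMSDs are as large
# as possible, given an n x n matrix of precomputed RMSDs.
import numpy as np

def select_diverse(rmsd, k):
    # Start from the conformation farthest from all others on average.
    selected = [int(np.argmax(rmsd.sum(axis=1)))]
    while len(selected) < k:
        # For each candidate, find its distance to the nearest selected point,
        # excluding indices that have already been picked...
        min_dist = rmsd[:, selected].min(axis=1)
        min_dist[selected] = -np.inf
        # ...and pick the candidate that maximizes that distance.
        selected.append(int(np.argmax(min_dist)))
    return selected
```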

jchodera commented 3 years ago

Unfortunately, unless the geometry converges within the 10 steps, this would be recorded as a failed optimization by QCFractal and no data would be saved. If we go this route, the pre-optimisation would have to be done outside of QCFractal and the final geometries could be loaded in and a dataset made from them.

@jthorton : Are the convergence criteria really hard-coded? If so, we should definitely expose those criteria in the JSON.

peastman commented 3 years ago

I doubt you will be able to find a suitably cheap QM surrogate that does a good job of reproducing low-energy geometries of ωB97M-V/def2-TZVPPD to the point where it is actually useful and fast. (If not, why aren't we just using that level of theory?)

It doesn't need to be especially accurate, because we don't need conformations to be highly minimized. We just need them to be distributed over the region containing the true minimum, and for the majority of them to not have crazy high energies. Let me see if I can come up with some tests to demonstrate that.

jchodera commented 3 years ago

and for the majority of them to not have crazy high energies.

That's specifically what I'm worried about. I don't think this will happen. Is there a way to check that assumption specifically?

peastman commented 3 years ago

That's what I plan to test. I'll generate some conformations for a real molecule and compare their energies with the classical force field, a fast semiempirical method, and a DFT functional (not ωB97M-V/def2-TZVPPD, but something I can compute more quickly while still being a lot more accurate than the semiempirical one).

peastman commented 3 years ago

Here are my first results. I ran MD for alanine dipeptide at 500K. I saved a conformation every 100 ps to produce five independent conformations. Next I ran a local energy minimization on each one to produce a lower energy conformation (as computed by Amber14).

For comparison, I then computed the energies of those ten conformations with ωB97X-D3/def2-TZVP. Here's a plot comparing the two methods for the ten conformations. I subtracted off the mean energy over all conformations to remove the constant offset between methods.

[figure: Amber14 vs. ωB97X-D3/def2-TZVP relative energies for the ten conformations]

Wow.

I expected Amber to do a decent job of predicting QM energies. That's what it was trained on after all. But I didn't expect it to be quite that good.

At least as far as alanine dipeptide is concerned, clearly our concerns that MD won't produce representative conformations are unfounded. Do you expect it to be worse on other molecules? Can you suggest a different one to try?

jchodera commented 3 years ago

Can you share your code for that? This honestly looks suspiciously too good!

Any chance you could try some small molecules with gaff-1.81 or gaff-2.11 via openmmforcefields?

peastman commented 3 years ago

Here's the script that generated the conformations and computed the Amber energies.

```python
from openmm import *
from openmm.app import *
from openmm.unit import *

# Load alanine dipeptide and build an Amber14 system.
pdb = PDBFile('alanine-dipeptide.pdb')
ff = ForceField('amber14-all.xml')
system = ff.createSystem(pdb.topology)

# 500 K Langevin dynamics with a 1 fs time step.
integrator = LangevinMiddleIntegrator(500*kelvin, 1/picosecond, 0.001*picoseconds)
context = Context(system, integrator, Platform.getPlatformByName('CPU'))
context.setPositions(pdb.positions)

# Save a conformation every 100 ps (100,000 steps) to get five independent states.
states = []
for i in range(5):
    integrator.step(100000)
    state = context.getState(getPositions=True, getEnergy=True)
    states.append(state)
    print(state.getPotentialEnergy().in_units_of(kilocalories_per_mole))

# Minimize each saved state to produce a nearby lower energy conformation.
for i in range(5):
    context.setState(states[i])
    LocalEnergyMinimizer.minimize(context)
    state = context.getState(getPositions=True, getEnergy=True)
    states.append(state)
    print(state.getPotentialEnergy().in_units_of(kilocalories_per_mole))

# Serialize all ten states to XML files.
for i in range(len(states)):
    with open(f'state{i+1}.xml', 'w') as output:
        output.write(XmlSerializer.serialize(states[i]))
```

Here's the script that read them back in and computed the energies with Psi4.

```python
from openmm import *
from openmm.app import *
from openmm.unit import *
import psi4

pdb = PDBFile('alanine-dipeptide.pdb')
atoms = list(pdb.topology.atoms())

def computeEnergy(state):
    # Build the geometry block: atom count, charge and spin multiplicity,
    # then one line per atom with its element and coordinates in angstroms.
    lines = ['22', '0 1']
    coords = state.getPositions().value_in_unit(angstrom)
    for atom, c in zip(atoms, coords):
        lines.append(f'{atom.element.symbol}    {c[0]}    {c[1]}    {c[2]}')
    mol = psi4.geometry('\n'.join(lines))
    # Single point energy, converted from hartrees to kcal/mol.
    energy = psi4.energy('WB97X-D3/def2-TZVP', molecule=mol)
    return energy*psi4.constants.hartree2kcalmol

# Read the serialized states back in and recompute their energies.
energies = []
for i in range(10):
    with open(f'state{i+1}.xml') as input:
        state = XmlSerializer.deserialize(input.read())
        energies.append(computeEnergy(state))

for energy in energies:
    print(energy)
```

peastman commented 3 years ago

If you want to run it yourself, the input PDB file I used is https://github.com/openmm/openmm/blob/master/wrappers/python/tests/systems/alanine-dipeptide-implicit.pdb.

jchodera commented 3 years ago

Do you really have to form the geometry input manually? Is there a way to go from any reasonable molecule representation (OEMol, RDMol, OpenFF Molecule) into a molecule for psi4?

peastman commented 3 years ago

I based it on the documentation at https://psicode.org/psi4manual/master/psiapi.html. I don't know if there's another way of doing it. That's how the examples all work.
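One alternative (a sketch, not necessarily the idiomatic route) is to build the geometry string from an RDKit molecule's embedded conformer rather than writing the lines by hand; the molecule here is illustrative:

```python
# Sketch: construct a Psi4 molecule from an RDKit molecule with 3D coordinates.
from rdkit import Chem
from rdkit.Chem import AllChem
import psi4

rdmol = Chem.AddHs(Chem.MolFromSmiles('CCO'))  # illustrative molecule
AllChem.EmbedMolecule(rdmol)  # generate a 3D conformer

conf = rdmol.GetConformer()
lines = ['0 1']  # charge and spin multiplicity
for atom in rdmol.GetAtoms():
    pos = conf.GetAtomPosition(atom.GetIdx())
    lines.append(f'{atom.GetSymbol()} {pos.x} {pos.y} {pos.z}')
mol = psi4.geometry('\n'.join(lines))
```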

peastman commented 3 years ago

I looked through the MiniDrugBank file and arbitrarily chose the one with ID DrugBank_5373. It's 28 atoms (I wanted a small one to keep it fast), uncharged, and includes a sulfur atom. For the MD force field I used OpenFF 2.0.0. I repeated the same procedure exactly as before. Here's the plot of energies in kcal/mol.

[figure: OpenFF 2.0.0 vs. DFT relative energies for DrugBank_5373]

It's not quite as good as Amber, but still plenty accurate enough to use for generating conformations. It's impressive to see a generic force field doing this well.
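For reference, parameterizing a small molecule with OpenFF 2.0.0 through openmmforcefields might look like this sketch (the SMILES is a placeholder, not DrugBank_5373):

```python
# Sketch: build an OpenMM System for a small molecule with OpenFF 2.0.0.
from openff.toolkit.topology import Molecule
from openmmforcefields.generators import SystemGenerator

mol = Molecule.from_smiles('c1ccsc1')  # placeholder molecule containing sulfur
mol.generate_conformers(n_conformers=1)

generator = SystemGenerator(small_molecule_forcefield='openff-2.0.0',
                            molecules=[mol])
system = generator.create_system(mol.to_topology().to_openmm())
```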

peastman commented 3 years ago

Based on these results, here's a revised proposal.

  1. Run 1 ns of MD at 500K, saving a state every 10 ps.
  2. Out of those 100 states, select the 25 that maximize the RMSD between states.
  3. For each of those, perform a few iterations of L-BFGS minimization to generate a lower energy (but not fully minimized) state.
  4. Include both the higher energy and partially minimized states (50 total for each molecule).

How many iterations of minimization should we perform? Each iteration is a line minimization, not just a single step, so the energy goes down quickly. A way of looking at that is, "What fraction of the excess energy should we remove?" I ran this procedure on a dipeptide consisting of two valine residues, minimizing for varying numbers of iterations. Here's how the energy decreases, averaged over 25 structures. 1.0 means the energy before minimizing and 0.0 is the energy after minimizing to convergence.

[figure: fraction of excess energy remaining vs. number of minimization iterations]

After just two iterations, it has already removed half the "excitation" energy (above the nearest local minimum). By five iterations, it has removed 75% of it.
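A sketch of how that fraction can be measured with OpenMM, reusing a Context and a saved State as in the earlier scripts (the function name is illustrative):

```python
# Sketch: fraction of the excess energy remaining after a limited number of
# L-BFGS iterations, where 1.0 = unminimized and 0.0 = fully converged.
from openmm import LocalEnergyMinimizer
from openmm.unit import kilojoules_per_mole

def excess_fraction(context, state, iterations):
    def energy():
        return context.getState(getEnergy=True).getPotentialEnergy().value_in_unit(kilojoules_per_mole)

    context.setState(state)
    e0 = energy()  # energy before minimizing
    LocalEnergyMinimizer.minimize(context, maxIterations=iterations)
    ek = energy()  # energy after a few iterations
    context.setState(state)
    LocalEnergyMinimizer.minimize(context)  # minimize to convergence
    emin = energy()
    return (ek - emin)/(e0 - emin)
```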

jchodera commented 3 years ago

Shouldn't we at least generate diverse conformers before running MD? Otherwise we may not even escape torsion barriers or see ring puckerings.

We could alternatively run very long simulations.

peastman commented 3 years ago

I'll try it and see if it makes a difference.

peastman commented 3 years ago

That does help increase the overall diversity. Ok, so modify step 1 to

davidlmobley commented 3 years ago

Diverse conformers should be a good contribution.

(This is a very interesting thread.)

SimonBoothroyd commented 3 years ago

That does help increase the overall diversity. Ok, so modify step 1 to

This is what I was thinking as well - good to see that it helps!

How many iterations of minimization should we perform? Each iteration is a line minimization, not just a single step, so the energy goes down quickly.

It would be interesting to see the spread if we just (i) took each diverse 500 K conformer, (ii) ran a very short simulation at 298 K and took the resulting conformer, and (iii) did a full L-BFGS minimization, giving three conformers per each step 2 conformer.

It may also be good to add a step 5) where you do a quick round of pruning based on the final RMSD matrix to handle the cases where the 'diverse' high energy conformers have minimised / quenched to the same / very similar minima.

peastman commented 3 years ago

It may also be good to add a step 5) where you do a quick round of pruning based on the final RMSD matrix to handle the cases where the 'diverse' high energy conformers have minimised / quenched to the same / very similar minima.

I also thought about that. If we were fully minimizing the conformations that would be important, since you could easily get two converging to the same minimum. But if we're only doing 2 or 3 iterations of minimization I don't think it's necessary. They'll still be different from each other. And if several states all converge toward the same minimum, that suggests it's a place where simulations will spend a lot of time. Having multiple samples in the area surrounding that minimum is a good thing.

peastman commented 3 years ago

I've been running experiments with the molecules in #5 to see how well this system works at generating diverse conformations. For the 25 high energy conformations it works quite well, with typical RMSD between conformations of around 3 Å. I'm less sure about the energy minimization, mostly because I'm just not sure what's a good metric.

Three iterations of minimization have only a tiny effect on the coordinates: the RMSD from the original conformation is typically only about 0.04 Å. On the other hand, they still have a huge effect on the forces and energy. The average force magnitude is around 3x smaller. So it moves a very short distance in a steeply downhill direction, leading to a much lower energy conformation. Is that good or bad? I'm really not sure.

Another option is to follow up the minimization with a bit more MD, but at a much lower temperature. If I run 1 ps at 100K, the RMSD goes up to around 1 Å and the average force magnitude is still about 2-3x smaller than before minimizing. Possibly that will be more useful for sampling the energy surface?
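A sketch of that variant, assuming a Context and Langevin-type integrator set up as in the earlier script (names and step counts are illustrative):

```python
# Sketch: partially minimize, then run a short burst of 100 K dynamics.
from openmm import LocalEnergyMinimizer
from openmm.unit import kelvin

def quench_and_relax(context, integrator, steps=1000, temperature=100*kelvin):
    LocalEnergyMinimizer.minimize(context, maxIterations=3)
    integrator.setTemperature(temperature)  # assumes a Langevin-type integrator
    context.setVelocitiesToTemperature(temperature)
    integrator.step(steps)  # 1000 steps at a 1 fs time step = 1 ps
```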

peastman commented 2 years ago

Closing since version 1 is now released.