openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Explore various dataset generation strategies on simplified chemical space #89

Open jchodera opened 6 months ago

jchodera commented 6 months ago

As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space to enable experiments that can help identify the most useful strategies.

OpenFF has made extensive use of the AlkEthOH dataset that contains only three elements (C, H, O), making it feasible to relatively exhaustively explore the relevant chemical space. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".

A few subsets have already been generated by QCFractal at the OpenFF default level of theory.

Examples are below:

- AlkEthOH chain molecules (AlkEthOH_chain.pdf)
- AlkEthOH with rings (AlkEthOH_rings.pdf)
- PhAlkEthOH (PhEthOH.pdf)

We could generate several kinds of datasets:

peastman commented 6 months ago

Let's define exactly what questions we want to answer.

Our goal is to train a model that

  1. Produces accurate energies.
  2. Produces stable trajectories.
  3. Generalizes to new molecules (within the very limited space of only three elements), including ones larger than anything in the training set.

The questions we want to answer are

  1. How many molecules should the training set contain?
  2. How large should they be?
  3. How many conformations should it have for each molecule?
  4. How should the conformations be selected?

We can use GFN2-xTB to generate large amounts of data very quickly. The dataset would include many molecules of varying sizes, with many conformations for each one generated in different ways. Then we could train models on a variety of subsets to see how well they achieve the goals.
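As a concrete illustration, here is a minimal sketch of a GFN2-xTB single-point calculation through the ASE interface of xtb-python; the water geometry is just a placeholder, not one of the proposed molecules:

```python
import numpy as np
from ase import Atoms
from xtb.ase.calculator import XTB  # xtb-python's ASE calculator

# Placeholder molecule; in practice the geometry would be an
# RDKit-generated conformer of an AlkEthOH molecule.
atoms = Atoms("OH2", positions=[[0.00, 0.00, 0.00],
                                [0.96, 0.00, 0.00],
                                [-0.24, 0.93, 0.00]])
atoms.calc = XTB(method="GFN2-xTB")

energy = atoms.get_potential_energy()  # eV
forces = atoms.get_forces()            # eV/Angstrom
print(energy, np.abs(forces).max())
```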

This would make for an interesting paper.

peastman commented 6 months ago

Here's a more concrete proposal for how this could be done.

For every molecule, start by having RDKit generate five conformations. Starting from each one, generate ten conformations in each of several ways.
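A minimal sketch of the RDKit step, assuming ETKDGv3 embedding; the SMILES is an arbitrary AlkEthOH-like example, not one of the actual dataset molecules:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Example AlkEthOH-like molecule (C, H, O only); hypothetical choice.
mol = Chem.AddHs(Chem.MolFromSmiles("CCOCC(C)O"))

params = AllChem.ETKDGv3()
params.randomSeed = 12345
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=5, params=params)

# Coordinates of each starting conformer as an (n_atoms, 3) array in Angstrom.
starting_positions = [mol.GetConformer(cid).GetPositions() for cid in conf_ids]
```

Each entry of `starting_positions` would then seed the ten-conformation generation methods.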

That would be a total of 550 conformations for each molecule. We would compute forces and energies with GFN2-xTB. We could then train models on a variety of subsets, evaluating each one to see how well it works on an independent test set (accuracy of forces and energies, stability of trajectories). Here are some tests we could do.
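Independent of which specific tests we settle on, here is a minimal sketch of how training subsets could be carved out of such a dataset; the data layout (a dict mapping molecule id to a list of conformation records) is an assumption for illustration:

```python
import random

def make_subset(dataset, n_molecules, n_confs_per_molecule, seed=0):
    """dataset maps molecule id -> list of conformation records
    (coordinates plus GFN2-xTB energy and forces). Returns a training
    subset with the requested numbers of molecules and conformations."""
    rng = random.Random(seed)
    molecules = rng.sample(sorted(dataset), n_molecules)
    return {m: rng.sample(dataset[m], n_confs_per_molecule) for m in molecules}

# Example: vary the axes of questions 1 and 3 independently.
# subsets = {(nm, nc): make_subset(data, nm, nc)
#            for nm in (100, 300, 1000) for nc in (5, 10, 50)}
```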

jchodera commented 6 months ago

I agree with your assessment of the goals and questions (though I may add one more: "What is the best way to select molecules?").

The suggestion to first generate this data with GFN2-xTB seems reasonable, though there would appear to be value in subsequently repeating it at a true QM level of theory (even if just the faster OpenFF level of theory).

> Simply add random offsets to the atom positions. This would be repeated with three different magnitudes for the displacements, chosen to give roughly similar variation to dynamics with the three temperatures.

What is the rationale behind this choice?

For generating conformers, I think the emphasis on keeping each dataset to 50 conformers/molecule hinders us from addressing some questions. In particular:

  1. It would be useful to assess the utility of the OpenFF OptimizationDataset, where a complete optimization trajectory from each of the 5 original conformers is generated.
  2. It would also be useful to have some thermalized data around the QM-optimized conformers: perturbations away from the equilibrium geometry that could inform how well we represent the true shape of, say, bonds and angles around their minimum values. It would be better to generate these perturbations with very low-temperature dynamics, initiated from the end of the QM optimizations in (1) (see the sketch after this list).
  3. It would also be useful to have data from several QM minimization steps (5? 10?) initiated from each of the MM-thermalized snapshots, so we can assess how the minimization process improves representation of the equilibrium ensemble under the QM-like model. This would tell us whether the strategy of running a couple of optimization steps would significantly improve the quality of the existing SPICE dataset.
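For point 2, here is a sketch of what the low-temperature sampling could look like, using ASE's Langevin integrator with the xtb-python calculator; the temperature, friction, and step counts are placeholder values:

```python
from ase import units
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from xtb.ase.calculator import XTB

def low_temperature_snapshots(atoms, n_snapshots=10, temperature_K=50.0):
    """Sample small perturbations around the current (QM-optimized)
    geometry via short, very low-temperature Langevin dynamics."""
    atoms = atoms.copy()
    atoms.calc = XTB(method="GFN2-xTB")
    MaxwellBoltzmannDistribution(atoms, temperature_K=temperature_K)
    dyn = Langevin(atoms, timestep=0.5 * units.fs,
                   temperature_K=temperature_K,
                   friction=0.01)  # friction in ASE units; placeholder
    snapshots = []
    def record():
        snapshots.append(atoms.get_positions().copy())
    dyn.attach(record, interval=20)  # record every 20 steps
    dyn.run(20 * n_snapshots)
    return snapshots
```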
peastman commented 6 months ago

> What is the rationale behind this choice?

It's very cheap to do, especially compared to running dynamics with a semi-empirical method, or even worse doing an optimization with the full QM method. And maybe it works just as well. The goal is to find out.
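For concreteness, a minimal sketch of the random-offset scheme; the displacement magnitudes are placeholders that would have to be calibrated to match the variation produced by dynamics at the three temperatures:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_offset_conformations(positions, sigmas=(0.02, 0.05, 0.10),
                                n_per_sigma=10):
    """Perturb an (n_atoms, 3) conformation by adding isotropic Gaussian
    noise to every atom position, repeated at three magnitudes (Angstrom,
    placeholder values)."""
    return [positions + rng.normal(scale=sigma, size=positions.shape)
            for sigma in sigmas
            for _ in range(n_per_sigma)]
```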

> It would be useful to assess the utility of the OpenFF OptimizationDataset, where a complete optimization trajectory from each of the 5 original conformers is generated.

That's one of the methods I suggested: an optimization trajectory with the full QM method, which in this case is just GFN2-xTB, but for a real dataset would be something more expensive.