openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Explore various dataset generation strategies on simplified chemical space #89

Open jchodera opened 6 months ago

jchodera commented 6 months ago

As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space to enable experiments that can help identify the most useful strategies.

OpenFF has made extensive use of the AlkEthOH dataset that contains only three elements (C, H, O), making it feasible to relatively exhaustively explore the relevant chemical space. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".

A few subsets have already been generated by QCFractal at the OpenFF default level of theory.

Examples are below:

- AlkEthOH chain molecules (AlkEthOH_chain.pdf)
- AlkEthOH with rings (AlkEthOH_rings.pdf)
- PhAlkEthOH (PhEthOH.pdf)

We could generate several kinds of datasets:

peastman commented 6 months ago

Let's define exactly what questions we want to answer.

Our goal is to train a model that

  1. Produces accurate energies.
  2. Produces stable trajectories.
  3. Generalizes to new molecules (within the very limited space of only three elements), including ones larger than anything in the training set.

The questions we want to answer are

  1. How many molecules should the training set contain?
  2. How large should they be?
  3. How many conformations should it have for each molecule?
  4. How should the conformations be selected?

We can use GFN2-xTB to generate large amounts of data very quickly. The dataset would include many molecules of varying sizes, with many conformations for each one generated in different ways. Then we could train models on a variety of subsets to see how well they achieve the goals.
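As a concrete illustration, here is a minimal sketch of a GFN2-xTB single-point calculation through the ASE interface of xtb-python; the water geometry is just a placeholder, not one of the proposed molecules:

```python
import numpy as np
from ase import Atoms
from xtb.ase.calculator import XTB  # xtb-python's ASE calculator

# Placeholder molecule; in practice the geometry would be an
# RDKit-generated conformer of an AlkEthOH molecule.
atoms = Atoms("OH2", positions=[[0.00, 0.00, 0.00],
                                [0.96, 0.00, 0.00],
                                [-0.24, 0.93, 0.00]])
atoms.calc = XTB(method="GFN2-xTB")

energy = atoms.get_potential_energy()  # eV
forces = atoms.get_forces()            # eV/Angstrom
print(energy, np.abs(forces).max())
```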

This would make for an interesting paper.

peastman commented 6 months ago

Here's a more concrete proposal for how this could be done.

For every molecule, start by having RDKit generate five conformations. Starting from each one, generate ten conformations in each of several ways.
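A minimal sketch of the RDKit step, assuming ETKDGv3 embedding; the SMILES is an arbitrary AlkEthOH-like example, not one of the actual dataset molecules:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Example AlkEthOH-like molecule (C, H, O only); hypothetical choice.
mol = Chem.AddHs(Chem.MolFromSmiles("CCOCC(C)O"))

params = AllChem.ETKDGv3()
params.randomSeed = 12345
conf_ids = AllChem.EmbedMultipleConfs(mol, numConfs=5, params=params)

# Coordinates of each starting conformer as an (n_atoms, 3) array in Angstrom.
starting_positions = [mol.GetConformer(cid).GetPositions() for cid in conf_ids]
```

Each entry of `starting_positions` would then seed the ten-conformation generation methods.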

That would be a total of 550 conformations for each molecule. We would compute forces and energies with GFN2-xTB. We could then train models on a variety of subsets, evaluating each one to see how well it works on an independent test set (accuracy of forces and energies, stability of trajectories). Here are some tests we could do.
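Independent of which specific tests we settle on, here is a minimal sketch of how training subsets could be carved out of such a dataset; the data layout (a dict mapping molecule id to a list of conformation records) is an assumption for illustration:

```python
import random

def make_subset(dataset, n_molecules, n_confs_per_molecule, seed=0):
    """dataset maps molecule id -> list of conformation records
    (coordinates plus GFN2-xTB energy and forces). Returns a training
    subset with the requested numbers of molecules and conformations."""
    rng = random.Random(seed)
    molecules = rng.sample(sorted(dataset), n_molecules)
    return {m: rng.sample(dataset[m], n_confs_per_molecule) for m in molecules}

# Example: vary the axes of questions 1 and 3 independently.
# subsets = {(nm, nc): make_subset(data, nm, nc)
#            for nm in (100, 300, 1000) for nc in (5, 10, 50)}
```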

jchodera commented 6 months ago

I agree with your assessment of the goals and questions (though I may add one more: "What is the best way to select molecules?").

The suggestion to first generate this data with GFN2-xTB seems reasonable, though there would appear to be value in subsequently repeating it at a true QM level of theory (even if just the faster OpenFF level of theory).

> Simply add random offsets to the atom positions. This would be repeated with three different magnitudes for the displacements, chosen to give roughly similar variation to dynamics with the three temperatures.

What is the rationale behind this choice?

For generating conformers, I think the emphasis on keeping each dataset to 50 conformers/molecule hinders us from addressing some questions. In particular:

  1. It would be useful to assess the utility of the OpenFF OptimizationDataset, where a complete optimization trajectory from each of the 5 original conformers is generated.
  2. It would also be useful to have some thermalized data around the QM-optimized conformers: perturbations away from the equilibrium geometry that could inform how well we represent the true shape of, say, bonds and angles around their minimum values. It would be better to generate these perturbations with very low-temperature dynamics, initiated from the end of the QM optimizations in (1) (see the sketch after this list).
  3. It would also be useful to have data from several QM minimization steps (5? 10?) initiated from each of the MM-thermalized snapshots, so we can assess how the minimization process improves representation of the equilibrium ensemble under the QM-like model. This would tell us whether the strategy of running a couple of optimization steps would significantly improve the quality of the existing SPICE dataset.
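For point 2, here is a sketch of what the low-temperature sampling could look like, using ASE's Langevin integrator with the xtb-python calculator; the temperature, friction, and step counts are placeholder values:

```python
from ase import units
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import MaxwellBoltzmannDistribution
from xtb.ase.calculator import XTB

def low_temperature_snapshots(atoms, n_snapshots=10, temperature_K=50.0):
    """Sample small perturbations around the current (QM-optimized)
    geometry via short, very low-temperature Langevin dynamics."""
    atoms = atoms.copy()
    atoms.calc = XTB(method="GFN2-xTB")
    MaxwellBoltzmannDistribution(atoms, temperature_K=temperature_K)
    dyn = Langevin(atoms, timestep=0.5 * units.fs,
                   temperature_K=temperature_K,
                   friction=0.01)  # friction in ASE units; placeholder
    snapshots = []
    def record():
        snapshots.append(atoms.get_positions().copy())
    dyn.attach(record, interval=20)  # record every 20 steps
    dyn.run(20 * n_snapshots)
    return snapshots
```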
peastman commented 6 months ago

> What is the rationale behind this choice?

It's very cheap to do, especially compared to running dynamics with a semi-empirical method, or even worse doing an optimization with the full QM method. And maybe it works just as well. The goal is to find out.
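For concreteness, a minimal sketch of the random-offset scheme; the displacement magnitudes are placeholders that would have to be calibrated to match the variation produced by dynamics at the three temperatures:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_offset_conformations(positions, sigmas=(0.02, 0.05, 0.10),
                                n_per_sigma=10):
    """Perturb an (n_atoms, 3) conformation by adding isotropic Gaussian
    noise to every atom position, repeated at three magnitudes (Angstrom,
    placeholder values)."""
    return [positions + rng.normal(scale=sigma, size=positions.shape)
            for sigma in sigmas
            for _ in range(n_per_sigma)]
```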

> It would be useful to assess the utility of the OpenFF OptimizationDataset, where a complete optimization trajectory from each of the 5 original conformers is generated.

That's one of the methods I suggested: an optimization trajectory with the full QM method, which in this case is just GFN2-xTB, but for a real dataset would be something more expensive.