Open jchodera opened 6 months ago
Let's define exactly what questions we want to answer.
Our goal is to train a model that
The questions we want to answer are
We can use GFN2-xTB to generate lots of data very quickly. It would include lots of molecules of varying sizes, with lots of conformations for each one generated in different ways. Then we could train models on lots of subsets to see how well they achieve the goals.
This would make for an interesting paper.
Here's a more concrete proposal for how this could be done.
For every molecule, start by having RDKit generate five conformations. Starting from each one, generate ten conformations in each of several ways.
That would be a total of 550 conformations for each molecule. We would compute forces and energies with GFN2-xTB. We could then train models on a variety of subsets, evaluating each one to see how well it works on an independent test set (accuracy of forces and energies, stability of trajectories). Here are some tests we could do.
I agree with your assessment of the goals and questions, though I might add one more: "What is the best way to select molecules?"
The suggestion to first generate this data with GFN2-xTB seems reasonable, though there would appear to be value in subsequently repeating it for a true QM level of theory (even if just the faster OpenFF level of theory).
> Simply add random offsets to the atom positions. This would be repeated with three different magnitudes for the displacements, chosen to give roughly similar variation to dynamics with the three temperatures.

What is the rationale behind this choice?
For generating conformers, I think the emphasis on keeping each dataset to 50 conformers/molecule hinders us from addressing some questions. In particular:

- It would be useful to assess the utility of the OpenFF `OptimizationDataset`, where a complete optimization trajectory from each of the 5 original conformers is generated.

> What is the rationale behind this choice?

It's very cheap to do, especially compared to running dynamics with a semi-empirical method, or even worse doing an optimization with the full QM method. And maybe it works just as well. The goal is to find out.
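To make the cost argument concrete, here is a minimal sketch of the random-offset idea in plain Python. It assumes independent Gaussian noise on each Cartesian coordinate. The function names, the example coordinates, and the three displacement magnitudes are hypothetical placeholders, not values proposed in this thread; in practice the magnitudes would be calibrated against the variation seen in dynamics at the three temperatures.

```python
import random


def perturb_positions(positions, sigma, rng):
    """Return a copy of `positions` with Gaussian noise (std. dev. `sigma`)
    added independently to every coordinate."""
    return [tuple(x + rng.gauss(0.0, sigma) for x in atom) for atom in positions]


def generate_perturbed_conformations(positions, sigmas, n_per_sigma, seed=0):
    """Generate `n_per_sigma` perturbed copies of `positions` for each
    displacement magnitude in `sigmas`. Returns a dict keyed by sigma."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return {
        sigma: [perturb_positions(positions, sigma, rng) for _ in range(n_per_sigma)]
        for sigma in sigmas
    }


# Illustrative three-atom geometry in nm (not taken from any dataset).
start = [(0.0, 0.0, 0.0), (0.0957, 0.0, 0.0), (-0.024, 0.0927, 0.0)]

# Hypothetical placeholder magnitudes in nm, one per target temperature.
SIGMAS_NM = (0.002, 0.004, 0.006)

perturbed = generate_perturbed_conformations(start, SIGMAS_NM, n_per_sigma=10)
```

Each perturbed geometry would then be passed to GFN2-xTB for a single-point energy and force evaluation, so the only cost beyond the evaluation itself is the random number generation.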
> It would be useful to assess the utility of the OpenFF OptimizationDataset, where a complete optimization trajectory from each of the 5 original conformers is generated.

That's one of the methods I suggested: an optimization trajectory with the full QM method, which in this case is just GFN2-xTB, but for a real dataset would be something more expensive.
As we continue to explore the best ways to generate data for future iterations of SPICE, it would be useful to apply a variety of dataset generation strategies to a simplified chemical space to enable experiments that can help identify the most useful strategies.
OpenFF has made extensive use of the AlkEthOH dataset that contains only three elements (C, H, O), making it feasible to relatively exhaustively explore the relevant chemical space. The name "AlkEthOH" refers to "alkanes, ethers, and alcohols (OH)".
A few subsets have already been generated at the OpenFF `default` level of theory by QCFractal here:

Examples are below:
- AlkEthOH chain molecules: AlkEthOH_chain.pdf

  ![image](https://github.com/openmm/spice-dataset/assets/3656088/776f00dc-fca5-4e87-a202-64df64f2cb23)
- AlkEthOH with rings: AlkEthOH_rings.pdf

  ![image](https://github.com/openmm/spice-dataset/assets/3656088/bd44b215-df06-4f2e-98b0-669a53421673)
- PhAlkEthOH: PhEthOH.pdf

  ![image](https://github.com/openmm/spice-dataset/assets/3656088/a7bcf563-8101-4c4a-b00e-481884cd25c9)
We could generate several kinds of datasets:

- `OptimizationDataset` from RDKit-enumerated conformers
- `OptimizationDataset` with the number of minimization steps limited to 3-4 steps (requires an argument to geomeTRIC to register success even if the convergence tolerance is not met)
- `TorsionDriveDataset`