openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Simulating with trained models #45

Open peastman opened 1 year ago

peastman commented 1 year ago

I've started investigating whether it's possible to run stable MD simulations using models trained on SPICE. I'm opening this issue as a place to describe my results and discuss approaches.

As a starting point, I tried running simulations with the most accurate of the five models described in the paper. As part of the analysis, I had sorted every molecule in the dataset by MAE across all its samples. Intuitively, I would expect that molecules with low error would be more likely to be stable than ones with high error. So I took the sorted list of molecules and chose 21 that were evenly spaced over the first third of the list. This gives us an assortment of molecules that cover a range of MAE, but all of them are ones on which the model is at least reasonably accurate.
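
A minimal sketch of that selection procedure (hypothetical data structure, illustrative only): rank molecules by per-molecule MAE, keep the best-scoring third, and take 21 evenly spaced entries from it.

```python
import numpy as np

def select_test_molecules(mae_by_smiles, n_select=21):
    """mae_by_smiles: dict mapping SMILES -> MAE over all samples of that molecule."""
    ranked = sorted(mae_by_smiles, key=mae_by_smiles.get)      # lowest MAE first
    best_third = ranked[: len(ranked) // 3]                    # keep the best-scoring third
    idx = np.linspace(0, len(best_third) - 1, n_select).round().astype(int)
    return [best_third[i] for i in idx]
```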

For each molecule, I had OpenFF-Toolkit generate a conformation with generate_conformers(). This gives us a conformation that isn't energy minimized but should be generally reasonable. I used the model to compute the potential energy and mean force magnitude (over all atoms). If the mean force was less than 2000 kJ/mol/nm, I attempted to run a short MD simulation (5 ps at 300K with a time step of 0.5 fs) and recorded whether it was stable. I considered a simulation to be stable if at the end, the potential energy was negative and the mean force magnitude was less than 5000 kJ/mol/nm. That isn't a very stringent test, but it's good enough to detect most cases where the simulation is obviously unstable.
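
A minimal sketch of that stability test, assuming the trained model has already been wrapped as an OpenMM System (for example via openmm-torch); `build_system()` is a hypothetical helper, and the thresholds are the ones quoted above.

```python
import numpy as np
import openmm
import openmm.unit as unit

def energy_and_mean_force(context):
    state = context.getState(getEnergy=True, getForces=True)
    energy = state.getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)
    forces = state.getForces(asNumpy=True).value_in_unit(
        unit.kilojoules_per_mole / unit.nanometer)
    return energy, float(np.linalg.norm(forces, axis=1).mean())

system, positions = build_system(smiles)   # hypothetical: model + generated conformation
integrator = openmm.LangevinMiddleIntegrator(
    300 * unit.kelvin, 1.0 / unit.picosecond, 0.5 * unit.femtoseconds)
context = openmm.Context(system, integrator)
context.setPositions(positions)

energy, mean_force = energy_and_mean_force(context)
stable = None
if mean_force < 2000:                      # only attempt MD below the force threshold
    integrator.step(10000)                 # 5 ps at 0.5 fs
    energy, mean_force = energy_and_mean_force(context)
    stable = energy < 0 and mean_force < 5000
```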

Next I went back to the conformation generated by OpenFF-Toolkit, performed an energy minimization, and repeated the process again: compute the energy and mean force, and attempt a simulation if the mean force is less than 2000 kJ/mol/nm.

Here are the results. All energies are in kJ/mol, and all forces are in kJ/mol/nm.

| SMILES | MAE | Initial Energy | Initial Force | Initial Stable | Minimized Energy | Minimized Force | Minimized Stable |
|---|---|---|---|---|---|---|---|
| O.[Br-] | 0.52 | -697.07 | 174.04 | Yes | -1098.26 | 4.70 | Yes |
| CF.O | 1.16 | -2137.35 | 379.27 | Yes | -3016.61 | 6.78 | Yes |
| CF.c1c[nH+]c[nH]1 | 1.44 | -3444.68 | 1385.62 | No | -51228.83 | 295.14 | No |
| CNC(C)=O.O | 1.66 | -3919.71 | 2905.47 | | -27933.60 | 197892.18 | |
| C1COCOC1.c1ccccc1 | 1.83 | -1597.14 | 13229.65 | | -200872.56 | 238.82 | No |
| CCSc1c(Cl)c(Cl)nc(Cl)c1Cl | 2.00 | -10025.96 | 6219.34 | | -91788.04 | 101.68 | Yes |
| BrCCBr.C=C | 2.19 | -3509.03 | 1497.34 | Yes | -8341.84 | 185.78 | Yes |
| Brc1nc(Br)c2[nH]cnc2n1 | 2.36 | -2326.06 | 1118.34 | Yes | -5735.17 | 5314.33 | |
| COC=O.c1c[nH]cn1 | 2.53 | -4924.22 | 4972.37 | | -37282.98 | 49163.46 | |
| FCF.Oc1ccccc1 | 2.69 | -5243.82 | 1844.34 | Yes | -17793.25 | 3832.58 | |
| CC(C)=O.CI | 2.85 | -2624.62 | 2091.93 | | -9437.92 | 9966.54 | |
| CC1(C)Oc2cc(C(=O)c3ccncc3F)cc(O)c2[C@@H]2CC@HCC[C@H]21 | 3.00 | 67250.36 | 976154.64 | | -37277040.00 | 75624193.94 | |
| NS(=O)(=O)Oc1ccc2c(c1)OCC=C2 | 3.14 | 7196.77 | 34369.72 | | -120631.40 | 1771573.72 | |
| CC(C)=O.CC(N)=O | 3.28 | -3794.45 | 2386.33 | | -13649.93 | 103504.58 | |
| CC(I)I | 3.41 | -2027.48 | 400.34 | Yes | -2574.42 | 3.05 | Yes |
| CNC(N)=[NH2+].[F-] | 3.52 | -2136.38 | 6432.99 | | -7764.10 | 15876.81 | |
| CC1(O)CCC(C(C)(O)C(=O)O)C(=O)C1 | 3.63 | 2083.24 | 27395.65 | | -275149.12 | 6261686.46 | |
| c1ccc2c(C[n+]3csc4ccccc43)cccc2c1 | 3.73 | 144824.92 | 277838.14 | | -9574604.00 | 45962222.49 | |
| O=C([O-])C12CCC(CC1)[NH2+]2 | 3.83 | -2885.28 | 5103.62 | | -66048.79 | 692112.47 | |
| CNC(=O)C@HNC(=O)C@@HNC(C)=O | 3.93 | 109911.96 | 203436.32 | | -34306652.00 | 40925424.19 | |
| CCCCS(=O)(=O)c1sc2nc3c(cc2c1N)CCCC3 | 4.02 | 516.54 | 172413.99 | | -9241010.00 | 14776114.24 | |

A blank entry in a Stable column means no simulation was attempted, because the mean force exceeded the 2000 kJ/mol/nm threshold.

Some observations:

peastman commented 1 year ago

I analyzed the cases where the optimizer led to huge forces. In every case, it did it by putting two atoms directly on top of each other (distance between them on the order of 1e-5 nm or less). It's not surprising the model would have no idea what to do in that case, since there's no training data with atoms right on top of each other. You would hope there would be enough of a barrier to keep them from getting too close, but L-BFGS works through line minimizations that are pretty good at jumping over barriers.
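
As an illustrative check (not the exact analysis code), the failure mode is easy to spot by reporting the closest pair of atoms in the minimized structure; `minimized_positions` is a hypothetical (N, 3) array of positions in nm.

```python
import numpy as np

def closest_pair(positions):
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)            # ignore self-distances
    i, j = np.unravel_index(np.argmin(dist), dist.shape)
    return (i, j), dist[i, j]

pair, d = closest_pair(minimized_positions)   # hypothetical minimized coordinates in nm
if d < 1e-4:                                  # atoms essentially on top of each other
    print(f"atoms {pair} collapsed to a separation of {d:.1e} nm")
```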

A possible solution would be to add an explicit repulsion term to the model. For example, SpookyNet includes a ZBL potential as a prior to ensure physically realistic behavior at short distances. That seems like a good thing to try.

jchodera commented 1 year ago

Starting with a physically motivated "prior" potential and forcing the ML model to learn the difference sounds like a good approach if the domain of desired application matches the domain over which the model is reasonable.

The SpookyNet paper describes the ZBL potential as follows (equation screenshot from the paper not reproduced here).
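
For reference, the universal ZBL screened nuclear repulsion is usually written as below ($a_0$ is the Bohr radius; this is the standard literature form, transcribed here rather than taken from the paper's screenshot):

$$
E_{\mathrm{ZBL}}(r_{ij}) = \frac{Z_i Z_j e^2}{4\pi\varepsilon_0\, r_{ij}}\,\phi\!\left(\frac{r_{ij}}{a}\right), \qquad a = \frac{0.8854\, a_0}{Z_i^{0.23} + Z_j^{0.23}}
$$

$$
\phi(x) = 0.1818\, e^{-3.2 x} + 0.5099\, e^{-0.9423 x} + 0.2802\, e^{-0.4029 x} + 0.02817\, e^{-0.2016 x}
$$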

It would appear that this should describe a domain including both bonded and noncovalent interactions, which is useful to us.

I am still very much hoping we can eliminate the need for formal charges from our model, since it vastly complicates things by making the model dependent on a specific cheminformatics model and drastically limits its domain of applicability. Since @raimis has achieved equivalent performance without the need for these charges, can we first push the model as far as we can without the need for charges, while including something like ZBL to eliminate regions outside our training domain?

One other note about our domain of applicability: Since we used thermalized samples from an MM potential, we effectively excluded anything in the high-energy region of the MM model---as well as snapshots near minima---from our domain of applicability (though some small amount of extrapolation is expected from this region since we use QM forces even though we constrain to MM positions).

@peastman : Do you know if our QM-generated dataset ended up including snapshots with gradients that were close to a minimum? If not, and we want users to be able to minimize these structures, why don't we add some OptimizationDatasets for some subsets, so that we also include data in the vicinity of minima?

peastman commented 1 year ago

Do you know if our QM-generated dataset ended up including snapshots with gradients that were close to a minimum?

Nearly half the data in the dataset was for low energy conformations close to local minima. See the paper for a description of how it was generated.

jchodera commented 1 year ago

Ah, yes---low energy with respect to MM. (I had forgotten about that!)

Were you able to verify the QM force magnitudes were also low for these conformations? I realize a test of the hypothesis that minimized MM conformations were also close to being QM minima (or at least low-force snapshots) would have been a good addition to the manuscript.

peastman commented 1 year ago

I've been training models with a variety of physical priors. Here's what I've tried so far, and what effect each one has.

ZBL repulsive potential

This models the screened Coulomb repulsion between nuclei at short distances. Adding it had no significant effect on the accuracy of fitting data. A model including it came in right in the middle of the five models I had trained before. But it has a huge effect on stability. I tested the same 21 molecules as before, and this time it successfully minimized and simulated 17 of them (compared to five before). That's only for 1000 steps, so it's not the most rigorous test, but it's still a huge improvement.

D2 dispersion correction

Including this maybe helped accuracy a little bit? I trained two models including it. Both of them ended up with better than average accuracy for the previous five models, but not as good as the best of them. If it does help, the benefit is smaller than the variation between models.

Coulomb interaction

I implemented a Coulomb interaction between atoms, scaled by erf(alpha*r) to reduce the effect at short distances. I haven't yet implemented any way for the model to predict charges, so this was limited to fixed, pre-assigned partial charges. I tried two different sets of charges: formal charges, which give only a very approximate picture of how charge is distributed in the molecule, and MMFF94 charges, which give a more realistic distribution while still being conformation-independent and fast to compute.

Both made the model worse. Formal charges only hurt accuracy a little bit. MMFF94 charges hurt it more.
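
For concreteness, here is a minimal sketch of that erf-damped Coulomb term with fixed pre-assigned charges. This is an illustration rather than the TorchMD-Net implementation itself; `alpha` (in 1/nm) is a free damping parameter.

```python
import torch

COULOMB_KJ_NM = 138.935  # Coulomb constant in kJ*nm/(mol*e^2)

def damped_coulomb_energy(positions, charges, alpha=10.0):
    """positions: (N, 3) tensor in nm; charges: (N,) tensor in units of e."""
    i, j = torch.triu_indices(len(charges), len(charges), offset=1)
    r = torch.linalg.norm(positions[i] - positions[j], dim=-1)
    pair = COULOMB_KJ_NM * charges[i] * charges[j] * torch.erf(alpha * r) / r
    return pair.sum()  # energy in kJ/mol; the erf factor damps the short-range divergence
```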

jchodera commented 1 year ago

Thanks for the reports!

Can you link to the implementations of these models in a PR somewhere?

An important consideration is to keep the potential free from singularities and ensure C2 continuity. Are we careful to select priors with both properties? (The energy can be large at r=0, but singularities cause significant problems for many sampling strategies and alchemical methods. C2 continuity is required for molecular dynamics integrators to provide some guarantee of accuracy.)

In terms of other potential physical priors I've seen used, this paper uses GFN2-xTB as a reference model, with the ML model predicting the difference between reference and QM (delta-learning). While it would take a good amount of effort to implement GFN2-xTB, the xtb Python package would make it straightforward to precompute the GFN2-xTB contributions for each snapshot and assess how well this method would work for accuracy; dynamics simulation would be more difficult to check.
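
A rough sketch of that precomputation step with the xtb-python bindings is below. The API names follow that package's documented Calculator interface and should be checked against the installed version; positions are expected in Bohr, and results come back in Hartree and Hartree/Bohr.

```python
import numpy as np
from xtb.interface import Calculator
from xtb.utils import get_method

def gfn2_reference(atomic_numbers, positions_bohr, total_charge=0):
    calc = Calculator(get_method("GFN2-xTB"), np.asarray(atomic_numbers),
                      np.asarray(positions_bohr), charge=total_charge)
    res = calc.singlepoint()
    return res.get_energy(), res.get_gradient()   # Hartree, Hartree/Bohr
```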

For charges, I still really like the Behler charge model that provides a way to treat both polarization and charge transfer. Would it be possible to try something like this without too much added complexity?

peastman commented 1 year ago

All implementations are in TorchMD-Net (though the PR with the Coulomb implementation hasn't been merged yet). The dispersion and Coulomb priors are both free from singularities. As we've discussed several times already, the repulsion prior is singular at r=0 because that's an accurate description of the physical system. The repulsion between two atoms really does go to infinity as they approach each other.

peastman commented 1 year ago

this paper uses GFN2-xTB as a reference model

There are a number of models of that sort, that start with a semi-empirical method and train a neural network to make it more accurate. Those models are quite interesting, but they're useful for different applications. Even the fastest semi-empirical models are far slower than the sorts of ML models I've been looking at. They aren't practical for MD simulations of the sort we want to run. Think of them as a faster alternative to DFT instead of a more accurate alternative to force fields.

Would it be possible to try something like this without too much added complexity?

It's on my list of things to try. We first need architectural changes to TorchMD-Net to enable priors to depend on outputs of the model.

jchodera commented 1 year ago

the repulsion prior is singular at r=0 because that's an accurate description of the physical system.

These models are not trained on any data close to r=0 and are not at all intended to or capable of modeling the correct energetics near r=0. There is no reason they need to be singular at r=0, since this regime is far from the regime in which these models are valid or accurate. Unmodified simulations will not visit this regime, but enhanced sampling techniques and alchemical modifications will certainly visit these regimes, where complicated and inconvenient modifications would be needed if the potential is singular at r=0. My point is that this can be avoided by simply using a functional form free of singularities to model this region that is outside the domain of applicability of our model.

There are a number of models of that sort, that start with a semi-empirical method and train a neural network to make it more accurate. Those models are quite interesting, but they're useful for different applications. Even the fastest semi-empirical models are far slower than the sorts of ML models I've been looking at. They aren't practical for MD simulations of the sort we want to run. Think of them as a faster alternative to DFT instead of a more accurate alternative to force fields.

Is the main issue the energy computation by variational optimization or dominant eigenvalue computation, or the integral computations? I keep wondering if ultra-simple models---like extended Hückel models with distance-dependent matrix elements---might be sufficient for our purposes in a physical reference potential.

peastman commented 1 year ago

These models are not trained on any data close to r=0

That's the reason for using a physical prior. The ZBL potential was parametrized based on scattering data for a variety of elements. It gives an excellent match to experiment down to very short distances.

My point is that this can be avoided by simply using a functional form free of singularities to model this region that is outside the domain of applicability of our model.

You can of course create a soft-core version of any potential, if that's required for a particular application. Right now my goal is to create the most physically accurate model I can, including behaving physically at short distances.

Is the main issue the energy computation by variational optimization or dominant eigenvalue computation, or the integral computations?

No idea. I've never worked on an implementation of a semi-empirical method.

jchodera commented 1 year ago

It gives an excellent match to experiment down to very short distances.

I might have missed this in my reading, but the ZBL model in TorchMD-Net implements an approximation that was fit sometime before 1982 to calculations using the pre-1977 free-electron method for interatomic potentials, which is itself an approximation to earlier multiconfiguration self-consistent field quantum chemical methods. That doesn't mean it isn't useful, but its accuracy in that regime cannot simply be assumed.

You can of course create a soft-core version of any potential, if that's required for a particular application. Right now my goal is to create the most physically accurate model I can, including behaving physically at short distances.

Let's align on the domain of applicability and intended applications at our next in-person meeting. I have a feeling we've diverged here, and would like to better understand the range of applications you are envisioning where accurate energetic behavior near r=0 will be meaningful and how you're intending to validate the behavior for those applications.

It's critical to remember that these are all models, and the important thing about models is how useful and predictive they are in the domains and applications of interest.

No idea. I've never worked on an implementation of a semi-empirical method.

There's a brief description of the Extended Hückel Method (EHM) in this paper by Oles Isayev, though they do the opposite of what we're considering: use ML to predict the parameters for the EHM Hamiltonian matrix, and then diagonalize to get orbital energies.

peastman commented 1 year ago

That doesn't mean it isn't useful, but its accuracy in that regime cannot simply be assumed.

Here is something we can say with complete confidence: any potential that doesn't diverge as r approaches zero is not physically accurate. Actually, the physics in that limit is exceptionally simple. The electrons no longer provide any meaningful screening, so it's just the Coulomb repulsion between the bare nuclei. Any interaction that does something different in that limit is unphysical.

There's a brief description of the Extended Hückel Method (EHM) in this paper by Oles Isayev,

There have been several models of that sort published. See also

https://aip.scitation.org/doi/10.1063/5.0021955
https://www.pnas.org/doi/full/10.1073/pnas.2120333119

jchodera commented 1 year ago

Here is something we can say with complete confidence: any potential that doesn't diverge as r approaches zero is not physically accurate. Actually, the physics in that limit is exceptionally simple. The electrons no longer provide any meaningful screening, so it's just the Coulomb repulsion between the bare nuclei. Any interaction that does something different in that limit is unphysical.

Except nuclei fuse, and doing so releases a large quantity of energy rather than consuming infinite energy as you suggest. This model breaks down in describing what happens when you actually put two nuclei on top of each other. It simply isn't meant to describe that regime.

Fortunately, none of our application domains or use cases requires we accurately describe what happens near r=0, as I previously noted. We are in the business of making useful models for biomolecular modeling and simulation and drug discovery.

It certainly is a good step forward to incorporate a physical model that better describes asymptotic behavior (though we should allow the parameters to be fit as well). But the point is that we should be favoring utility in our domains and use cases of interest over accuracy or asymptotic behavior in domains far outside our domain of applicability and use cases.

peastman commented 1 year ago

Yeah, I'm not worried about simulating nuclear fusion. :) We can leave the strong force out of our models. It will still be accurate down to displacements of about $10^{-6}$ nm.

Like I said before, once you create a physically accurate model it's easy to modify it to make a soft core version. For example, replace $\frac{1}{r}$ by $\frac{1}{r+\epsilon}$, and you're free to pick whatever value of $\epsilon$ is best for your application. Or multiply by $\mathrm{erf}(\alpha r)$ where $\alpha$ sets the length over which it's softened. But if you start by creating a soft-core version, it's harder to go the other way.
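
A minimal illustration of those two softening schemes, applied to a generic $1/r$ term; `epsilon` (nm) and `alpha` (1/nm) are free parameters to be chosen per application.

```python
import torch

def shifted_softcore(r, epsilon=0.01):
    return 1.0 / (r + epsilon)        # finite value 1/epsilon at r = 0

def erf_softcore(r, alpha=10.0):
    # analytic limit 2*alpha/sqrt(pi) as r -> 0; clamp r if evaluating at exactly 0
    return torch.erf(alpha * r) / r
```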

peastman commented 1 year ago

I've been running some experiments using delta learning with a semi-empirical reference. I compute the forces and energies with GFN2-xTB, then train an ET model on the difference between that and DFT. This makes the residual forces and energies much smaller, and training goes a lot faster. But the final model consistently ends up being a worse fit than what we get directly learning the DFT data.
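
For clarity, the delta-learning targets are just the residuals between the DFT labels and the GFN2-xTB reference. A sketch, assuming xTB outputs in Hartree and Hartree/Bohr and training labels in kJ/mol and kJ/mol/nm:

```python
HARTREE_TO_KJ_MOL = 2625.4996
BOHR_TO_NM = 0.052917721

def delta_targets(e_dft, f_dft, e_xtb_hartree, grad_xtb):
    e_ref = e_xtb_hartree * HARTREE_TO_KJ_MOL
    f_ref = -grad_xtb * (HARTREE_TO_KJ_MOL / BOHR_TO_NM)  # force = -gradient
    return e_dft - e_ref, f_dft - f_ref                   # residual energy and forces
```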

Here's another interesting model based on a semi-empirical method: https://arxiv.org/abs/2210.11682. It uses DFTB, representing the interaction functions as splines that are learned from data. The lead author is now in Tom's group. From discussions with him, it's not quite ready to use yet (it only supports H, C, N, and O), but it will hopefully get there soon.

jchodera commented 1 year ago

I've always wondered if an absolutely minimal semiempirical method like NDDO or MNDO could be sufficient. I know these methods require some integrals, but they seem feasible to speed up. There are a number of semiempirical methods implemented within scine sparrow, for example.

I'm not sure if I understand DFTB well enough to know if it is a suitable approach for a reference potential.

peastman commented 1 year ago

https://github.com/lanl/PYSEQM has a PyTorch implementation of MNDO. Also AM1 and PM3.

jthorton commented 1 year ago

But the final model consistently ends up being a worse fit than what we get directly learning the DFT data.

Do you use any descriptors calculated at the GFN2 level as input to the model as well? OrbNet Denali seems to be a good model that uses this approach and works for charged molecules; it might be good to compare with what they report in the paper.

peastman commented 1 year ago

No, it was just straightforward delta learning, mostly a proof of concept to see what would happen. Using descriptors from GFN as inputs to the model would be more complicated, because we would need to be able to compute gradients of those descriptors.