More data to improve accuracy

peastman commented 2 years ago

A goal for version 2 is to improve the accuracy of trained models by generating additional data. This can be done in a few ways.

The simplest approach is to train a model, select maybe the 1000 molecules with the largest errors, and randomly generate more conformations for them. That would be simple to do and would likely help significantly.
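As a rough sketch of that selection step, assuming per-conformation predictions and reference energies are available as simple tuples (the record layout, the function name `worst_molecules`, and the example values are all hypothetical, not part of the dataset tooling):

```python
from collections import defaultdict

def worst_molecules(records, n=1000):
    """Return the n molecule names with the largest mean absolute error.

    records: iterable of (molecule_name, predicted_energy, reference_energy),
    one entry per conformation. This layout is an assumption for illustration.
    """
    abs_errors = defaultdict(list)
    for name, predicted, reference in records:
        abs_errors[name].append(abs(predicted - reference))
    # Average the absolute errors over all conformations of each molecule.
    mae = {name: sum(errs) / len(errs) for name, errs in abs_errors.items()}
    # Sort molecule names by MAE, largest first, and keep the top n.
    return sorted(mae, key=mae.get, reverse=True)[:n]

# Made-up numbers, only to show the call.
records = [
    ("water", -76.40, -76.41),
    ("water", -76.38, -76.40),
    ("allene", -116.50, -116.90),
    ("allene", -116.70, -116.95),
    ("methane", -40.50, -40.52),
]
print(worst_molecules(records, n=2))
```

The returned names would then be the ones to generate extra conformations for.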

A more sophisticated approach would be to use active learning of some sort, such as the method described in https://www.nature.com/articles/s41467-021-25342-8. It trains multiple models, then searches for the conformations on which the models disagree with each other the most. Those conformations should be especially informative when added to the dataset. It's also possible we could select the molecules with the largest disagreements between models rather than the ones with the largest errors. I'm not sure which would be better.
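A minimal sketch of the disagreement idea, assuming we already have a pool of candidate conformations and the energies several independently trained models predict for each. This only ranks an existing pool by the spread of the predictions; it is not the search procedure from the linked paper, and the names `rank_by_disagreement` and `predictions` are made up for illustration:

```python
from statistics import pstdev

def rank_by_disagreement(predictions):
    """Order conformation ids by how much the models disagree, largest first.

    predictions: dict mapping a conformation id to a list of predicted
    energies, one per model (hypothetical layout).
    """
    # Population standard deviation across models as the disagreement measure.
    spread = {conf: pstdev(values) for conf, values in predictions.items()}
    return sorted(spread, key=spread.get, reverse=True)

# Made-up predictions from three models.
predictions = {
    "conf_a": [1.00, 1.01, 0.99],
    "conf_b": [2.0, 2.5, 1.5],
    "conf_c": [0.5, 0.6, 0.4],
}
print(rank_by_disagreement(predictions))
```

Averaging the spread over all conformations of a molecule would give the per-molecule variant mentioned above.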

Getting still more complicated, we could try to intelligently add more molecules, not just more conformations for existing molecules. We would need to somehow figure out what new molecules would be especially useful to add. I'm not sure how this would be done. Perhaps train a neural network to predict MAE from the ECFP4 fingerprint, and look for new molecules it predicts would have large errors? But it might just end up identifying molecules that are as similar as possible to existing ones with large errors.
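To make the concern concrete, here is a small sketch that swaps the neural network for a much simpler stand-in: a similarity-weighted nearest-neighbour estimate over fingerprints. Fingerprints are represented as plain sets of on-bit indices (in practice these might come from something like a radius-2 Morgan fingerprint, but that is an assumption), and `tanimoto`, `predict_mae`, and all the numbers are hypothetical:

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def predict_mae(candidate, training, k=2):
    """Similarity-weighted mean MAE of the k most similar training molecules.

    training: list of (fingerprint_set, observed_mae) pairs.
    Returns None when the candidate shares no bits with its neighbours.
    """
    scored = sorted(
        ((tanimoto(candidate, fp), mae) for fp, mae in training),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0:
        return None  # nothing similar in the training set, so no estimate
    return sum(sim * mae for sim, mae in scored) / total

# Made-up fingerprints and errors.
training = [
    (frozenset({1, 2, 3, 4}), 0.30),
    (frozenset({1, 2, 3, 9}), 0.25),
    (frozenset({10, 11, 12}), 0.02),
    (frozenset({10, 11, 13}), 0.03),
]
print(predict_mae({1, 2, 3, 5}, training))     # resembles the high-error molecules
print(predict_mae({10, 11, 14}, training))     # resembles the low-error molecules
print(predict_mae({20, 21, 22}, training))     # unlike anything seen so far
```

With a predictor like this, a candidate only scores high when it looks like molecules that already have large errors, and a genuinely novel molecule gets no estimate at all. A neural network may generalize better than this stand-in, but the same bias is plausible.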

jchodera commented 2 years ago

Is it possible to first articulate a more precise statement of the goal?

"Improve the accuracy of trained models by additional data" is ambiguous in a number of ways that matter. Can we be more specific about the distribution of molecules (or heterogeneous systems) and conformations/energies/temperatures over which we are interested in improving accuracy?

If we can answer this question a bit more precisely, we will be able to more directly address the goal.

peastman commented 2 years ago

I'm talking specifically about accuracy on the existing range of molecules and conformations. Extending the range to cover more molecules or higher energies is a separate goal that will be addressed through different methods.

As an example, when I was testing out models trained on an early subset of the data, I found they had terrible accuracy on some simple pairs of monatomic ions. This didn't seem to be an intrinsic problem with the architecture, which should have been able to easily handle those cases. The models just didn't have very much data on some ions. So I added another data subset to better sample ion pairs, and it produced a huge improvement.

Those are the cases I'm talking about improving: things that are already present in the dataset, but the amount of data is too small for it to learn them effectively. For example, the list of molecules with the largest errors is dominated by ones containing the motif =C=. Is there something about the model architecture that makes it incapable of modelling that case accurately? Maybe, but I doubt it. More likely it's just because we only have a few thousand conformations containing that motif in the entire dataset. If so, adding more conformations should improve accuracy on the existing ones.

jthorton commented 2 years ago

Not sure if this is in scope, but one reason I am particularly interested in these models is their ability to rapidly parameterise molecular mechanics force fields on the fly with bespoke terms. It would be interesting to see how well the current data allows a model to learn the torsion potential energy surface for some molecules, and whether some other sampling method should be used to include more data about this part of the energy surface. For ANI-2x they used an active learning method, which could be incorporated into the method proposed above.

jchodera commented 1 year ago

Thanks for the clarification!

Is it clear whether the issue is that we need more conformations for the existing chemical species, or more chemical species with poorly represented chemical moieties (which would also provide more conformations for that moiety)? Do we need to do any experiments to determine which is more valuable?


peastman commented 1 year ago

Experiments are definitely required. #45 is related to this. The process I'm following is 1) train models, 2) identify flaws in those models, and 3) determine how to improve them. For the first round of trained models, I concluded the most serious flaws are due to limitations of the model architecture, not the training data. So right now I'm working to improve the architecture. Then I'll train new models and repeat. When/if we find flaws that seem like they're best addressed by adding more training data, we can evaluate what data would be most useful for the purpose.

openmm / spice-dataset

More data to improve accuracy #35