openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Energy preprocessing for model training #104

Closed. TommyDzh closed this 1 month ago.

TommyDzh commented 2 months ago

Thank you for providing such a large dataset covering a large proportion of chemical space. I am trying to train an EGNN (EquiformerV2) on the PubChem subset of SPICE (SPICE_PubChem_Set_1_Single_Points_Dataset_v1.3). However, I find that the formation energies cover a very wide range (as shown below; I converted the units from hartree to kcal/mol), and the model has trouble converging during training. How should I preprocess the energies for stable training? Currently, I standardize the conformation energies.

[image: distribution of formation energies (kcal/mol)]
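Roughly, my current preprocessing looks like this (a minimal sketch; the file name and the HDF5 field name are illustrative assumptions, not necessarily exactly what I use):

```python
import h5py
import numpy as np

HARTREE_TO_KCAL_PER_MOL = 627.509474  # standard conversion factor

# The file name and the "formation_energy" field are assumptions about the
# HDF5 layout; adjust them to match the file actually downloaded.
energies = []
with h5py.File("SPICE_PubChem_Set_1.hdf5", "r") as f:
    for name in f:
        energies.append(np.asarray(f[name]["formation_energy"]))
energies_kcal = np.concatenate(energies) * HARTREE_TO_KCAL_PER_MOL

# Current preprocessing: standardize over the training set.
mean, std = energies_kcal.mean(), energies_kcal.std()
energies_standardized = (energies_kcal - mean) / std
```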

peastman commented 2 months ago

The subsets are not intended to be used individually. To train a general purpose model that can simulate arbitrary molecules over a large area of chemical space, you need a lot of data. If you only use a tiny fraction of the available data, it's unlikely to learn successfully.

A lot of what you're seeing in that graph is variation in size. Some molecules have a lot more atoms than others, and those have correspondingly larger formation energies.
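As a rough check of the size effect, you can compare the spread of total formation energies with per-atom ones; a minimal sketch (file and field names are assumptions about the HDF5 layout):

```python
import h5py
import numpy as np

# Rough check of the size effect: per-atom formation energies have a much
# narrower spread than total formation energies. File name and field names
# ("formation_energy", "atomic_numbers") are assumptions about the HDF5 layout.
totals, per_atom = [], []
with h5py.File("SPICE_PubChem_Set_1.hdf5", "r") as f:
    for name in f:
        e = np.asarray(f[name]["formation_energy"])  # one value per conformation
        n_atoms = len(f[name]["atomic_numbers"])
        totals.append(e)
        per_atom.append(e / n_atoms)

totals = np.concatenate(totals)
per_atom = np.concatenate(per_atom)
print(f"total energies:    std = {totals.std():.3f} hartree")
print(f"per-atom energies: std = {per_atom.std():.3f} hartree")
```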

TommyDzh commented 2 months ago

Thank you for your reply. I still have two questions:

  1. "The subsets are not intended to be used individually." for subsets you mean SPICE_PubChem_Set_1_Single_Points_Dataset_v1.3 or the whole PubChem subset? I wonder why can't I use SPICE_PubChem_Set_1 to train and test model performance. I think I can use it as a mixture of MD17.
  2. As shown in spice-models, we can just use formation energies as the target energy of the model. Considering the the formation energies vary in a large scale, do your recommend standaring the formation energies for model training? If so, as there are many Subsets, should I standarize within each subset and use different prediction head for each subset, or standarize all the subsets as a whole?
peastman commented 2 months ago

The PubChem molecules were split into multiple datasets just for computational convenience. The version of QCFractal we were using at the time would run out of memory if a dataset was too large, so it had to be split up.

The subsets (PubChem, DES370K, etc.) are intended to be complementary. Each one was included to provide a certain type of information, with the goal that when all of them were combined, they would together have enough information to define a general potential function.

I'm not sure what you mean by standardizing the formation energies. They are what they are. They cover a big range, but that's because real energies of real molecules cover a big range. A good model should be able to reproduce it. Most models produce per-atom energies that are summed to get the total energy, so scaling with the number of atoms happens automatically.
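For example, the general pattern is something like the following minimal PyTorch sketch (not the architecture of any particular model; a real model would also condition the per-atom energies on atomic positions):

```python
import torch
import torch.nn as nn

class PerAtomEnergyModel(nn.Module):
    """Minimal sketch: per-atom energies summed into a molecular energy.
    This illustrates the general pattern, not any specific published model."""

    def __init__(self, num_elements=100, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(num_elements, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, atomic_numbers, batch_index):
        # atomic_numbers: (n_atoms,) element of each atom in the batch
        # batch_index:    (n_atoms,) which molecule each atom belongs to
        e_atom = self.mlp(self.embed(atomic_numbers)).squeeze(-1)  # (n_atoms,)
        n_mol = int(batch_index.max()) + 1
        e_mol = torch.zeros(n_mol, dtype=e_atom.dtype).index_add_(0, batch_index, e_atom)
        return e_mol  # total energy per molecule; scales with atom count automatically

# Toy usage: two molecules with 3 and 2 atoms.
model = PerAtomEnergyModel()
z = torch.tensor([6, 1, 1, 8, 1])
batch = torch.tensor([0, 0, 0, 1, 1])
print(model(z, batch))  # tensor of shape (2,)
```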

TommyDzh commented 2 months ago

Here standardizing means computing the mean and standard deviation of the formation energies in the training set; the model prediction $y_{pred}$ is then rescaled as $y_{pred} \cdot \mathrm{std} + \mathrm{mean}$ before being compared to the ground-truth formation energy $y_{form}$. I have seen that OC20 adopts this kind of standardization for its energy targets.
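Concretely, something along these lines (a minimal sketch; the numbers are placeholders, not real SPICE energies):

```python
import numpy as np

def fit_standardizer(train_energies):
    """Compute mean and std of the training-set formation energies."""
    return float(np.mean(train_energies)), float(np.std(train_energies))

def to_model_target(y_form, mean, std):
    """What the network is trained to predict."""
    return (y_form - mean) / std

def from_model_output(y_pred, mean, std):
    """Undo the scaling before comparing with the ground-truth formation energy."""
    return y_pred * std + mean

# Toy usage with placeholder numbers (not real SPICE energies):
train = np.array([-120.0, -85.0, -300.0, -42.0])
mean, std = fit_standardizer(train)
targets = to_model_target(train, mean, std)
print(from_model_output(targets, mean, std))  # recovers the original energies
```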

peastman commented 2 months ago

You can try, but I doubt it will be useful in this case.