Closed TommyDzh closed 6 months ago
The subsets are not intended to be used individually. To train a general purpose model that can simulate arbitrary molecules over a large area of chemical space, you need a lot of data. If you only use a tiny fraction of the available data, it's unlikely to learn successfully.
A lot of what you're seeing in that graph is the variation in size. Some molecules have a lot more atoms than others, and those ones have larger formation energies.
Thank you for your reply. I still have two questions:
The PubChem molecules were split into multiple datasets just for computational convenience. The version of QCFractal we were using at the time would run out of memory if a dataset was too large, so it had to be split up.
The subsets (PubChem, DES370K, etc.) are intended to be complementary. Each one was included to provide a certain type of information, with the goal that when all of them were combined, they would together have enough information to define a general potential function.
I'm not sure what you mean by standardizing the formation energies. They are what they are. They cover a big range, but that's because real energies of real molecules cover a big range. A good model should be able to reproduce it. Most models produce per-atom energies that are summed to get the total energy, so scaling with the number of atoms happens automatically.
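To illustrate the point about per-atom energies, here is a minimal NumPy sketch, not tied to any particular model. The function name `total_energies` and the toy numbers are made up for illustration. It only shows that summing per-atom contributions makes the total grow with molecule size without any extra scaling step.

```python
import numpy as np

def total_energies(per_atom_energy, molecule_index, n_molecules):
    """Sum per-atom energy contributions into one total per molecule.

    per_atom_energy: 1D array, one predicted contribution per atom.
    molecule_index:  1D int array giving the molecule each atom belongs to.
    """
    return np.bincount(molecule_index, weights=per_atom_energy,
                       minlength=n_molecules)

# Toy batch (hypothetical values): molecule 0 has 3 atoms, molecule 1 has 2.
per_atom = np.array([-1.0, -1.0, -1.0, -2.0, -2.0])
mol_idx = np.array([0, 0, 0, 1, 1])

totals = total_energies(per_atom, mol_idx, n_molecules=2)
print(totals)
```

A molecule with more atoms contributes more terms to the sum, so the size dependence of the total energy falls out of the architecture.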
Here, standardizing means computing the mean and standard deviation (std) of the formation energies of the molecules in the training set. The model predictions $y_{pred}$ are then compared to the ground-truth formation energies $y_{form}$ after applying $y_{pred} \cdot \mathrm{std} + \mathrm{mean}$. I have seen that OC20 adopts standardization for its energy prediction.
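The standardization described here can be sketched as follows. This is only an assumed minimal version in NumPy. The helper names (`fit_standardizer`, `standardize`, `destandardize`) and the energy values are hypothetical, and it is not the OC20 or SPICE code.

```python
import numpy as np

def fit_standardizer(train_energies):
    """Compute mean and std from the training-set energies only."""
    e = np.asarray(train_energies, dtype=float)
    return e.mean(), e.std()

def standardize(energies, mean, std):
    """Map raw energies to zero-mean, unit-variance training targets."""
    return (np.asarray(energies, dtype=float) - mean) / std

def destandardize(y_pred, mean, std):
    """Map model outputs back to energy units: y_pred * std + mean."""
    return np.asarray(y_pred, dtype=float) * std + mean

# Hypothetical formation energies in kcal/mol.
train = np.array([-100.0, -250.0, -400.0, -50.0])
mean, std = fit_standardizer(train)

targets = standardize(train, mean, std)          # what the model is trained on
recovered = destandardize(targets, mean, std)    # compared against y_form
```

The statistics come from the training set alone and are reused unchanged for validation and test data.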
You can try, but I doubt it will be useful in this case.
Thank you for providing such a large dataset covering a large proportion of chemical space. I am trying to train an EGNN (EquiformerV2) on the PubChem subset of SPICE (SPICE_PubChem_Set_1_Single_Points_Dataset_v1.3). However, I find that the formation energies cover a wide range (as shown below; I converted the units from hartree to kcal/mol), and the model has trouble converging during training. I wonder how I should preprocess the energies for stable training. Currently, I standardize the conformation energies.