openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Pretraining dataset #64

Open · peastman opened this issue 1 year ago

peastman commented 1 year ago

I think there could be value in creating a separate dataset for pretraining. It would cover the same chemical space as the standard SPICE dataset, but have many more conformations and be computed at a much lower level of theory. The idea would be to pretrain your model on the large dataset, then fine tune it on the smaller, higher quality one.

This raises several questions.

For example, the current dipeptides and PubChem subsets include 50 conformations for each molecule: 25 high energy conformations sampled at 500K and 25 low energy ones that are partially energy minimized. For the pretraining dataset we might instead include 100 conformations at each of four temperatures: 100K, 300K, 500K, and 1000K. In place of DES370K we could use DES5M.
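As a rough sketch of what multi-temperature sampling could look like, here is one way to do it with OpenMM, assuming a System, Topology, and starting positions for the molecule have already been built (e.g. with openmmforcefields). The temperatures match the ones proposed above; the step count, friction, and time step are placeholder values, not the actual SPICE generation settings.

```python
import openmm
import openmm.app as app
import openmm.unit as unit

def sample_conformations(system, topology, positions, confs_per_temperature=100):
    """Run Langevin MD at several temperatures and collect snapshots."""
    conformations = []
    for temperature in (100, 300, 500, 1000):  # kelvin
        integrator = openmm.LangevinMiddleIntegrator(
            temperature * unit.kelvin, 1.0 / unit.picosecond, 0.001 * unit.picoseconds)
        simulation = app.Simulation(topology, system, integrator)
        simulation.context.setPositions(positions)
        simulation.minimizeEnergy()
        simulation.context.setVelocitiesToTemperature(temperature * unit.kelvin)
        for _ in range(confs_per_temperature):
            simulation.step(10000)  # decorrelate between saved snapshots
            state = simulation.context.getState(getPositions=True)
            conformations.append(state.getPositions(asNumpy=True))
    return conformations
```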

jchodera commented 1 year ago

This is a good idea, and will enable experiments to be run to assess the utility of different pretraining approaches.

> How large should the pretraining dataset be? I suggest roughly 10x the standard one.

A 10x dataset would certainly enable useful assessments of how pretraining affects data efficiency.

> What level of theory should it use? An obvious choice would be GFN2-xTB, since it's very fast (a fraction of a second for most calculations) and its accuracy is not too terrible.

QCEngine supports a number of semiempirical methods, though the host programs it drives would also have to be deployed in the QCFractal compute environment.

GFN2-xTB sounds like a good starting point.
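For reference, a GFN2-xTB single point can be driven through QCEngine with only a few lines once the xtb package is installed; this is just a sketch of the call pattern on a toy water geometry, not part of any agreed workflow.

```python
import qcengine
from qcelemental.models import AtomicInput, Molecule

# Toy water geometry in bohr; any molecule in the dataset would be handled the same way.
mol = Molecule(symbols=["O", "H", "H"],
               geometry=[0.0, 0.0, 0.0,
                         0.0, 0.0, 1.81,
                         1.75, 0.0, -0.45])

inp = AtomicInput(molecule=mol, driver="gradient", model={"method": "GFN2-xTB"})
result = qcengine.compute(inp, "xtb")

print(result.properties.return_energy)  # total energy in hartree
print(result.return_result)             # gradient in hartree/bohr
```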

> Should it include larger molecules than in the SPICE dataset? For example, longer peptides and drug molecules with more than 50 atoms.

If the goal is to do experiments on pretraining to improve data efficiency, it would be useful to draw molecules from the same distribution the SPICE molecules were drawn from. We could scale up along both dimensions: the number of molecules and the number of conformers per molecule.

> What results should it include? To keep the size manageable, I suggest energies, forces, and nothing else.

That seems reasonable. Does GFN2-xTB even support other properties?
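To make the "energies, forces, and nothing else" idea concrete, each molecule's group in the HDF5 file could hold just four arrays, as in the sketch below; the group and dataset names are illustrative, not the actual SPICE schema.

```python
import h5py
import numpy as np

def write_molecule(f, name, atomic_numbers, conformations, energies, gradients):
    """Store one molecule: conformations plus xTB energies and gradients, nothing else.

    Shapes: conformations and gradients (n_confs, n_atoms, 3), energies (n_confs,).
    """
    group = f.create_group(name)
    group.create_dataset("atomic_numbers", data=np.asarray(atomic_numbers, dtype=np.int16))
    group.create_dataset("conformations", data=np.asarray(conformations, dtype=np.float32))
    group.create_dataset("xtb_total_energy", data=np.asarray(energies, dtype=np.float64))
    group.create_dataset("xtb_total_gradient", data=np.asarray(gradients, dtype=np.float32))
```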

> How should the conformations be generated? In particular, should it include higher energy conformations than we currently have?

If the goal is to assess the impact of pretraining on data efficiency, using the same process to generate data would be useful.

If the goal is to assess other data-generation methods for utility and data efficiency, and to evaluate on different kinds of ensembles, it would be useful to generate a number of datasets at different temperatures, as you suggest, which could be used either separately or together.

If the goal is to scout which other datasets might be more useful (e.g. experimenting with training models at the same level of theory on different data subsets), it may also be of interest to generate data from different chemical spaces, such as those we previously identified as high value: for example, the PDB Chemical Components Dictionary or the Enamine Building Blocks (which are freely downloadable).

peastman commented 1 year ago

> QCEngine supports a number of semiempirical methods, though the host programs it drives would also have to be deployed in the QCFractal compute environment.

There's no need to use QCFractal for this. Running the xtb calculations takes less time than generating the conformations in the first place. It's simplest to just do everything at once in a single script.
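A sketch of that single-script flow, using RDKit's ETKDG embedding as a stand-in for the real conformer generation and xtb-python for the GFN2-xTB evaluations; the molecule, conformer count, and neutral-charge assumption are illustrative only.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from xtb.interface import Calculator, Param

ANGSTROM_TO_BOHR = 1.8897261258369282

# Generate conformations (here with ETKDG; MD sampling would be used in practice).
mol = Chem.AddHs(Chem.MolFromSmiles("CCO"))
AllChem.EmbedMultipleConfs(mol, numConfs=10, randomSeed=0)
numbers = np.array([atom.GetAtomicNum() for atom in mol.GetAtoms()])

# Evaluate each conformation with GFN2-xTB in the same script.
for conf in mol.GetConformers():
    positions = conf.GetPositions() * ANGSTROM_TO_BOHR  # xtb expects bohr
    calc = Calculator(Param.GFN2xTB, numbers, positions)
    res = calc.singlepoint()
    energy = res.get_energy()      # hartree
    gradient = res.get_gradient()  # hartree/bohr
    print(energy, gradient.shape)
```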

The goal is not to do any of the things you listed. The goal is to create a useful dataset that can be used in practice for pretraining models.

The more data you train on, the better your model ends up being. Ideally you should include some data for very rare, very high energy conformations. That reduces the risk of the model doing something strange as soon as it gets outside the range of typical conformations. Generating all that data with a high quality method would be very expensive. So instead you pretrain on data generated with a cheap method, then fine tune on a smaller amount of high quality data.
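In code, the intended workflow amounts to something like the PyTorch sketch below; the model, data loaders, loss weighting, and hyperparameters are placeholders rather than a recommended recipe.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs, lr):
    """Minimize a combined energy + force loss over one dataset."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for species, positions, energy, forces in loader:
            pred_energy, pred_forces = model(species, positions)
            loss = F.mse_loss(pred_energy, energy) + F.mse_loss(pred_forces, forces)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# `model`, `xtb_loader`, and `spice_loader` are assumed to be defined elsewhere.
# 1. Pretrain on the large, cheap GFN2-xTB dataset.
train(model, xtb_loader, epochs=10, lr=1e-3)
torch.save(model.state_dict(), "pretrained.pt")

# 2. Fine-tune on the smaller, higher-quality SPICE dataset, typically with a lower learning rate.
model.load_state_dict(torch.load("pretrained.pt"))
train(model, spice_loader, epochs=50, lr=1e-4)
```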