openmm / spice-dataset

A collection of QM data for training potential functions
MIT License
133 stars · 6 forks

[WIP] Scripts to create pretraining dataset #65

Open peastman opened 1 year ago

peastman commented 1 year ago

This PR will have the scripts to generate the pretraining dataset discussed in #64. So far I've implemented the dipeptides subset. Let me know if this looks good. @giadefa I'd especially appreciate your feedback on what conformations to include, since you have experience on pretraining with large amounts of semi-empirical data.

The script only takes a few hours to run on my laptop. It generates about 310 MB of output data. I estimate the complete pretraining dataset will be around 10 GB, assuming we include the same molecules as the standard dataset and the same level of sampling for the other subsets.

giadefa commented 1 year ago

Get one conformation per molecule, and prefer more molecules while keeping the budget constant.

peastman commented 1 year ago

Get one conformation per molecule, and prefer more molecules while keeping the budget constant.

Computational budget isn't a problem. This method is super cheap. We can include more molecules and also lots of conformations per molecule.

Based on your experience, how large should it be, and how should we select the conformations?

giadefa commented 1 year ago

Generate conformers however you like (e.g. with RDKit), just one per molecule, and use more molecules.


peastman commented 1 year ago

Why just one? And again, how large should the dataset be?

giadefa commented 1 year ago

RDKit is not very good at generating more than one or two conformers. Given a fixed budget, it's better to have more molecules than more conformations. Realistically, training on more than 10M points starts to become problematic.


peastman commented 1 year ago

We don't rely on RDKit to generate the conformations, just the starting points for MD simulations.

jchodera commented 1 year ago

If you're going to run MD for generating conformations, we probably do want multiple overdispersed starting points in case crossing torsional barriers is difficult. If the RDKit conformers end up being too similar, this shouldn't be too much of a problem---it's like running more MD, especially if you allow some "burn-in" equilibration before collecting samples from each conformation.
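The N overdispersed starting points × burn-in × M snapshots scheme can be sketched with a toy 1-D torsional potential and Metropolis sampling. This is purely illustrative, not the PR's actual code: the potential, starting torsions, temperatures, and step size are all made-up placeholders.

```python
import math
import random

def torsion_energy(phi):
    # Toy 1-D torsional potential with two wells separated by a barrier
    # (a stand-in for a real force field; purely illustrative).
    return 2.5 * (1 + math.cos(2 * phi))

def sample_conformations(start_phi, temperature, n_burn_in=200, n_samples=10,
                         step=0.3, seed=0):
    """Metropolis sampling from one starting point, discarding burn-in."""
    rng = random.Random(seed)
    phi = start_phi
    samples = []
    for i in range(n_burn_in + n_samples):
        trial = phi + rng.uniform(-step, step)
        # Metropolis acceptance at reduced temperature kT.
        delta = torsion_energy(trial) - torsion_energy(phi)
        if rng.random() < math.exp(min(0.0, -delta / temperature)):
            phi = trial
        if i >= n_burn_in:
            samples.append(phi)
    return samples

# Several overdispersed starting points x several temperatures, mirroring
# the N starts x M snapshots/conformer layout discussed above.
starts = [0.0, math.pi / 2, math.pi]   # hypothetical starting torsions
temperatures = [0.5, 1.0, 2.0, 4.0]    # hypothetical reduced temperatures
dataset = [
    s
    for j, start in enumerate(starts)
    for k, t in enumerate(temperatures)
    for s in sample_conformations(start, t, seed=100 * j + k)
]
print(len(dataset))  # 3 starts x 4 temperatures x 10 snapshots = 120
```

The burn-in phase lets each chain relax away from its (possibly similar) starting point before snapshots are collected, which is the point jchodera makes above.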

I fear the only way to optimize the selection of N conformers x M snapshots/conformer is to train some models and assess generalization. There's no real a priori way to know what is optimal here, though there are probably reasonable lower bounds (N >= 3, M > 10?).

peastman commented 1 year ago

The current code asks RDKit to generate 10 conformers. Starting from each one, it runs MD to generate 10 conformations at each of four temperatures, for a total of 400 conformations per molecule.
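The RDKit side of that pipeline might look roughly like the sketch below. This is not the PR's actual script: the SMILES string and parameters are placeholders, and the per-conformer MD step (run elsewhere, e.g. with OpenMM) is omitted.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical example molecule (a small amide chosen as a placeholder).
mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)NC(C)C(=O)NC"))

# Ask RDKit for 10 embedded conformers to serve as MD starting points.
params = AllChem.ETKDGv3()
params.randomSeed = 12345
conformer_ids = AllChem.EmbedMultipleConfs(mol, numConfs=10, params=params)

for conf_id in conformer_ids:
    # Each conformer would seed an MD run producing 10 snapshots at each
    # of four temperatures (the MD step itself is not shown here).
    positions = mol.GetConformer(conf_id).GetPositions()
    print(conf_id, positions.shape)
```

With 10 starting conformers, 10 snapshots per run, and four temperatures, this yields the 10 × 10 × 4 = 400 conformations per molecule described above.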

@giadefa what is the problem with training on more than 10 million points? ANI-1 has 20 million, and people train on it all the time.