openmm / spice-dataset

A collection of QM data for training potential functions
MIT License
153 stars 9 forks

Additions for version 2 #67

Closed peastman closed 8 months ago

peastman commented 1 year ago

I've been talking with lots of people about what would be most useful to add in version 2 of SPICE. I've gotten lots of great suggestions. Here are some ideas in no particular order.

peastman commented 1 year ago

That's not bad then. We're probably looking at under 100,000 molecules. I suggest we use a sequence of two scripts. The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file. Once we see how many molecules we're dealing with, we can decide how many conformations to include for each one. A second script will read in the first file, generate the conformations, and create the HDF5 file.

I can create a draft PR with an outline of the first script, if that sounds reasonable.
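As a rough illustration of the first stage described above (filter the enumerated states, then write SMILES strings to a file for the second script to read), here is a minimal sketch. It does not call Epik; the input list, the penalty values, the 6.0 kcal/mol cutoff, and the function names are all hypothetical placeholders, not anything taken from the SPICE scripts.

```python
import tempfile
from pathlib import Path

# Hypothetical input: (SMILES, free energy penalty in kcal/mol) pairs, standing in
# for the output of a state-enumeration tool. The numbers are made up.
candidate_states = [
    ("CC(=O)O", 0.0),
    ("CC(=O)[O-]", 1.2),
    ("C[NH3+]", 0.4),
    ("CN", 7.5),
]


def filter_states(states, max_penalty):
    """Keep the SMILES of states whose penalty is at or below max_penalty."""
    return [smiles for smiles, penalty in states if penalty <= max_penalty]


def write_smiles(smiles_list, path):
    """Write one SMILES string per line."""
    Path(path).write_text("".join(s + "\n" for s in smiles_list))


def read_smiles(path):
    """Read the file back, as a second-stage script would."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "states.smi"
    kept = filter_states(candidate_states, max_penalty=6.0)  # assumed cutoff
    write_smiles(kept, path)
    round_trip = read_smiles(path)

print(len(kept), "states kept")
```

Splitting the work at a plain SMILES file means the molecule count can be inspected before deciding how many conformations to generate per molecule.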

jchodera commented 1 year ago

This approach sounds reasonable!

> The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file.

Since this requires running Epik, do you want me to tackle this script, or did you want to create an outline and then I finish it up and run it?

peastman commented 1 year ago

I'll create an outline.

jchodera commented 1 year ago

Actually, your proposed two-stage approach is specifically what I wanted to avoid: we won't necessarily have consistent conformers that differ only by a proton and then relax to more preferred geometries, which might make it hard to learn protonation state energies. I think it's still valuable, but we probably also want a set where we enumerate a few conformers for each molecule, then manipulate protonation states and run short optimization trajectories.

Perhaps these could be separate datasets?

peastman commented 1 year ago

If we had infinite computational resources, that could be a reasonable approach. But we can only compute a very limited number of conformations. That means we need to choose them carefully so that every conformation adds as much information as possible. Computing two ~100 atom molecules that differ only in a single hydrogen, both in exactly the same conformation, makes very inefficient use of our resources. Most atoms will have nearly identical environments and forces in both of them. Choosing different conformations will contribute much more information content to the dataset.

peastman commented 1 year ago

I created a project board for SPICE 2: https://github.com/orgs/openmm/projects/2/views/1

peastman commented 1 year ago

What you're suggesting might work better as a separate dataset. We could create a collection of very small molecules, maybe around 10 atoms each, and exhaustively try all variations on each one. They'll be fast enough and the number of variations will be small enough that computation time won't be an issue. And for very small molecules, changing a single atom affects the environment of every other atom, so it will provide meaningful information for all of them.
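To make "exhaustively try all variations" concrete, here is a minimal sketch that assumes each variation is a protonated/deprotonated toggle at a labeled site. The site labels and the function name are hypothetical; a real workflow would apply each pattern to an actual molecule with a cheminformatics toolkit.

```python
from itertools import product


def enumerate_variants(sites):
    """Yield every protonated (True) / deprotonated (False) pattern over the sites."""
    for pattern in product([True, False], repeat=len(sites)):
        yield dict(zip(sites, pattern))


sites = ["N1", "O2", "O3"]  # hypothetical titratable site labels
variants = list(enumerate_variants(sites))
print(len(variants), "variants for", len(sites), "sites")
```

The count grows as 2^k for k sites, which stays manageable only because the molecules are very small.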

jthorton commented 9 months ago

Inspired by the AIMNET2 dataset, we could also include molecules with As and Se from PubChem, although I have no idea how common they will be.

giadefa commented 9 months ago

It would be nice to have data to create a protein FF

jthorton commented 9 months ago

Splinter, a dataset of protein-ligand interactions. Inputs can be found here.

pavankum commented 9 months ago

> It would be nice to have data to create a protein FF

We had dipeptides and solvated amino acids in SPICE 1.0. In addition to that, single points on the final geometries from OpenFF's 2D torsion scans of the (chi1, chi2) and (phi, psi) dihedral pairs of capped amino acid chains could be of interest. Here are some datasets created by Chapin, all optimized at the b3lyp-d3bj/dzvp level on a 24x24 2D grid. Simply re-evaluating the energies on the final geometries at the SPICE DFT level could be considered.
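For a sense of scale, a 24x24 grid over a pair of dihedrals gives 576 geometries per scan. The sketch below assumes uniform spacing over the full 360 degrees (so 15 degree steps) and a start at -180 degrees; the comment above states only the grid size, so both are assumptions.

```python
from itertools import product

GRID_POINTS = 24
STEP = 360 // GRID_POINTS  # 15 degrees, assuming uniform spacing over 360 degrees

# Assumed convention: angles run from -180 up to (but not including) +180.
angles = [-180 + i * STEP for i in range(GRID_POINTS)]

# Every pair of dihedral values, e.g. (phi, psi), on the 2D grid.
grid = list(product(angles, angles))
print(len(grid), "grid points per 2D scan")
```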

Some 1D torsion scans that might be of interest

pavankum commented 9 months ago

There were also some RNA datasets created by Ken Takaba, which were used in training Espaloma.

peastman commented 9 months ago

SPICE 2 is almost finished. It will hopefully just take a few more weeks to complete the calculations. I've opened a new issue (#92) for SPICE 3. Let's move this discussion there.

jchodera commented 8 months ago

@peastman: I know you've already included ligand:amino acid pairs in #72, but I just came across this preprint that mentions two interesting datasets:

image

image

The SAPT-PDB-13K dataset seems to be pending deposition somewhere: "Electronic Supplementary Information (ESI) available: SAPT0 interaction energies and Cartesian coordinates of the 13,216 validation set dimers and nine protein-ligand matched pairs. See DOI: 00.0000/00000000"

peastman commented 8 months ago

Can you post those on the thread for SPICE 3 (#92)? We're collecting ideas for it now.

jchodera commented 8 months ago

Done! (https://github.com/openmm/spice-dataset/issues/92#issuecomment-1963063432)

peastman commented 8 months ago

And the calculations for SPICE 2 are DONE! I'll take a few days to run some tests, and then I can hopefully release it next week.

peastman commented 8 months ago

And it is now released!