Additions for version 2

peastman commented 1 year ago

I've been talking with lots of people about what would be most useful to add in version 2 of SPICE. I've gotten lots of great suggestions. Here are some ideas in no particular order.

Solvated PubChem Molecules. This would provide data on ligand-water interactions. They could be generated the same way we did the solvated amino acids: simulate the molecule in a box of water, periodically saving conformations along with ~20 of the nearest water molecules. These will be largish, and therefore expensive to compute. We'll probably need to limit the number of molecules we include and the number of conformations for each one. I think this is a case where a little data can go a long way, since each conformation has many water molecules.
Water Clusters. This would provide some conformations that are more representative of bulk water. How large should the clusters be? What's the best way to generate them? I've been advised that classical water models often don't produce very realistic water structure.
PubChem Molecules with Boron (and possibly Silicon). I excluded these elements from the original PubChem set. They only appear in a tiny fraction of molecules, so there wouldn't have been enough data on them to be useful. But if we specifically select molecules that include them, I think we could assemble a useful amount of data. There are around 1500 molecules with B and around 1900 with Si.
Amino Acid / Small Molecule Pairs. This would provide more data on protein-ligand interactions. One way of generating them would be to use PDB structures. We could identify which amino acids contact the ligand and generate conformations for them, each conformation including the ligand plus a single amino acid.
Transition States. If you're interested in rates, it's important to model transition states accurately. Because they're higher energy, they won't be very well sampled in a thermal distribution. Can we add conformations specifically to sample transition states? One suggestion was to scan dihedral rotations while minimizing all other degrees of freedom. What molecules would it be useful to do this with? Of course, not all transition states involve dihedral rotations. What else could we do?
Metals. The goal would be to allow simulating metalloproteins. These could also be generated from PDB structures, taking the amino acids that interact with the metal. Would it be sufficient to consider only one amino acid at a time, or do you really need to include the whole environment? If the latter, these could be quite expensive to calculate.
Active Learning. This is the idea discussed in #35, to add more conformations for existing molecules to improve the accuracy of trained models.
More PubChem Molecules. More chemical diversity is always good. The number included in version 1 was mostly arbitrary, just based on how long it took to do the calculations. It would be easy to add more.
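
As a rough illustration of the "nearest water molecules" selection mentioned for the solvated sets, here is a minimal sketch. It is not the script used for SPICE: the function name, the use of water oxygen positions as the distance reference, and the toy coordinates are all assumptions, and periodic boundary conditions are ignored for brevity.

```python
import math

def nearest_waters(solute_atoms, water_oxygens, n_waters=20):
    """Return indices of the n_waters waters closest to the solute.

    A water's distance is taken as the minimum distance from its oxygen
    to any solute atom (an assumed criterion, not taken from the thread).
    """
    def min_dist(oxygen):
        return min(math.dist(oxygen, atom) for atom in solute_atoms)

    # Rank water indices by their distance to the solute, closest first.
    ranked = sorted(range(len(water_oxygens)),
                    key=lambda i: min_dist(water_oxygens[i]))
    return ranked[:n_waters]

# Toy example: a two-atom "solute" and four water oxygens (made-up coordinates).
solute = [(0.0, 0.0, 0.0), (0.15, 0.0, 0.0)]
oxygens = [(1.0, 0.0, 0.0), (0.3, 0.0, 0.0), (0.0, 0.5, 0.0), (2.0, 2.0, 2.0)]
selected = nearest_waters(solute, oxygens, n_waters=2)
```

In a real trajectory the distances would need the minimum-image convention, and the hydrogens of each selected water would be kept along with its oxygen.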

peastman commented 1 year ago

That's not bad then. We're probably looking at under 100,000 molecules. I suggest we use a sequence of two scripts. The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file. Once we see how many molecules we're dealing with, we can decide how many conformations to include for each one. A second script will read in the first file, generate the conformations, and create the HDF5 file.

I can create a draft PR with an outline of the first script, if that sounds reasonable.
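
To make the hand-off between the two scripts concrete, here is a minimal sketch of the intermediate SMILES file. Everything in it is hypothetical: the function names, the penalty threshold, and the example states and penalty values are made up for illustration, and the state enumeration, conformation generation, and HDF5 writing are deliberately left out.

```python
import os
import tempfile

def write_filtered_smiles(states, max_penalty, path):
    """Stage 1 (sketch): keep states whose free energy penalty is at most
    max_penalty and write one SMILES string per line. Returns the count."""
    kept = [smiles for smiles, penalty in states if penalty <= max_penalty]
    with open(path, "w") as f:
        for smiles in kept:
            f.write(smiles + "\n")
    return len(kept)

def read_smiles(path):
    """Stage 2 entry point (sketch): read the SMILES file back in.
    Conformation generation and HDF5 output would follow from here."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

# Hypothetical (SMILES, penalty) pairs; the numbers are invented.
states = [("CC(=O)O", 0.0), ("CC(=O)[O-]", 1.2), ("CC(O)=[OH+]", 9.5)]
path = os.path.join(tempfile.mkdtemp(), "states.smi")
count = write_filtered_smiles(states, max_penalty=6.0, path=path)
molecules = read_smiles(path)
```

Knowing `count` after the first stage is what lets the number of conformations per molecule be chosen before the expensive second stage starts.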

jchodera commented 1 year ago

This approach sounds reasonable!

> The first one will generate the states, apply the free energy filter, and write out SMILES strings to a file.

Since this requires running Epik, do you want me to tackle this script, or did you want to create an outline and then I finish it up and run it?

peastman commented 1 year ago

I'll create an outline.

jchodera commented 1 year ago

Actually, your proposed two-stage approach is specifically what I wanted to avoid: we won't necessarily be able to have consistent conformers that just differ by a proton and then relax to more preferred geometries, which might make it hard to learn protonation state energies. I think it's still valuable, but we probably also want a set where we enumerate a few conformers for each molecule and then manipulate protonation states and run short optimization trajectories.

Perhaps these could be separate datasets?

Neutral Chemical Components Dictionary
Protonation/tautomer state variants (up to K each, up to L conformers) for the N most abundant CCD components
same, but short optimization trajectories from snapshots where we enumerate protonation/tautomer states on the same conformer

peastman commented 1 year ago

If we had infinite computational resources, that could be a reasonable approach. But we can only compute a very limited number of conformations. That means we need to choose them carefully so that every conformation adds as much information as possible. Computing two ~100 atom molecules that differ only in a single hydrogen, both in exactly the same conformation, makes very inefficient use of our resources. Most atoms will have nearly identical environments and forces in both of them. Choosing different conformations will contribute much more information content to the dataset.

peastman commented 1 year ago

I created a project board for SPICE 2: https://github.com/orgs/openmm/projects/2/views/1

peastman commented 1 year ago

What you're suggesting might work better as a separate dataset. We could create a collection of very small molecules, maybe around 10 atoms each, and exhaustively try all variations on each one. They'll be fast enough and the number of variations will be small enough that computation time won't be an issue. And for very small molecules, changing a single atom affects the environment of every other atom, so it will provide meaningful information for all of them.

jthorton commented 9 months ago

Inspired by the AIMNET2 dataset, we could also include molecules with As and Se from PubChem, although I have no idea how common they will be.

giadefa commented 9 months ago

It would be nice to have data to create a protein FF

jthorton commented 9 months ago

Splinter, a dataset of protein-ligand interactions. Inputs can be found here.

pavankum commented 9 months ago

> It would be nice to have data to create a protein FF

We had dipeptides and solvated amino acids in SPICE 1.0. In addition to that, single points on the final geometries from OpenFF's 2D torsion scans of (chi1, chi2) and (phi, psi) dihedral pairs of capped amino acid chains could be of interest. Here are some datasets created by Chapin, all optimized at the b3lyp-d3bj/dzvp level on a 24x24 2D grid. Simply re-evaluating the energies on the final geometries at the SPICE DFT level could be considered.

Some 1D torsion scans that might be of interest
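
For a sense of scale, the 24x24 grid mentioned above corresponds to 576 constrained geometries per dihedral pair. The sketch below enumerates such a grid; it assumes the grid spans the full 360° (so the spacing is 15°) and starts at -180°, neither of which is stated in the thread, and the function name is hypothetical.

```python
from itertools import product

def torsion_grid(points_per_axis=24, start=-180.0):
    """Enumerate (angle1, angle2) pairs, in degrees, on a regular 2D grid.

    Assumes the grid covers a full 360 degree period on each axis.
    """
    step = 360.0 / points_per_axis
    angles = [start + i * step for i in range(points_per_axis)]
    return list(product(angles, angles))

grid = torsion_grid()
n_single_points = len(grid)  # one single-point calculation per grid geometry
```

Each entry would map to one final optimized geometry, so re-evaluating a single scan at a different level of theory means `n_single_points` energy calculations.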

pavankum commented 9 months ago

There were also some RNA datasets created by Ken Takaba, which were used in training Espaloma.

peastman commented 9 months ago

SPICE 2 is almost finished. It will hopefully just take a few more weeks to complete the calculations. I've opened a new issue (#92) for SPICE 3. Let's move this discussion there.

jchodera commented 8 months ago

@peastman : I know you've already included ligand:amino acid pairs in #72, but I just came across this preprint that mentions two interesting datasets:

The Splinter dataset is a collection of approximately 1.7 million systematically generated protein-ligand fragment dimers and interaction energies computed using many-body SAPT based on a Hartree-Fock (HF) representation of monomers (i.e. SAPT0). These and all other SAPT computations carried out in this work are performed in an aug-cc-pV(D+d)Z basis set (abbreviated aDZ), which yields good error cancellation.
SAPT-PDB-13K: A diverse, realistic dataset of dimers. The 13,216 dimers in SAPT-PDB-13K consist of an entire ligand interacting with one or two capped amino acids. The protein and ligand geometries are taken from crystallographic Protein Data Bank (PDB) entries, making them meaningful and practical test cases

The SAPT-PDB-13K seems to be pending deposition somewhere: "Electronic Supplementary Information (ESI) available: SAPT0 interaction energies and Cartesian coordinates of the 13,216 validation set dimers and nine protein-ligand matched pairs. See DOI: 00.0000/00000000"

peastman commented 8 months ago

Can you post those on the thread for SPICE 3 (#92)? We're collecting ideas for it now.

jchodera commented 8 months ago

Done! (https://github.com/openmm/spice-dataset/issues/92#issuecomment-1963063432)

peastman commented 8 months ago

And the calculations for SPICE 2 are DONE! I'll take a few days to run some tests, and then I can hopefully release it next week.

peastman commented 8 months ago

And it is now released!

openmm / spice-dataset

Additions for version 2 #67