openmm / spice-dataset

A collection of QM data for training potential functions
MIT License

Ligand amino acid pairs #72

Closed peastman closed 1 year ago

peastman commented 1 year ago

Here's a first draft of the script for generating amino acid / ligand pairs. This was really complicated to implement: it combines information from a lot of sources and uses a lot of tools to process it.

Ligand Expo has about 40,000 molecules. I apply a set of filters based on size and content. Most of the filters only exclude a tiny number of molecules. The important one is the limit on molecule size: I chose to include ones with 5 to 45 atoms, which gives about 21,000 molecules. For each one I download a PDB file containing it and identify every amino acid that contacts it. On average there are about 10, with larger ligands tending to have more neighbors. I create a conformation for each ligand/amino acid pair.
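As a rough illustration of the size filter described above, here is a minimal sketch using RDKit; the actual script applies more filters and may count atoms differently (e.g. with or without hydrogens), so the details here are assumptions.

```python
# Minimal sketch of the 5-45 atom size filter described above.
# Whether hydrogens are counted, and the other content filters,
# are assumptions; the real generation script may differ.
from rdkit import Chem

MIN_ATOMS, MAX_ATOMS = 5, 45

def passes_size_filter(smiles: str) -> bool:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return False
    mol = Chem.AddHs(mol)  # count hydrogens explicitly (assumption)
    return MIN_ATOMS <= mol.GetNumAtoms() <= MAX_ATOMS

print(passes_size_filter("CC(=O)Nc1ccc(O)cc1"))  # acetaminophen -> True
```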

I estimate we'll end up with somewhere around 200,000 conformations, maybe a little more, with typical sizes of about 50-80 atoms. So it will be a somewhat expensive dataset to compute but hopefully manageable.

jchodera commented 1 year ago

Do we sample multiple conformations for each small molecule - amino acid pair? Or just a single conformation? It might be important to have some diversity of conformations for each pair so that we have information about the diversity of interactions of this pair.

Do we have a dataset of the relevant monomers as well? If not, do we want to at least include the separated molecules (e.g. where we translate the molecules away from each other by ~10 Å) as well so that the interaction energy is separable from the intramolecular energies?

jchodera commented 1 year ago

Another possibility to make this calculation less expensive is to first fragment the ligand into electronically decoupled fragments. The reduced system size could allow for the inclusion of many more conformations per pair.
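For what it's worth, a minimal sketch of one way such fragmentation could be done, using RDKit's BRICS decomposition; this is my own choice of method, not necessarily what's being proposed, and capping the fragments so they are truly decoupled would need more care.

```python
# One possible way to fragment a ligand: RDKit BRICS decomposition.
# The example molecule is a placeholder; producing chemically capped,
# electronically decoupled fragments would require additional steps.
from rdkit import Chem
from rdkit.Chem import BRICS

mol = Chem.MolFromSmiles("CC(C)Cc1ccc(cc1)C(C)C(=O)O")  # ibuprofen as a placeholder
fragments = BRICS.BRICSDecompose(mol)
print(sorted(fragments))  # fragment SMILES with dummy-atom attachment points
```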

peastman commented 1 year ago

The conformations are taken directly from the PDB (plus some restrained energy minimization). This subset is focused much more on chemical diversity than conformational diversity.
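For context, a rough sketch of what the restrained minimization step might look like in OpenMM; the force field, restraint strength, and input handling are assumptions, and ligand parameterization is glossed over entirely.

```python
# Illustrative sketch of restrained energy minimization of a PDB-derived
# conformation in OpenMM.  Force field, restraint constant, and file names
# are assumptions; the real preparation script may differ.
import openmm
import openmm.app as app
import openmm.unit as unit

pdb = app.PDBFile("ligand_residue_pair.pdb")  # hypothetical input
forcefield = app.ForceField("amber14-all.xml")
system = forcefield.createSystem(pdb.topology, nonbondedMethod=app.NoCutoff)

# Harmonic positional restraints keep every atom near its PDB coordinates.
restraint = openmm.CustomExternalForce("0.5*k*((x-x0)^2+(y-y0)^2+(z-z0)^2)")
restraint.addGlobalParameter("k", 100.0 * unit.kilojoules_per_mole / unit.nanometer**2)
for name in ("x0", "y0", "z0"):
    restraint.addPerParticleParameter(name)
for i, pos in enumerate(pdb.positions):
    restraint.addParticle(i, pos.value_in_unit(unit.nanometers))
system.addForce(restraint)

integrator = openmm.VerletIntegrator(0.001 * unit.picoseconds)
context = openmm.Context(system, integrator)
context.setPositions(pdb.positions)
openmm.LocalEnergyMinimizer.minimize(context)
minimized_positions = context.getState(getPositions=True).getPositions()
```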

jchodera commented 1 year ago

I think we'll need some kind of information about the magnitude of the interaction. I know your position is that by training on everything, the model should somehow be able to learn what the separated component energies are in order to be able to deconvolve the interactions from the individual components, but it's important to understand that users may not use the entire dataset altogether, and instead take these datasets as being useful on their own or in smaller combinations. If we can take some steps to make sure information about interactions is present in each dataset, we are doing our users a useful service.

For example, at minimum, we could include, for each molecule-residue pair, the interacting pair from the PDB and a separated pair where one component is translated 10 Å away. That would at least ensure the dataset contains information on the magnitude of the interaction.
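A minimal sketch of what generating such a separated conformation might look like; the array layout, index selection, and the choice of translating along the centroid-centroid vector are my assumptions.

```python
# Sketch of building the "separated pair" described above: translate the
# ligand 10 Angstroms away from the residue along the vector between their
# centroids.  Coordinate layout and atom indices are assumptions.
import numpy as np

def separate_pair(coords, ligand_idx, residue_idx, distance=10.0):
    """Return a copy of coords (N x 3, in Angstroms) with the ligand atoms
    translated `distance` Angstroms away from the residue."""
    coords = np.asarray(coords, dtype=float)
    direction = coords[ligand_idx].mean(axis=0) - coords[residue_idx].mean(axis=0)
    direction /= np.linalg.norm(direction)
    separated = coords.copy()
    separated[ligand_idx] = separated[ligand_idx] + distance * direction
    return separated
```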

If we do a poor job of minimizing with a surrogate potential before feeding to QM, however, the data will still be garbage, so if we can include a few more configurations near contact, or take a few steps of gradient descent with QM, we will have a lot more useful information.

peastman commented 1 year ago

> it's important to understand that users may not use the entire dataset altogether, and instead take these datasets as being useful on their own

That is exactly what no one should ever do! SPICE is meant to be used as a complete dataset. The individual subsets are not useful in isolation. That's why we distribute it as a single hdf5 file for the whole dataset, not a set of smaller files.

Each subset is intended to provide particular information, not to provide all types of information about a region of chemical space. For example, some subsets focus on covalent interactions while others focus on non-covalent interactions. Some subsets focus on chemical diversity while others focus on conformational diversity. Each subset is designed to add new information to the whole, not to be useful in isolation.

peastman commented 1 year ago

Here's how we described it in the paper:

> The SPICE dataset is made up of a collection of subsets, each designed to add particular information. For example, some focus on covalent interactions while others focus on noncovalent ones. Different subsets also contain different types of chemical motifs. The goal is that when all the subsets are combined into the complete dataset, it should have broad sampling of all types of interactions found in the systems we are interested in.

The goal of this particular subset is to provide information about noncovalent interactions between proteins and ligands. That's its only goal. Other types of information are provided by other subsets.

jchodera commented 1 year ago

> The goal of this particular subset is to provide information about noncovalent interactions between proteins and ligands. That's its only goal. Other types of information are provided by other subsets.

I do understand the philosophy, but there are so many assumptions that are going into this that we very much risk this not being useful:

  1. We assume that the monomer energies can all be accurately modeled for the diversity of chemical species represented here by using all the other datasets.
  2. We assume that the surrogate potential is sufficiently close to the QM potential that the energy will be "close" (<< 1 kcal/mol error compared to the well-separated dimer); if not, there may be essentially no information content about thermodynamically populated regions of configurational space.
  3. We assume that a single conformation of the dimer contains most of the additional information content (in the context of the rest of the dataset), whereas adding a few more snapshots with the dimer well-separated or perturbed a bit could contain much more information (at least within this dataset).

I'm just skeptical all of these assumptions hold to the degree you're expecting them to hold---especially as the person who built one of the force fields you're using!---and think there are simple steps we could take to hedge our bets in case these assumptions do not all hold.

jchodera commented 1 year ago

Here's an alternative idea: if we use these configurations as is to generate an OptimizationDataset to collect 3-5 steps of gradient descent on the QM potential surface for each snapshot, the energies and gradients collected provide a wealth of information that hedges our bets on these assumptions at 3-5x the cost.

peastman commented 1 year ago

> We assume that the surrogate potential is sufficiently close to the QM potential that the energy will be "close"

It's not an assumption. I tested this when designing the first version of the dataset, and confirmed it's true (see https://github.com/openmm/spice-dataset/issues/2#issuecomment-910605529 and https://github.com/openmm/spice-dataset/issues/2#issuecomment-910871597). But it also doesn't matter. We're learning physics, not energies of particular molecules. That's especially true when training on forces, which are purely local. As long as the local environment of each atom is in the general region populated during simulations, which is certainly true, we'll get useful information.

> We assume that a single conformation of the dimer contains most of the additional information content

We assume that hundreds of thousands of conformations for tens of thousands of dimers contain information about many different local arrangements of atoms. Any one conformation has very little information, but that's ok.

> If we use these configurations as is to generate an OptimizationDataset to collect 3-5 steps of gradient descent on the QM potential surface for each snapshot

That enormously increases the cost. Each step of gradient descent is a separate DFT calculation. And optimization trajectories are a very inefficient way of sampling the energy surface, because they produce clusters of conformations that are very close together, whereas you really want widely separated conformations that sample different regions of chemical and/or conformational space. This is another thing we studied while designing the first version of the dataset.

jchodera commented 1 year ago

> It's not an assumption. I tested this when designing the first version of the dataset, and confirmed it's true (see https://github.com/openmm/spice-dataset/issues/2#issuecomment-910605529 and https://github.com/openmm/spice-dataset/issues/2#issuecomment-910871597).

We do have data on this now. The retrospective potential energy RMSE between ff14SB and the B3LYP-D3BJ/DZVP level of theory (the OpenFF default) for the SPICE dipeptide set is 4.44 kcal/mol. This is very high. In a statistical mechanics sense, trying to reweight snapshots from thermal equilibrium at ff14SB to thermal equilibrium at QM would result in 99.9998% of the work being wasted.
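To make the reweighting argument concrete, here is a minimal sketch (my own illustration, not the calculation behind the 99.9998% figure) of estimating the effective sample fraction from per-snapshot MM-to-QM energy differences via importance weights:

```python
# Illustrative sketch: effective sample size when reweighting snapshots from
# an MM ensemble to a QM ensemble with weights w_i = exp(-beta * dU_i).
# The Gaussian energy differences below are synthetic; this is not the exact
# calculation behind the number quoted above.
import numpy as np

KT = 0.593  # kcal/mol at ~298 K

def effective_sample_fraction(delta_u):
    """delta_u: per-snapshot U_QM - U_MM in kcal/mol."""
    delta_u = np.asarray(delta_u, dtype=float)
    log_w = -delta_u / KT
    log_w -= log_w.max()              # weights are relative; avoid overflow
    w = np.exp(log_w)
    # Kish effective sample size divided by the number of snapshots
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

rng = np.random.default_rng(0)
print(effective_sample_fraction(rng.normal(0.0, 4.4, size=100_000)))
```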

Interestingly, the experiment you mention had a few differences:

  1. It was performed for energy-minimized snapshots, not snapshots sampled from equilibrium
  2. It was only for alanine dipeptide
  3. The QM level of theory was WB97X-D3/def2-TZVP for your test

The small molecule sets don't fare any better---the error on SPICE PubChem to OpenFF 2.0.0 is 4.43 kcal/mol. This is again for the B3LYP-D3BJ/DZVP energies---the level of theory that OpenFF is fit to.

It does seem like it's worth revisiting this experiment to check that assumption, since our data on the completed SPICE 1.0 dataset suggests it doesn't hold.

> But it also doesn't matter. We're learning physics, not energies of particular molecules.

This is certainly our goal, but far from where we are right now. I haven't seen any significant evidence that current generation models are "learning physics". The MM models---which we claim are "physical"---certainly would not do well with energies that are way up an anharmonic wall, since they omit all higher-order anharmonicities. We might hope that the ML potentials can extrapolate to thermally relevant regions even from high-energy regions, but I have not seen clear evidence of this.

> That's especially true when training on forces, which are purely local. As long as the local environment of each atom is in the general region populated during simulations, which is certainly true, we'll get useful information.

The forces are just the gradient of the energy. They do provide a bit more non-local information about the potential, but there is nothing like a Lipschitz bound that restricts how rapidly they can vary. So it might say very little about thermally relevant or low-energy regions.

peastman commented 1 year ago

To keep things manageable, I split it into multiple files based on the number of atoms in the ligand. I got through ligands with up to 36 atoms. That includes 194,275 dimer conformations involving 10,584 ligands. I could easily do more, but that's already a lot of data. Given uncertainties about how much compute time we'll have, it seemed sensible to stop there for the moment.

I tried enabling hdf5 compression, but it made the files larger instead of smaller. On the other hand, just gzipping the complete files is fairly effective, roughly cutting the file size in half.
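For reference, a sketch of the two approaches being compared; the dataset layout and compression level are illustrative, and the per-dataset filter losing out is plausibly due to chunk and metadata overhead on many small datasets, though that's speculation on my part.

```python
# Sketch of the two compression approaches compared above (names and data
# are placeholders, not the actual SPICE file layout).
import gzip
import shutil

import h5py
import numpy as np

conformations = np.random.rand(100, 60, 3)  # placeholder data

# Approach 1: per-dataset gzip filter inside the HDF5 file.  For many small
# chunked datasets, chunk and metadata overhead can outweigh the savings.
with h5py.File("pairs_compressed.hdf5", "w") as f:
    f.create_dataset("conformations", data=conformations,
                     compression="gzip", compression_opts=4)

# Approach 2: write an uncompressed HDF5 file, then gzip the whole file.
with h5py.File("pairs.hdf5", "w") as f:
    f.create_dataset("conformations", data=conformations)
with open("pairs.hdf5", "rb") as src, gzip.open("pairs.hdf5.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```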

peastman commented 1 year ago

> The forces are just the gradient of the energy. They do provide a bit more non-local information about the potential, but there is nothing like a Lipschitz bound that restricts how rapidly they can vary. So it might say very little about thermally relevant or low-energy regions.

What I meant is that energy is extensive while forces are intensive. If you make a small error in every atom, it will lead to a big error in the total energy but only a small error in the forces. All those small errors add up to make training on energy less effective. But they don't make training on forces much less effective.
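A toy numerical illustration of that scaling argument (my own sketch, with arbitrary numbers): per-atom errors drawn from a fixed distribution produce a summed, energy-like error that grows with system size, while the typical per-atom, force-like error stays the same.

```python
# Toy illustration of the extensive-vs-intensive argument above.  Per-atom
# errors come from the same distribution regardless of system size; the
# summed (energy-like) error grows with N, the per-atom RMS error does not.
import numpy as np

rng = np.random.default_rng(0)
per_atom_sigma = 0.1  # hypothetical per-atom error, arbitrary units

for n_atoms in (10, 100, 1000, 10000):
    errors = rng.normal(0.0, per_atom_sigma, size=n_atoms)
    print(f"N={n_atoms:6d}  summed error={abs(errors.sum()):7.3f}  "
          f"per-atom RMS error={np.sqrt((errors ** 2).mean()):.3f}")
```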