Near-term quantum machine learning (QML) use cases

openmm / openmm

OpenMM is a toolkit for molecular simulation using high performance GPU code.

1.47k stars 511 forks source link

Near-term quantum machine learning (QML) use cases #2940

Open jchodera opened 3 years ago

jchodera commented 3 years ago

This thread is intended to collect some concrete use cases for adding first-class support for quantum machine learning (QML) use cases.

We have identified several classes of use cases we would like to support in the short term:

Hybrid QML/MM systems: A ligand is treated with QML, such as ANI or SchNet, and the rest with MM. Multiple formulations are possible, but in the simplest case, the ligand vacuum MM energy is subtracted and the ligand vacuum QML energy is added back.
Small QML systems: A small molecule in solvent in a droplet or periodic box, as in this paper.
Hybrid ML/physical potentials such as PhysNet, which use ML for short-range and electrostatics for long-range
Coarse-graining with ML models such as CGSchNet.

Architectures and models we have flagged for support in the near term are

ANI potentials: including ANI-1x, [ANI-1ccx](), and ANI-2x, available within TorchANI.
SchNet potentials: using models available from schnetpack

Hardware-accelerated support for these models will be provided through the new open source NNPOps hardware-accelerated kernel library, which we intend to support acceleration in both molecular simulation codes (beyond OpenMM) and machine learning frameworks (such as PyTorch, TensorFlow, and JAX). OpenMM is also providing plugins that provide support for models exported from PyTorch and TensorFlow, where our open is to eventually provide hardware-accelerated ops for common QML featurization operations for these frameworks that will ensure fast execution within OpenMM.

In this thread, we'll post more details about specific use cases we'd like to support in the very near term.

cc: @peastman @giadefa

jchodera commented 3 years ago

Tagging @maxentile and @mwieder on this use case.

For automating the setup of smally systems entirely treated with QML models, such as a small molecule in a water droplet or periodic box (as in this paper), we might consider providing a Python tool to automate the generation of parameters:

from simtk.openmm.app import MLModelRepository

# Create a factory for ANI-2x models
model = MLModelRepository('ANI-2x')
# or grab a model associated with a specific DOI
model = MLModelRepository('10.2084/jctc.2985019')
# Create an OpenMM System from a Topology where the entire system is treated with QML
system = model.createSystem(topology, periodic=True)
# Simulate it in OpenMM
integrator = openmm.LangevinIntegrator(temperature, collision_rate, timestep)
context  = openmm.Context(system, integrator)
context.setPositions(positions)
integrator.step(nsteps)

In this paper, we performed free energy calculations by interpolating between two potential energy functions:

U(x;\lambda) = (1-\lambda) * U_0(x) + \lambda * U_1(x)

We'd love the ability to do that here. The systems just differed by two variants of the same small molecule (here, tautomers). One way to handle this would be to extend the Topology class to support dummy atoms, and have two Topology objects, topology0 and topology1, that simply differ in which atoms are dummy and which atoms are real. Dummy atoms would be excluded from the QML potential calculations. Then we could define

# Create a factory for ANI-2x models
model = MLModelRepository('ANI-2x')
# Create OpenMM ANI Forces from a Topology where the system is treated with QML
ani_force_0 = model.createForce(topology0, periodic=False)
ani_force_1 = model.createForce(topology1, periodic=False)
# Combine them using an alchemical parameter in a CustomCVForce
cv_force = openmm.CustomCVForce('(1-lambda)*U0 + lambda*U1')
cv_force.addCollectiveVariable('U0', ani_force_0)
cv_force.addCollectiveVariable('U1', ani_force_1)
# Create a System to contain this force
# TODO: Could this be made simpler?
system = System()
[ system.addParticle(atom.element.mass) for atom in topology.atoms() ]
system.addForce(cv_force)
# Add a central harmonic restraint force to water oxygens to keep droplet shape
restraint_force = CustomExternalForce('(K/2) * (x^2 + y^2 + z^2)')
restraint_force.addGlobalParameter('K', 0.6 * units.kilocalories_per_mole / sigma**2)
[ restraint_force.addParticle(index, []) for index, atom in topology.atoms() if (atom.residue.name=='WAT' and atom.name=='O') ]
system.addForce(restraint_force)

This approach is simple, but misses out on some potential optimizations since most of the atoms in the system will see identical features in both systems. There may be other simple alternatives.

In addition, we may want a way to create ANI forces directly:

# Specify an ANIForce with specified parameters, Topology object (to retrieve elements from), and atom indices to apply the force to
ani_force = openmm.ANIForce(parameters, topology, atom_indices)

This API could use refinement, but the key ideas are

We should be able to define or update the parameters programmatically
The Topology can provide the element information
The atom_indices could allow us to restrict the ANI force to apply to only a subset of atoms
There may be other discrete choices to change, such as the cutoff

jchodera commented 3 years ago

Tagging @dominicrufa and @hannahbrucemacdonald on this use case.

The most critical near-term use case is supporting hybrid QML/MM potentials, where part of the system is treated with QML and part with ML. As we recently demonstrated in this preprint, this has the potential to deliver significant increases in accuracy for modeling protein-ligand binding.

The simplest QML/MM hybrid potential subtracts the MM energy of the QML region (e.g., a ligand) before adding back the QML treatment:

In this case, we start with a standard solvated protein-ligand system parameterized with an MM force field, and want to generate the QML/MM system from this. Suppose we've already parameterized the system with openmmforcefields:

from simtk import unit
from simtk.openmm import app
# Use rigid TIP3P water, but don't use constraints to hydrogens for the solutes
forcefield_kwargs = { 'constraints' : None, 'rigidWater' : True, 'removeCMMotion' : False, 'hydrogenMass' : 2*unit.amu }
# Initialize a SystemGenerator using the Open Force Field Initiative "Parsley" force field for small molecules
from openmmforcefields.generators import SystemGenerator
system_generator = SystemGenerator(forcefields=['amber/ff14SB.xml', 'amber/tip3p_standard.xml'], small_molecule_forcefield='openff-1.3.0', forcefield_kwargs=forcefield_kwargs)
# Create an OpenMM System from a Topology object
system = system_generator.create_system(openforcefield_topology)
# Alternatively, create an OpenMM System from an OpenMM Topology object and a list of openforcefield Molecule objects
molecules = Molecule.from_file('molecules.sdf', file_format='sdf')
system = system_generator.create_system(openmm_topology, molecules=molecules)

Now, we need a convenient way to create the QML/MM hybrid system, where we subtract the vacuum energy for the ligand and add back the QML energy for the ligand in vacuum. Ideally, we could also scale this contribution with a lambda parameter:

U(x; \lambda) = U_MM(x) + \lambda * [ U_{QML,ligand}(x)- U_{MM,ligand}(x) ]

to allow computing MM -> QML free energy contributions.

We could also perform free energy calculations directly with the QML/MM potential using a System object that is already alchemically modified, or modifying it after we have constructed the QML/MM System.

We could reuse the ANIForce idea here, applying it only to the ligand atom_indices:

ani_force = openmm.ANIForce(parameters, topology, atom_indices)

but it's harder to specify that we want to subtract all of the force contributions for a separate system in which only the ligand contributes. Any ideas here?

Also, we probably want to (1) make it easy to select a different QML model, and (2) make it very easy to set up the whole parameterized system. Perhaps we want a high-level API like openmmforcefields.generators.SystemGenerator?

system_generator = SystemGenerator(forcefields=['amber/ff14SB.xml', 'amber/tip3p_standard.xml'], small_molecule_forcefield='openff-1.3.0', forcefield_kwargs=forcefield_kwargs, ml_model='ANI-2x')
# Create an OpenMM System using MM parameters for rest of system and QML parameters for specified ligand atom indices
system = system_generator.create_system(openforcefield_topology, ml_atom_indices=ligand_atom_indices)

We would also want to support other QML/MM potential hybrids, such as those where the atom environment around each of the QML atoms is included so that we get some intermolecular interactions in the QML part of the potential, though this is a bit trickier.

peastman commented 3 years ago

I have a few thoughts/questions on this subject. They aren't any kind of coherent plan, just some things that come to mind.

In all of the examples above, you assume the user is using a stock, pre-trained ML model. What about cases where they want to train their own model? Is that a case we want to support? Or do we just say that's beyond the scope of what we're trying to automate, and anyone who wants to train their own model needs to already know how to do it on their own?

At present, there just aren't a lot of stock, pre-trained models available. ANI-1 and ANI-2 are the only commonly used ones I know of. In time that number will hopefully increase. Your MLModelRepository class seems to assume there will be more in the future, and possibly that we're going to be hosting them?

In your discussion of hybrid potentials, you described subtracting the classical energy for the ligand and adding the ML model's energy. An alternative formulation would be to train the model to directly reproduce the difference in energy between classical and QM models. This would make your ML model less general, since it would be specific to a particular classical force field, but it also might make it more accurate. The energy difference should be a lot smaller and, hopefully, a lot smoother than the QM energy on its own, since the classical model should already be capturing a lot of how the energy varies with conformation.

One way to handle this would be to extend the Topology class to support dummy atoms

What extensions did you have in mind? We already have the concept of "extra particles". Any atom whose element is None is considered to be an extra particle, not a physical atom. The Topology doesn't make any assumptions about what it represents, whether it's a vsite or a Drude particle or something else. It's up to other classes (like ForceField) to decide what to do with it.

jchodera commented 3 years ago

In all of the examples above, you assume the user is using a stock, pre-trained ML model. What about cases where they want to train their own model? Is that a case we want to support? Or do we just say that's beyond the scope of what we're trying to automate, and anyone who wants to train their own model needs to already know how to do it on their own?

I think the nearest-term use case is making it easy to integrate stock, pre-trained models.

We absolutely want to enable users to train their own models in their favorite ML framework, but this is more involved, so may be the next nearest-term use case. CCing @maxentile @yuanqing-wang @mwieder @raimis @ppxasjsm to provide input.

They might train to several kinds of data:

Existing QM data from QCArchive: We could provide an easy way to download and cache data from QCPortal or the pre-aggregated ML datasets for training.
Experimental measurements: It's also possible to fit to expectations (such as densities) and free energies (such as ligand binding free energies), though this is more complicated. It involves cycles of generating simulation data with one or more QML or QML/MM models, exporting that data into an ML framework, and taking many iterations of fitting using MBAR to reweight the data based on computing U(x) and dU/dparams(x) within the ML framework. Automating this cycle will be helpful, but much more complicated, and likely done in collaboration with the Open Force Field Initiative (e.g. @SimonBoothroyd and team).

Again, these would probably be Python tools we would engineer within a popular ML framework (e.g. PyTorch) that would make it easy and fast to train or retrain models and expert them to OpenMM. We might also work with the Open Force Field Initiative developers (@j-wags and @dotsdl) to make it easy to take fragments from snapshots from OpenMM into QCFractal to automate QM computation and ingestion into QCArchive so that adaptive model construction would be easier.

At present, there just aren't a lot of stock, pre-trained models available. ANI-1 and ANI-2 are the only commonly used ones I know of. In time that number will hopefully increase. Your MLModelRepository class seems to assume there will be more in the future, and possibly that we're going to be hosting them?

TorchANI has ANI-1x, ANI-1ccx, and ANI-2x, with more on the way. SchNetPack lists trained models available for download. AIMNet provides a pretrained model

These models are coming---the question is what to do about deploying them. Clearly, having everyone download the tools separately, figure out how to install them and their dependencies, and then glue them into their own code isn't desirable. What's the best way for us to streamline this? We might talk with MolSSI about how we might work together to create a repository of these models. Or we could write an adapter to ingest or convert these models from their current host repositories.

In your discussion of hybrid potentials, you described subtracting the classical energy for the ligand and adding the ML model's energy. An alternative formulation would be to train the model to directly reproduce the difference in energy between classical and QM models. This would make your ML model less general, since it would be specific to a particular classical force field, but it also might make it more accurate. The energy difference should be a lot smaller and, hopefully, a lot smoother than the QM energy on its own, since the classical model should already be capturing a lot of how the energy varies with conformation.

That's a perfectly fine use case we should support as well! It has multiple drawbacks (doesn't avoid MM singularities, doesn't allow bond dissociation) but also several advantages (it automatically gets the physical limits correct).

What extensions did you have in mind? We already have the concept of "extra particles". Any atom whose element is None is considered to be an extra particle, not a physical atom. The Topology doesn't make any assumptions about what it represents, whether it's a vsite or a Drude particle or something else. It's up to other classes (like ForceField) to decide what to do with it.

That could work, if classes like ForceField just ignore them!

giadefa commented 3 years ago

On Wed, Dec 9, 2020 at 1:08 AM peastman notifications@github.com wrote:

I have a few thoughts/questions on this subject. They aren't any kind of coherent plan, just some things that come to mind.

In all of the examples above, you assume the user is using a stock, pre-trained ML model. What about cases where they want to train their own model? Is that a case we want to support? Or do we just say that's beyond the scope of what we're trying to automate, and anyone who wants to train their own model needs to already know how to do it on their own?

At present, there just aren't a lot of stock, pre-trained models available. ANI-1 and ANI-2 are the only commonly used ones I know of. In time that number will hopefully increase. Your MLModelRepository class seems to assume there will be more in the future, and possibly that we're going to be hosting them?

In your discussion of hybrid potentials, you described subtracting the classical energy for the ligand and adding the ML model's energy. An alternative formulation would be to train the model to directly reproduce the difference in energy between classical and QM models. This would make your ML model less general, since it would be specific to a particular classical force field, but it also might make it more accurate. The energy difference should be a lot smaller and, hopefully, a lot smoother than the QM energy on its own, since the classical model should already be capturing a lot of how the energy varies with conformation.

TorchMD does practically that. It allows to fit a classical potential of your choice, compute the delta forces and train a pytorch model (in this case Schnet), then run the simulation adding the classical force field with the NNP model. Look at the tutorial: https://github.com/torchmd/torchmd-cg/blob/master/tutorial/Chignolin_Coarse-Grained_Tutorial.ipynb

One way to handle this would be to extend the Topology class to support dummy atoms

What extensions did you have in mind? We already have the concept of "extra particles". Any atom whose element is None is considered to be an extra particle, not a physical atom. The Topology doesn't make any assumptions about what it represents, whether it's a vsite or a Drude particle or something else. It's up to other classes (like ForceField) to decide what to do with it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openmm/openmm/issues/2940#issuecomment-741283241, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOQHNBOJIGJDJN6SQ3LST25ZNANCNFSM4UJQ56FA .

peastman commented 3 years ago

Here's an attempt at summarizing what's been said so far. First of all, there are two major ways someone might want to use this:

Use an existing pre-trained model to run simulations.
Train their own model.

We want to support both, but our initial focus will be on 1. People who want to do 2 are assumed to know what they're doing, so we'll do less to help them out. This will require creating a repository for pre-trained models. We might want to work with MolSSI on this.

Next, there are two types of simulations people might want to do:

Model everything with an ML potential.
Model most of the system with a classical potential and use the ML potential only for a small piece of it.

It's important for us to support both of these.

We want to support hybrid MM/ML potentials. There are two common approaches to this:

Long range interactions are handled entirely with the MM potential. The ML part is used only for short range parts of the potential.
The ML model also affects the long range calculation. For example, in PhysNet a neural network computes the partial charges used to evaluate the long range Coulomb interactions at each time step.

The first one is trivial to implement. The second one takes more work, especially if we're concerned about efficiency. The current implementation of Coulomb interactions is not optimized for the case where all the charges are changing every time step. It might be best to initially focus on the first case and leave the second one for later.

An interesting model to try would be to start with the most recent OpenFF force field and train a SchNet model to compute a correction to it. This would have a number of nice features. It could easily support a very wide range of molecules. It would be trivial to turn the ML part on and off in free energy calculations. It would be really easy to model only part of the system with ML. And we probably could use a simpler SchNet model (fewer layers, fewer features per layer, and/or shorter cutoff) than you need then computing the total energy with ML, which would make it faster.

jchodera commented 3 years ago

@peastman : This is a great summary!

I just wanted to refine one part of this:

We want to support hybrid MM/ML potentials. There are two common approaches to this:

Long range interactions are handled entirely with the MM potential. The ML part is used only for short range parts of the potential.

The ML model also affects the long range calculation. For example, in PhysNet a neural network computes the partial charges used to evaluate the long range Coulomb interactions at each time step.

Instead, there is also a third option (Option 0) that is a special simpler case of Option 1:

All interactions except are handled with MM except the vacuum interactions for a subset of atoms (e.g. a ligand) are replaced with QML

If we could support Option 0 initially, and then later Option 1, we cover a lot of use cases.

Another thing that we can easily experiment with to build faster ANI or SchNet-like models is to experiment with reducing the cost of the NN part by model distillation techniques (training faster-to-compute models with the same fidelity) or reducing the number of members of the ensemble for ANI.

giadefa commented 3 years ago

If I understand option 0, we can already support it in ACEMD and Raimondas will port it to OpenMM as soon as it is validated. He is already runnning in GPUGRID.

On Tue, Jan 5, 2021 at 12:24 AM John Chodera notifications@github.com wrote:

@peastman https://github.com/peastman : This is a great summary!

I just wanted to refine one part of this:

We want to support hybrid MM/ML potentials. There are two common approaches to this:

Long range interactions are handled entirely with the MM potential. The ML part is used only for short range parts of the potential.

The ML model also affects the long range calculation. For example, in PhysNet a neural network computes the partial charges used to evaluate the long range Coulomb interactions at each time step.

Instead, there is also a third option (Option 0) that is a special simpler case of Option 1:

All interactions except are handled with MM except the vacuum interactions for a subset of atoms (e.g. a ligand) are replaced with QML

If we could support Option 0 initially, and then later Option 1, we cover a lot of use cases.

Another thing that we can easily experiment with to build faster ANI or SchNet-like models is to experiment with reducing the cost of the NN part by model distillation techniques (training faster-to-compute models with the same fidelity) or reducing the number of members of the ensemble for ANI.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openmm/openmm/issues/2940#issuecomment-754282134, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOU6XFRGKUQAT33YX53SYJE2XANCNFSM4UJQ56FA .

peastman commented 3 years ago

Option 0 isn't really a hybrid potential. It only uses ML for a small piece of the system, but for that piece it uses ML for the entire potential. It doesn't, for example, use ML for the short range part and classical Coulomb for the long range part.

jchodera commented 3 years ago

Option 0 isn't really a hybrid potential.

This is perhaps a matter of definition, but it's certainly a "hybrid potential" in that it includes the simplest way of combining MM and QML in a manner that provides significant accuracy increases. See our preprint.

If I understand option 0, we can already support it in ACEMD and Raimondas will port it to OpenMM as soon as it is validated. He is already runnning in GPUGRID.

@giadefa: Should we connect up to discuss the technical details?

giadefa commented 3 years ago

Yes, we should talk about it. Maybe we can arrange a time next week.

On Tue, Jan 5, 2021 at 8:31 PM John Chodera notifications@github.com wrote:

Option 0 isn't really a hybrid potential.

This is perhaps a matter of definition, but it's certainly a "hybrid potential" in that it includes the simplest way of combining MM and QML in a manner that provides significant accuracy increases. See our preprint https://www.choderalab.org/publications/2020/7/30/towards-chemical-accuracy-for-alchemical-free-energy-calculations-with-hybrid-physics-based-machine-learning-molecular-mechanics-potentials .

If I understand option 0, we can already support it in ACEMD and Raimondas will port it to OpenMM as soon as it is validated. He is already runnning in GPUGRID.

@giadefa https://github.com/giadefa: Should we connect up to discuss the technical details?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openmm/openmm/issues/2940#issuecomment-754850465, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3KUOWUBDGUMU7C5VZ5CITSYNSKBANCNFSM4UJQ56FA .