openmm / openmm-ml

High level API for using machine learning models in OpenMM simulations

Models requiring more information #30

Open peastman opened 2 years ago

peastman commented 2 years ago

createSystem() takes a Topology as its input. That's fine for ANI, but some models will require more information. We'll definitely need formal charges, and we might need hybridization states or bond orders. We need to extend the API in some way so that this information can be supplied or determined. Here are some ideas for approaches we might take. They aren't mutually exclusive; we could do more than one of them.

We could extend Topology to store more information. It already can store bond orders, and it would be easy to add formal charges. On its own, this is only one piece of a solution. It still leaves the problem of where to get the information from. None of the standard file formats OpenMM imports contains it. That's why in practice Topologies never specify bond orders, even though in principle they can.
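A minimal sketch of what "storing more information" could look like, using stand-in classes rather than OpenMM's real Topology. The names `SketchAtom`, `SketchBond`, `SketchTopology`, `formal_charge`, and `missing_info` are all hypothetical and chosen only for illustration. The point is that every extra field has to be optional, because nothing guarantees the input file provided it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SketchAtom:
    name: str
    element: str
    formal_charge: Optional[int] = None  # None means "not specified"


@dataclass
class SketchBond:
    atom1: int  # index into SketchTopology.atoms
    atom2: int
    order: Optional[int] = None  # None means "not specified"


@dataclass
class SketchTopology:
    """Hypothetical stand-in, not the OpenMM Topology class."""
    atoms: List[SketchAtom] = field(default_factory=list)
    bonds: List[SketchBond] = field(default_factory=list)

    def add_atom(self, name, element, formal_charge=None):
        self.atoms.append(SketchAtom(name, element, formal_charge))
        return len(self.atoms) - 1

    def add_bond(self, atom1, atom2, order=None):
        self.bonds.append(SketchBond(atom1, atom2, order))

    def missing_info(self):
        """List the optional chemical fields that are unset somewhere."""
        problems = []
        if any(a.formal_charge is None for a in self.atoms):
            problems.append("formal charges")
        if any(b.order is None for b in self.bonds):
            problems.append("bond orders")
        return problems


top = SketchTopology()
c = top.add_atom("C", "C", formal_charge=0)
o = top.add_atom("O", "O")  # charge left unset, as it would be after reading a typical file
top.add_bond(c, o)          # bond order left unset as well
print(top.missing_info())
```

Being able to store the fields does not populate them, which is the gap the remaining ideas try to address.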

We could copy the approach used by ForceField. It would be easy to create a pseudo-forcefield that would fill in chemical information for standard residues by matching templates. Nonstandard ones could be specified manually, similar to the way SMIRNOFFTemplateGenerator works.
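A rough sketch of the template idea, heavily simplified. It looks residues up by name, whereas ForceField matches templates against the bonded structure. The `TEMPLATES` table, the function `assign_formal_charges`, and the `extra_templates` argument are hypothetical, and the charge placements are illustrative conventions only (for a carboxylate, which oxygen carries the charge is an arbitrary choice).

```python
# Hypothetical template table: residue name -> {atom name: formal charge}.
# Only charged atoms are listed; every other atom defaults to 0.
TEMPLATES = {
    "ALA": {},
    "LYS": {"NZ": +1},   # protonated side-chain amine
    "ASP": {"OD2": -1},  # one carboxylate oxygen, chosen by convention
    "GLU": {"OE2": -1},
}


def assign_formal_charges(residues, extra_templates=None):
    """Fill in formal charges for each residue from a template table.

    residues: list of (residue_name, [atom names]) pairs.
    extra_templates: optional user-supplied templates for nonstandard residues.
    Returns one {atom name: formal charge} dict per residue.
    """
    templates = dict(TEMPLATES)
    if extra_templates:
        templates.update(extra_templates)

    result = []
    for res_name, atom_names in residues:
        if res_name not in templates:
            raise KeyError(f"no template for residue {res_name!r}")
        charged = templates[res_name]
        result.append({name: charged.get(name, 0) for name in atom_names})
    return result


charges = assign_formal_charges(
    [("LYS", ["N", "CA", "NZ"]), ("LIG", ["C1", "N1"])],
    extra_templates={"LIG": {"N1": +1}},  # manually specified nonstandard residue
)
print(charges)
```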

Another possibility is to allow createSystem() to accept an OpenFF Topology in place of the OpenMM Topology. It already provides mechanisms for building descriptions of the relevant information. Models that don't need additional information would work with either type of topology.
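One way this could behave, sketched with stand-in classes instead of the real OpenMM and OpenFF topologies. `ElementsOnlyTopology`, `ChargeAwareTopology`, `create_system_sketch`, and the `requires_formal_charges` flag are hypothetical names; the sketch only shows that a model needing nothing extra accepts either input, while a model needing formal charges rejects the simpler one.

```python
class ElementsOnlyTopology:
    """Stand-in for a topology that knows elements but not formal charges."""

    def __init__(self, elements):
        self.elements = list(elements)


class ChargeAwareTopology:
    """Stand-in for a richer topology that also carries formal charges."""

    def __init__(self, elements, formal_charges):
        if len(elements) != len(formal_charges):
            raise ValueError("one formal charge per atom is required")
        self.elements = list(elements)
        self.formal_charges = list(formal_charges)


def create_system_sketch(topology, requires_formal_charges=False):
    """Collect the inputs a model needs, failing early if they are unavailable."""
    charges = getattr(topology, "formal_charges", None)
    if requires_formal_charges and charges is None:
        raise ValueError(
            "this model needs formal charges; pass a topology that provides them"
        )
    inputs = {"elements": list(topology.elements)}
    if requires_formal_charges:
        inputs["formal_charges"] = list(charges)
    return inputs


# Ammonium, NH4+: nitrogen carries the +1 formal charge.
elements = ["N", "H", "H", "H", "H"]
plain = ElementsOnlyTopology(elements)
rich = ChargeAwareTopology(elements, [1, 0, 0, 0, 0])

print(create_system_sketch(plain))                               # elements-only model
print(create_system_sketch(rich, requires_formal_charges=True))  # charge-aware model
```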

We also could try to determine the missing information automatically based on what we do have (elements, bonds, positions). RDKit can do this. In practice I find it isn't very robust, though, so this probably isn't a good idea.

jchodera commented 1 year ago

> We'll definitely need formal charges, and we might need hybridization states or bond orders.

There are several important caveats here:

It's worth asking whether we really need to support these things, since doing so (1) carries a significant infrastructure burden and (2) essentially requires that we carry over all of the awful things about MM into the QML world, hindering rapid progress.

> We could extend Topology to store more information. It already can store bond orders, and it would be easy to add formal charges. On its own, this is only one piece of a solution. It still leaves the problem of where to get the information from. None of the standard file formats OpenMM imports contains it. That's why in practice Topologies never specify bond orders, even though in principle they can.

We would still need to choose (and possibly implement) a single aromaticity model even if we did this, which carries huge infrastructure costs.

> We could copy the approach used by ForceField. It would be easy to create a pseudo-forcefield that would fill in chemical information for standard residues by matching templates. Nonstandard ones could be specified manually similar to the way SMIRNOFFTemplateGenerator works.

This would bring over significant limitations from MM into the QML regime.

> Another possibility is to allow createSystem() to accept an OpenFF Topology in place of the OpenMM Topology. It already provides mechanisms for building descriptions of the relevant information. Models that don't need additional information would work with either type of topology.

The OpenFF molecular topology representations have solved some, but not all, of these problems. They adopt a single aromaticity model implemented (almost consistently) in multiple cheminformatics toolkits, and they provide access to bond orders and formal charges. However, formal charges are not unique: they differ between resonance forms of the same molecule, which creates the chemical equivalence problem. And for biopolymers, they rely on template matching against the PDB Chemical Component Dictionary to determine bond orders, which is also a significant burden.
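To make the resonance point concrete, here is a small illustration using acetate (heavy atoms only). The atom labels and helper functions are hypothetical. Both assignments describe the same molecule and carry the same total charge, yet a model that consumes per-atom formal charges would see different inputs for the two chemically equivalent oxygens.

```python
# Two resonance forms of acetate, CH3-COO(-), as {atom label: formal charge}.
form_a = {"C1": 0, "C2": 0, "O1": -1, "O2": 0}
form_b = {"C1": 0, "C2": 0, "O1": 0, "O2": -1}


def total_charge(formal_charges):
    """Net charge implied by a per-atom formal charge assignment."""
    return sum(formal_charges.values())


def differing_atoms(a, b):
    """Atoms whose formal charge depends on which resonance form was drawn."""
    return sorted(name for name in a if a[name] != b[name])


print(total_charge(form_a), total_charge(form_b))
print(differing_atoms(form_a, form_b))
```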

We should ask ourselves: Do we really want this? Or do we want to focus on architectures that do not require this legacy information, freeing ourselves of the problems inherent to MM potentials?

peastman commented 1 year ago

It's not so much a question of what we want, but rather of what a particular potential function requires. For example, if a particular potential function requires information about bonds, it clearly won't be able to model chemical reactions. But there are lots of applications where that's fine, and there are likely to be potential functions designed for that purpose. If we have no way to specify bond information, it will be impossible for us to support those potentials.

Some potentials will require nothing but elements, and those will be easy to support. But we don't want to be limited to only those.