Virtual sites: Group particles from common matches into virtual "sites"

openforcefield / openff-interchange

A project (and object) for storing, manipulating, and converting molecular mechanics data.

https://docs.openforcefield.org/projects/interchange

MIT License

69 stars 22 forks source link

Virtual sites: Group particles from common matches into virtual "sites" #227

Open mattwthompson opened 3 years ago

mattwthompson commented 3 years ago

For reasons that make the implementation in the toolkit tractable, the toolkit groups multiple virtual particles into virtual sites so that a virtual site can, and often does, define a group of several virtual particles. This is confusing to me, if only for the fact that I have always thought a "virtual site" corresponds to a single discrete point in space, not potentially a collection of them. The SMIRNOFF spec reads like it's based off of this assumption.

We should decide, ideally early on, if we want virtual sites to exist as abstract groupings of particles or (my preference) there is no distinction between the two, and create a different term for a grouping of multiple particles.

https://open-forcefield-toolkit.readthedocs.io/en/0.9.2/virtualsites.html#support-for-the-smirnoff-virtualsite-tag https://open-forcefield-toolkit.readthedocs.io/en/0.9.2/smirnoff.html#virtualsites-virtual-sites-for-off-atom-charges

mattwthompson commented 3 years ago

I think one place in which this manifests is whether or not to have the behavior of _reduce_virtual_particles_to_sites as happens here, i.e. if maybe there are other ways to be able to group and query individual particles (probably through a good design of VirtualSiteKey) without creating the virtual site layer.

mattwthompson commented 3 years ago

An attempt at a minimal snippet:

>>> from openff.toolkit.tests.test_forcefield import *
>>> file_path = get_data_file_path("test_forcefields/test_forcefield.offxml")
>>> forcefield = ForceField(file_path, TestForceFieldVirtualSites.xml_ff_virtual_sites_bondcharge_match_once)
>>> top = Molecule.from_smiles("N#N").to_topology()
>>> matches = forcefield['VirtualSites'].find_matches(top)
>>> sites = forcefield['VirtualSites']._reduce_virtual_particles_to_sites(matches)
>>> matches
{(0,
  1): [<openff.toolkit.typing.engines.smirnoff.parameters.ParameterHandler._Match at 0x7faa0db07820>],
 (1,
  0): [<openff.toolkit.typing.engines.smirnoff.parameters.ParameterHandler._Match at 0x7faa0d6930d0>]}
>>> matches[(0, 1)][0].parameter_type
<VirtualSiteBondChargeType with smirks: [#7:1]~[#7:2]  epsilon: 0.2 kcal/mol  distance: 0.2 nm  charge_increment1: 0.2 e  
charge_increment2: 0.2 e  type: BondCharge  sigma: 0.2 A  name: EP  match: once  >
>>> sites
[[<VirtualSiteBondChargeType with smirks: [#7:1]~[#7:2]  epsilon: 0.2 kcal/mol  distance: 0.2 nm  charge_increment1: 0.2 e  charge_increment2: 0.2 e  type: BondCharge  sigma: 0.2 A  name: EP  match: once  >,
  [(0, 1), (1, 0)]]]

is the information content of matches sufficient to store the state we're after? As a reminder, here's what we currently think that is:

General non-bonded parameters (values of the sigma/epsilon, charge increments)
Type-specific parameters (here: distance, but also angles for other types)
Type of virtual site (i.e. BondCharge here)
"Handedness" (implied by the ordering of the atom indices in the tuples)
Other data I'm forgetting? @trevorgokey would likely know

Above is also probably not covering various tricky cases that I have not considered - not sure if it's best to hammer out a simple case and stress it with complexity or worry about the complexity from the gate.

Maybe to get things back on track: what again are the benefits/downsides of grouping particles into sites, and is that the best approach for this representation?

davidlmobley commented 3 years ago

It seems like at some level you're asking about two things here: 1) Philosophically, what our terminology ought to be (is a "virtual site" a single particle/single site, or multiple ones?) 2) The code behavior of this

I have few thoughts about (2) at this point, but I agree that (1) may be confused/confusing at present. I think there are some good reasons why, for some specific chemistries, we want a single SMARTS pattern/"virtual site definition" to result in the placement of more than one virtual particle/virtual site (e.g. for TIP5P water for example, where you want to be able to use a single pattern to tell it where to put both virtual sites).

I have to admit I'm not quite sure whether I understand your language here: "We should decide, ideally early on, if we want virtual sites to exist as abstract groupings of particles or (my preference) there is no distinction between the two, and create a different term for a grouping of multiple particles." e.g. "between the two of what?"

I think what we want is (some term to refer to groups of virtual particles) and (some term to refer to how a virtual particle or group of virtual particles is defined in the spec), and then to use "virtual site" to describe the place where a single virtual particle is placed. I think if we use "virtual site" to describe an abstract grouping of particles that will be confusing, since it's a singular term. (And yes, I acknowledge that's probably how we currently use the term.)

trevorgokey commented 3 years ago

The primary reason for coupling multiple particles in a virtual site is mainly for force field fitting. TIP5P is the typical example, where we want only one angle and distance to represent both particles, and only have 2 degrees of freedom instead of 4. If this interchange object can abstract the ability to couple parameters in this way, I think the necessity for a virtual site to hold multiple particles can be removed.

Based on the last meeting, the interchange design appears to be headed more towards virtual sites being completely determined by the parameterization, with only some light read-only capabilities of virtual site properties in the topology (via the TopologyKey/PotentialKey coupling). This design essentially releases the need to group particles into sites, since the various weights/displacements are pulled directly from the FF/handler and we never want to manipulate the parameterization from a molecule/topology object. During the design of virtual sites, I was trying to support the existing code that was already there, which was the ability to define virtual sites in the molecule object with the assumed ability to somehow export the definitions directly from the molecule. This use case seems to be taken off of the table now.

As for the information/state needed. The name/smirks/type defines uniqueness during parameterization and is essentially only needed when finding virtual sites, e.g. in find_matches. The keyword match instructs how many particles to create during find_matches, and the rest are the typical FF parameters. Lastly, just to put it out there, is how to define the exclusions/exceptions. This is a handler-level attribute, but the exclusions are handled during parameterization/OpenMM system creation.

I think one place in which this manifests is whether or not to have the behavior of _reduce_virtual_particles_to_sites

Removing this method would likely get you 95% to what you want. However, this would be outside the scope of the interchange it seems... but yes, you could e.g. change it to _couple_virtual_sites or something instead, depending on your implementation.

Food for thought, but virtual sites would have been much easier to implement if each virtual site type was its own handler.

j-wags commented 3 years ago

TIP5P is the typical example, where we want only one angle and distance to represent both particles, and only have 2 degrees of freedom instead of 4. If this interchange object can abstract the ability to couple parameters in this way, I think the necessity for a virtual site to hold multiple particles can be removed.

I was just drafting a response as well, but @trevorgokey puts it perfectly here.

Overall I'm OK with storing virtual particles without grouping them into virtual sites. The only thing we lose here is the ability to say "This TIP5P virtual particle is related to this other virtual particle on the same water molecule". But as both @mattwthompson and @trevorgokey say, a user could reconstruct that hierarchy using something like the _reduce.. logic linked above.

Additionally, NOT encoding any assumptions about the grouping of virtual particles will make it more likely that we can interoperate nicely with vsites that come from other ecosystems with different assignment rules.

j-wags commented 3 years ago

I do see @davidlmobley 's point about terminology.

We've kinda painted ourselves into a corner, since the SMIRNOFF spec uses the singular term VirtualSite in a way that could correspond to many particles, and so does the current Molecule object. As a starting point for iteration, how would this terminology work?

"Virtual site" = "Virtual particle" = a single off-center interaction particle
"Virtual site PARAMETER" = something with the information content from a single line in an OFFXML

"Virtual site parent" or "virtual site group"? = something with the information content of sites[0] from above (copied here)

>>> sites
[[<VirtualSiteBondChargeType with smirks: [#7:1]~[#7:2]  epsilon: 0.2 kcal/mol  distance: 0.2 nm  charge_increment1: 0.2 e  charge_increment2: 0.2 e  type: BondCharge  sigma: 0.2 A  name: EP  match: once  >,
[(0, 1), (1, 0)]]]

mattwthompson commented 3 years ago

@davidlmobley I was not clear but your interpretation was exactly the point I was trying to get across. I agree with your proposal and will try to figure out some good terms and definitions to use - probably something in line with @j-wags's recommendations.

@trevorgokey In the TIP5P case, I think that the current sign exhibits the reduced degrees of freedom; I've pulled this together from the unit tests you wrote last year

In [13]: matches = off_ff['VirtualSites'].find_matches(molecule1.to_topology())

In [14]: [x[0].parameter_type for x in matches.values()]  # These are equivalent, I think
Out[14]:
[<VirtualSiteDivalentLonePairType with smirks: [#1:1]-[#8X2H2+0:2]-[#1:3]  epsilon: 0.0 kcal/mol  distance: 0.7 A  charge_increment1: 0.1205 e  charge_increment2: 0.0 e  charge_increment3: 0.1205 e  type: DivalentLonePair  outOfPlaneAngle: 54.71384225 deg  sigma: 1.0 A  name: EP  match: all_permutations  >,
 <VirtualSiteDivalentLonePairType with smirks: [#1:1]-[#8X2H2+0:2]-[#1:3]  epsilon: 0.0 kcal/mol  distance: 0.7 A  charge_increment1: 0.1205 e  charge_increment2: 0.0 e  charge_increment3: 0.1205 e  type: DivalentLonePair  outOfPlaneAngle: 54.71384225 deg  sigma: 1.0 A  name: EP  match: all_permutations  >]

In [15]: [*matches.keys()]
Out[15]: [(0, 1, 2), (2, 1, 0)]

Because the two parameter_types are identical, I think these would map onto the same PotentialKey (and Potential), so force field fitting would involve tuning just distance or outOfPlaneAngle, not duplicates of each, and have those changes reflected in the geometry of each particle. (This is assuming that the VirtualSiteKey can contain enough information to place them correctly; I think we are optimistic about that but I have not thought about it as much yet.) Does this make sense/sound okay?

trevorgokey commented 3 years ago

Yeah that sounds great. Since you are using more sophisticated Topology keys, I am assuming you won't be storing just the tuple (0, 1, 2) as the key or to establish uniqueness. It is possible to get the same tuple for a BondCharge if the SMIRKS tags 3 atoms.

trevorgokey commented 3 years ago

To build on what I just wrote, the current implementation takes the list virtual sites per tuple produced by find_matches (e.g. (0,1,2)), and one of the tasks of _reduce_virtual_particles_to_sites is to go through the lists and determine which virtual site parameters are the "same" and should be coupled (via creating the site layer).

mattwthompson commented 3 years ago

What information content is necessary to ensure uniqueness? I assume we'll be storing a tuple of atom indices, the type of virtual site - what else? We can construct VirtualSiteKey to fit our needs here.

mattwthompson commented 3 years ago

Or is checking for/dealing with duplicates another processing step that needs to be done?

trevorgokey commented 3 years ago

I buried this in my long-winded response above, but to establish uniqueness we needed smirks/type/name/indices to fully address a single particle.