Implement our own Topology - what features does it need?

davidlmobley commented 6 years ago

It looks like we're headed towards needing our own Topology object, because as we progress towards including virtual sites, OpenMM's Topology objects won't cover what we need. For example, we will need the ability to have topologies which allow for easy extraction of (a) which particles are physical (chemical), (b) the full set including virtual sites, and (c) utility functionality for extracting physical positions given a full set of positions, etc. This should also preserve bond orders, etc., to allow easy creation of OEMol or RDKit molecules corresponding to the components of the system, etc.

Some of this is discussed here

@jchodera , other thoughts on what this should do/be like?

jchodera commented 6 years ago

Thanks, @davidlmobley! Here are some initial thoughts:

I'd like to see the Topology class have the following capabilities:

Serializable, with support for JSON and BSON

In addition to being pickleable, JSON/BSON support would be nice

serialized_topology = openmforcefield_topology.as_json()
openforcefield_topology = Topology.from_json(serialized_topology)

Do we want an XML representation too?

Convenient instantiation

It would be convenient to easily instantiate from an OpenMM Topology, but we'd need to specify the OEMol or RDKit molecules that go along with this so we can perceive the chemical species in the file:

openforcefield_topology = Topology(openmm_topology, molecules)

or, similar to MDTraj, we could provide a convenience static method

openforcefield_topology = Topology.from_openmm(openmm_topology, molecules)

We could also provide easy instantiation from other files that MDTraj supports, like PDB files:

openforcefield_topology = Topology('myfile.pdb', molecules)

Alternatively, we can focus on instantiating from representations that provide a complete description of coordinates and chemical bonds, but part of the reason we're building this is that they don't really exist.

Pythonic iteration of chemical components

We need to decide what kinds of components we want. Do we want to keep the same chains, residues, atoms organization? Is this compatible with current mmCIF/PDBx hierarchical organization?

We can also make the distinction between enumerating over atoms (real, chemical components) and particles or sites (which may include virtual sites for off-atom charges or Drude oscillators):

# Enumerate over chemical atoms
for atom in openforcefield_topology.atoms:
    print(atom)
# Enumerate over particles (which could include virtual sites or Drude oscillators)
for particle in opneforcefield_topology.particles:
    print(particle)

Easy retrieval of an OpenMM topology

Similar to MDTraj, an easy way to retrieve an OpenMM Topology:

# retrieve all particles
openmm_topology = openforcefield_topology.to_openmm()
# retrieve only chemical atoms
openmm_topology = openforcefield_topology.to_openmm(atoms_only=True)

Awareness of molecules and convenience functions for molecule-based toolkits

We need the new Topology to be aware of molecules and be able to produce OpenEye or RDKit versions of these:

for molecule in openforcefield_topology:
    rdmol = molecule.as_rdkit() # requires RDKit toolkit installed
    oemol = molecule.as_oemol() # requires OpenEye toolkit installed

Or we could have topology.molecules return different types depending on what toolkit we instantiated with:

openforcefield_topology = Topology(openmm_topology, toolkit=openeye)
for oemol in openforcefield_topology.molecules:
   openeye.oeomega.omega(oemol) # oemol is an OpenEye molecule because we instantiated Topology with toolkit=openeye instead of rdkit
   ....

Alternatively, we can go a step further and also create a Molecule class that isolates the functionality we need entirely from rdkit and openeye toolkits, but this runs the risk of requiring us to implement a lot more functionality later.

I think we do want to aim to isolate as much toolkit-dependent functionality within Topology as possible, however.

Convenience enumeration of angles and torsions

for angle in openforcefield_topology.angles:
   print(angle)
for torsion in openforcefield_topology.torsions:
   print(torsion)
propers = [proper for torsion in openforcefield_topology.proper_torsions]
impropers = [proper for torsion in openforcefield_topology.improper_torsions]

SMARTS-based atom match enumeration

We should move our SMARTS-based match enumeration into the Topology:

smarts_matches = openforcefield_topology.get_SMARTS_matches(smirks, aromaticity_model=None, toolkit=rdkit)

Support for annotation of atoms, bonds, angles, torsions, etc.

We could support Wiberg bond order via annotation, or we could make fraction bond order a first-class part of the API. For example:

# Compute Wiberg bond order and annotate bonds
for molecule in openforcefield_topology.molecules:
    oemol = molecule.as_oemol()
    status = openeye.oequacpac.OEAssignPartialCharges(oemol, getattr(oequacpac, oechargemethod), False, False)
    for (bond, oebond) in zip(molecule.bonds, oemol.GetBonds()):
        bond.set_data('WibergBondOrder', oebond.GetData("WibergBondOrder"))

Various schemes for atom selection

For polymers, PyMOL (or perhaps MDTraj DSL) syntax is useful.
For small molecules, SMARTS syntax is useful

Tagging @andrrizzi, @lnaden, @pgrinaway for additional comment.

davidlmobley commented 6 years ago

I agree with all of these, though I don't have much of an understanding of molecule selection syntax at this point.

Wiberg bond order: I think it should be a first-class part of the API, as I think this is going to end up being very important.

Another thing I wonder about is if there should be an easy way to go back and forth between molecules and a ChemicalEnvironments like object for easier interface with SMIRKY type moves down the road; it's not immediately obvious to me how we would want these two things to interface but perhaps there's a reason for it.

@bannanc , any thoughts?

Do we want an XML representation too?

I don't see an obvious reason for this at this point, though I don't have a problem with it. I'd defer.

jchodera commented 6 years ago

Thanks! The biggest questions for you right now are the following:

My current plan is that we would break the API for this, so that ForceField.createSystem() would return a new Topology object and no longer accept a list of molecule objects (since they are encoded inside topology):
```
forcefield = ForceField( 'vsite_forcefield.ffxml')
[system, new_topology] = forcefield.createSystem(topology)
```
This will break all our code, but should be easy to fix the code to update that.
Do you have a preference for how we expose oemol/rdmol objects? Should we
1. Encapsulate all the oechem/rdkit functionality inside of Topology and allow retrieval of both oemol/rdkit molecules on request
2. Have Topology use either the oechem/rdkit toolkit selected upon initialization, but not both
3. Totally isolate oechem/rdkit with our own Molecule class that pushes all toolkit-specific issues into that class

I imagine option 1 is the easiest for now, but risks that we will break things again when we refactor to further isolate the toolkits behind another layer of abstraction. We could potentially just make some parts of the API private now so that users can still access OEMol/RDMol representations but understand those private parts of the API may change.

Another thing I wonder about is if there should be an easy way to go back and forth between molecules and a ChemicalEnvironments like object for easier interface with SMIRKY type moves down the road; it's not immediately obvious to me how we would want these two things to interface but perhaps there's a reason for it.

This should be no problem. We could have openforcefield_topology.get_environment_matches() just accept either SMARTS strings or ChemicalEnvironment objects (which can be reduced to a SMARTS representation).

davidlmobley commented 6 years ago

@jchodera

Thanks! The biggest questions for you right now are the following:

My current plan is that we would break the API for this, so thatForceField.createSystem() would return a new Topology object and no longer accept a list of molecule objects (since they are encoded inside topology): forcefield = ForceField( 'vsite_forcefield.ffxml') [system, new_topology] = forcefield.createSystem(topology)

Yes, I agree with this. We'd been headed towards having Topology not require oemols anyway, and in my view this is not that big a step past that (and a needed step).

Do you have a preference for how we expose oemol/rdmol objects? Should we i. Encapsulate all the oechem/rdkit functionality inside of Topology and allow retrieval of both oemol/rdkit molecules on request ii. Have Topology use either the oechem/rdkit toolkit selected upon initialization, but not both iii. Totally isolate oechem/rdkit with our own Molecule class that pushes all toolkit-specific issues into that class

Well, I really don't like option (ii) because it makes it harder to cross-compare and interoperate, which seems like the opposite of what we want in general. So I think I can rule that one out. I find option (iii) most appealing as it seems the best way to future-proof things (e.g. if someone has another, better chemistry toolkit, or their own internal chemistry toolkit, or whatever, and wants to extend to interface with this new toolkit, it won't break things again or require further abstraction). I could be persuaded to go with option (i) since it is easier but suspect eventually we'll need option (iii). (e.g. what about use in MOE, or Schrodinger tools, or ... ?)

We could potentially just make some parts of the API private now so that users can still access OEMol/RDMol representations but understand those private parts of the API may change.

Yes. I could still see wanting to access OEMol/RDKit representations.

jchodera commented 6 years ago

Just want to point out it is easy to interoperate with option (ii):

# Use openeye under the hood (we may not even need the 'toolkit' argument since we can just inspect the oemols)
openeye_top = Topology(openmm_topology, oemols, toolkit=openeye)
# Use rdkit under the hood
rdkit_top = Topology(openmm_topology, rdmols, toolkit=rdkit)
# Use the features without caring which one we have
for match in oemol_top.find_matches(SMARTS):
   # do something...

Comparison should also be easy---just swap the oemols for rdmols and everything else is identical, except top.get_molecule() would return an RDKit molecule instead of an OEMol.

davidlmobley commented 6 years ago

Oh, sorry, I totally misread the word "initialization" as "installation", ha.

I still like option (iii) best. Otherwise I don't really care about the distinction between option (i) and option (ii).

jchodera commented 6 years ago

OK, so the Molecule class would then have the same kinds of questions:

How should we construct one? Using constructors?

# From an OpenEye molecule
off_mol = Molecule(oemol)
# From an RDKit molecule
off_mol = Molecule(rdmol)

or from static methods?

# From an OpenEye molecule
off_mol = Molecule.from_oechem(oemol)
# From an RDKit molecule
off_mol = Molecule.from_rdkit(rdmol)

How should retrieval in other packages work?

oemol = off_mol.to_oechem()
oemol = off_mol.as_oemol()
oemol = off_mol.oemol

Which functionality should we expose besides the essentials needed for ForceField? Is it sufficient to just let you retrieve/set it as an oemol or rdmol for now?

andrrizzi commented 6 years ago

Everything sounds great to me!

How should we construct one?

If the internal representation of Molecule is toolkit-independent (or will be), I'd keep the main constructor toolkit-independent too and have separate from_oechem/from_rdkit static constructors that perform conversions from another representation. I think it could be a little more future-compatible with changes in OE and RDKit. We could definitely have a separate static constructor for convenience (something like from_molecule()) that uses inspection to figure things out automatically.

How should retrieval in other packages work?

Personally, I'd prefer to_oechem(), although as_oechem() sounds fine too. I don't find the property syntax as intuitive as the other two, in this case.

davidlmobley commented 6 years ago

I think I'm with Andrea here though I don't have strong opinions about it.

jchodera commented 6 years ago

Thanks!

Another major question is whether and how we should hierarchically organize objects within the Topology. Our main use case is molecule-, atom-, and particle-centric, so these are the two most important ways to iterate over the contents. But do we also may want to have a more traditional way to iterate over things, such as chains and residues. Some hierarchical representations also have a few more levels, such as biological unit. (See mmCIF and the OEHierView from OEBio). What should we include here?

Are there advantages to picking something close to one of the current modern PDB formats (mmCIF or PDBx), for example? These formats may even be extensible to provide a way for us to write and read our Topology in an augmented form of these file formats.

davidlmobley commented 6 years ago

I do think we need (or at least want) the ability to iterate over residues at least (I've never seen a use case for chains, but there may be one...?); in the case of biopolymers this tends to become important (e.g. we may need to have this for dealing with grafting of nonstandard amino acids onto standard biopolymer scaffolds, etc.). It's not immediately obvious to me what we would need that would REQUIRE looping over residues, but it seems like we would probably want it so we don't end up getting stuck because we don't have it.

Also, having a notion of these will presumably make it simpler to round-trip to and from OEMol or similar objects in a non-lossy way, I would imagine.

bannanc commented 6 years ago

I'm catching up on this a little late, but it looks really good to me.

I think I agree with David and Andrea on handling molecules it seems like it should be handled under the hood so a use can just ask for an off_mol given either an oechem or rdk molecule and the differences would be handled under the hood. I don't have a strong opinion on the static method vs. constructor option.

@davidlmobley I was a little confused about this:

Another thing I wonder about is if there should be an easy way to go back and forth between molecules and a ChemicalEnvironments like object for easier interface with SMIRKY type moves down the road; it's not immediately obvious to me how we would want these two things to interface but perhaps there's a reason for it.

Are you thinking the whole molecule should be stored as a ChemicalEnvironment? I think chemical environments should be able to parse SMILES strings right now so you could build one that way. However, I'm not sure what the value of having a whole molecule in a ChemicalEnvironment would be useful. The point of the ChemicalEnvironments in my view was to be able to make changes to SMIRKS patterns, while you could do this for larger substructures (i.e. whole molecules), but I'm not sure what the value would be.

jchodera commented 6 years ago

The point of the ChemicalEnvironments in my view was to be able to make changes to SMIRKS patterns, while you could do this for larger substructures (i.e. whole molecules), but I'm not sure what the value would be.

I think of the main use case here as rendering a ChemicalEnvironment to SMARTS to pull matches out of a Topology---which is very easy to support---but maybe @davidlmobley is thinking of something else?

jchodera commented 6 years ago

One more thing that just came up with Perses: Rendering Topology and Molecule objects as a networkx graph is often really convenient, though we can also encode common useful operations (such as enumerating all bonds involved in rings) as methods if we find them useful.

davidlmobley commented 6 years ago

Are you thinking the whole molecule should be stored as a ChemicalEnvironment? I think chemical environments should be able to parse SMILES strings right now so you could build one that way. However, I'm not sure what the value of having a whole molecule in a ChemicalEnvironment would be useful. The point of the ChemicalEnvironments in my view was to be able to make changes to SMIRKS patterns, while you could do this for larger substructures (i.e. whole molecules), but I'm not sure what the value would be.

No, I think the way John interpreted it above was what I had in mind. I don't see a use case for storing a whole molecule as a chemical environment routinely.

bannanc commented 6 years ago

OK, I think that makes more sense. Otherwise, I think this seems like a good plan.

davidlmobley commented 5 years ago

This is closed; we now have our own Topology.

openforcefield / openff-toolkit