openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
302 stars 88 forks source link

How to keep parity on the residue & chain name assignment behaviour with v0.11 breaking change? #1764

Open hjuinj opened 8 months ago

hjuinj commented 8 months ago

With release 0.11.0 the Topology class was changed. https://docs.openforcefield.org/projects/toolkit/en/stable/releasehistory.html#atom-metadata-and-hierarchy-schemes-for-iterating-over-residues-chains-etc

In particular if I called off_mol.to_topology().to_openmm() prior to this version, the toolkit will automatically try to find a residue name in the mol_to_residues dictionary and if not found, assign off_mol.name to be the residue name. Similar automation is done for chain assignment.

Beyond 0.11.0, the residue name is set to 'UNK' and chain id to 'X' https://github.com/openforcefield/openff-toolkit/blob/575501a2d736c176e8a88c387d8afa261a0c1f94/openff/toolkit/topology/topology.py#L1841-L1864

What is the best way to keep parity with this behaviour beyond version 0.11? Do I have to iterate through each atom in the off_mol and update the atom metadata field? Or is there a more elegant way? Thank you

mattwthompson commented 8 months ago

If I understand correctly, you're trying to ensure the openmm.app.Topology has residues named according to the various Molecule.names? Mapping molecule names (which don't have standardized meanings) to residue names (which certainly do) is something of a transformation that arguably shouldn't have been done previously. I'm not sure if logic like below could be considered

                if "residue_name" in atom.metadata:
                    atom_residue_name = atom.metadata["residue_name"]
                else:
-                    atom_residue_name = "UNK"
+                    atom_residue_name = "UNK" if atom.molecule.name is not None else atom.moleucle.name

Do I have to iterate through each atom ...

This is the only way I know how to do it (which isn't to say there isn't another way!) since residue information is stored on the atoms themselves. I previously found this to be less-than-ideal, but I forget if we discussed a different way of updating residue information.

The main author of this functionality (@j-wags) is out until December; I'd like his feedback before modifying any of this behavior. For now I think you're stuck iterating through the atoms, but in the future it might be smoother.

hjuinj commented 8 months ago

Thank you for the prompt reply Matt, this sounds good, it will be great if this can be improved in the future.

mattwthompson commented 8 months ago

Thanks for your patience!

I think the documentation could be improved, as tracked in the linked issue. Two features I'd like to get public 👍/👎 on from Jeff when he returns are

  1. Setting residues (or other hierarchy information) from the Topology/Molecule APIs
  2. Falling back to Molecule.name, if present and if atoms do not have residue names, in the above OpenMM conversion(s).

If either of these are good ideas, let's open separate tickets for them.

j-wags commented 7 months ago

@mattwthompson is right - The only way to do it currently is to iterate over all atoms and update their metadata.

Setting residues (or other hierarchy information) from the Topology/Molecule APIs

👍 Good idea, I'd accept a PR with these changes

Falling back to Molecule.name, if present and if atoms do not have residue names, in the above OpenMM conversion(s).

👎 I agree that this was an unnecessary behavior change in 0.11, which I wish I'd caught. But now this is a behavior that users who DID promptly update to 0.11 (or who started using OFFTK since then) have built around for 16 months, and to revert it after all this time would effectively be a second behavior change. Considered as a feature request against the current behavior, the benefit of this change is ambiguous. Most molecule names are more than 3 characters, which leads to confusion when writing PDBs or other formats that expect 3-character resnames. And the automagic "sometimes we use the data from here, but other times we use the data from there" pattern is a common cause of user confusion/bugs in workflows that involve OFFTK. So overall I don't think I'd welcome this change, though maybe I'm overlooking something.