openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Atom-level and residue-level metadata easily fall out of sync #1921

Open mattwthompson opened 1 month ago

mattwthompson commented 1 month ago

Describe the bug

Updating an atom's residue-relevant metadata does not update the same data on the corresponding residue object. Editing the residue also doesn't update the atom. I'm not sure if this is the intended behavior - if it is, there needs to be safeguards or at least documentation, and if it's not, this is a bug.

To Reproduce

from openff.toolkit import Topology

from openff.utilities import get_data_file_path

protein = Topology.from_pdb(
    get_data_file_path(
        "proteins/MainChain_HIE.pdb",
        "openff.toolkit",
    ),
).molecule(0)

def check():
    for index in [0, -1]:
        assert [*protein.residues][index].residue_number == protein.atom(
            index,
        ).metadata["residue_number"], index

check()  # Passes

for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if atoms are edited")

# Reset changes at atom level
for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) - 10)

for atom in protein.atoms:
    atom.metadata["residue_number"] = str(int(atom.metadata["residue_number"]) + 10)

try:
    check()
except AssertionError:
    print("Mismatch if residues are edited")

Output

$ python indexing.pyThe OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
The OpenEye Toolkits are found to be installed but not licensed and therefore will not be used.
The OpenEye Toolkits require a (free for academics) license, see https://docs.eyesopen.com/toolkits/python/quickstart-python/license.html
Mismatch if atoms are edited
Mismatch if residues are edited

Computing environment (please complete the following information):

Additional context

This behavior makes sense given what I know about the design, but I think it's reasonable for a user to expect that updating residue data on an atom might correspondingly update the residue itself. The Atom.metadata field is implied to do this sort of thing:

An optional dictionary where keys are strings and values are strings or ints. This is intended to record atom-level information used to inform hierarchy definition and iteration, such as grouping atom by residue and chain.

If this is the intended behavior, it'd be nice to have it documented in some sort of "PDB loading cookbook" https://github.com/openforcefield/openff-toolkit/issues/1554

mattwthompson commented 1 month ago

Something might be going on here, can't tell if this is related


ipdb> protein.atom(0).metadata["residue_number"]
'14'
ipdb> protein.residues[0].residue_number
'14'
ipdb> protein.to_topology().atom(0).metadata["residue_number"]
'14'
ipdb> protein.to_topology().residues[0].residue_number
'1'
j-wags commented 1 month ago

Thanks for the great writeup. It's not a bug, just a poorly documented design.

The core issue is that Atom and Molecule/Topology-level metadata aren't kept in sync by default.

Changes to Atom-level metadata can be propagated to relevant HierarchyElements by running offmol.percieve_hierarchy.

There's currently no way to propagate HierarchyElement metadata changes back to the underlying atoms.

Some options I could see are:

Happy to take a PR for any of these or discuss further.

mattwthompson commented 1 month ago

My low-effort before-coffee opinion on this now is that we should quickly document most of this behavior (~few days) before we make any decisions about major design changes (~weeks to several months)

Yoshanuikabundi commented 1 month ago

FWIW, this is documented:

Hierarchy schemes are not updated dynamically; if a Molecule with hierarchy schemes changes, Molecule.update_hierarchy_schemes() must be called before the scheme is iterated over again or else the grouping may be incorrect.

When I was writing the documentation for this feature, I did find it a bit difficult to decide where to put this information, because a lot of users will only interact with, say, Molecule.residues, which can't be documented because it's not really a method. I kind of ended up just trying to put it everywhere. I'm a bit miffed that I didn't put it in Topology.hierarchy_iterator(), because that seems like an obvious place that people might run into it. If you have any other suggestions for more places to mention this, or clearer ways to phrase it, I'm all ears!

I also agree that dynamic iterators would be much less surprising.

mattwthompson commented 1 month ago

FWIW, I searched mostly for "metadata" since that's the way I was interacting with the API. I appreciate that this behavior is mentioned several times in those methods, but they're also methods I didn't need to interact with to get this behavior.

My only suggestion at the documentation level is to include it when Atoms are described, i.e. here or a probably-copy-pasted-bit in Molecule.add_atom:

image

A "Beware! This other thing ... see here" might have helped in this case.

Anyway, I think making the actual behavior less squishy would be more impactful than documenting it a fifth time over. The power and flexibility of this functionality has huge upside but I seem to keep shooting myself in the foot when I try to hook into it for tasks I naively think are straightforward