openforcefield / openff-interchange

A project (and object) for storing, manipulating, and converting molecular mechanics data.
https://docs.openforcefield.org/projects/interchange
MIT License
71 stars 22 forks source link

LAMMPS atom writer makes assigning unique IDs to chemically-identical molecules impossible #998

Closed timbernat closed 2 months ago

timbernat commented 3 months ago

Description

I am trying to use Interchange to generate LAMMPS input files for some large polymer systems, which consist of many repeated-and-translated copies of oligomers, plasticizers, and solvent water molecules. I've found that in every case, a unique molecule ID in the atoms table section of the resulting LAMMPS file is only generated for each distinct Molecule chemistry, rather than for each individual Molecule in the Topology.

Some further digging reveals that the culprit is a speed-up update to the LAMMPS valence writers, in particular the _write_atoms() function, which generates molecule IDs from a chemical table hash of each molecule, allegedly to bypass an extra toolkit wrapper conversion step. This molecule CTAB hash, in turn, is only based on basic chemical info of the atoms and bonds in the Molecule object (namely atom symbols, formal charges, stereo, and bond orders).

The consequence of this is that any distinct Molecules with identical chemical tables in the same Topology will unavoidably assigned the same molecule ID in the atoms table when writing to LAMMPS

Reproduction

This turns out to affect even really basic systems, such as a box filled with nothing but water. To illustrate, one could take the waterbox.pdb file found here and execute the following code:

from pathlib import Path
from openff.toolkit import Topology, ForceField

waterbox_path = Path('waterbox.pdb')
watertop = Topology.from_pdb(waterbox_path)

ff = ForceField('tip3p.offxml')
winc = ff.create_interchange(watertop)
winc.to_lammps('water.lammps')

Output

All atoms in the resulting water.lammps file will be labelled with molecule ID 500 (the number of unique water molecules in the Topology), rather than unique IDs from 1-500.

Solution The simplest solution from my point of view is to salt the molecule ID hash in _write_atoms() with the molecule ID to ensure distinct molecules with identical CTABs are hashed as such. This would look something like (interop/lammps/export/export.py, lines 296-300):

    for molecule_index, molecule in enumerate(interchange.topology.molecules):
        molecule_hash = hash(f'{molecule_index}_{molecule.ordered_connection_table_hash()}') # salt with mol ID
        molecule_map[molecule_hash] = molecule_index
        for atom in molecule.atoms:
            atom_molecule_map[atom] = molecule_hash

I've tested on my machine (via manual source code edits in a conda env) that this circumvents the non-uniqueness issue, and have put together a simple PR to do exactly this. Software versions

PEFrankel commented 3 months ago

I was just working on this very issue and this seems to prove it's not isolated since it persists within interchange.to_gromacs as well. A broader solution may be necessary to fix this (i.e. somewhere in interop or just copied in each export).

j-wags commented 3 months ago

Thanks @timbernat for the great writeup and quick PR. Matt is on a much-needed vacation right now so I'll help polish up the PR but his final review won't come until next week :-)

mattwthompson commented 2 months ago

We're pretty sure this is fixed in #1000, which I expect to include in 0.3.28