openforcefield / openff-interchange

A project (and object) for storing, manipulating, and converting molecular mechanics data.
https://docs.openforcefield.org/projects/interchange
MIT License
71 stars 22 forks source link

Improve the conversion performance of Interchange object to LAMMPS data file #608

Open taidbui opened 1 year ago

taidbui commented 1 year ago

Description It takes too long to write LAMMPS data file for a medium system size.

I experienced this when I tried to convert a rather big system, with 174200 atoms, 165400 bonds, 296000 angles, 415400 dihedrals, and 14400 impropers. I looked into the lammps.py script within https://github.com/openforcefield/openff-interchange/blob/main/openff/interchange/interop/internal/lammps.py and believe that we can optimise the code.

For example in the _write_propers function, we can eliminate the need of using interchange.topology.propers as all information we need is already included in the interchange["ProperTorsions]. Also, using the nested for loop to look up the indices in both interchange['ProperTorsions'] and interchange.topology.propers objects significantly slows down the conversion.

def _write_propers(lmp_file: IO, interchange: Interchange):
    """Write the Dihedrals section of a LAMMPS data file."""
    lmp_file.write("\nDihedrals\n\n")

    proper_handler = interchange["ProperTorsions"]
    proper_type_map = dict(enumerate(proper_handler.potentials))

    proper_type_map_inv = dict({v: k for k, v in proper_type_map.items()})

    for proper_idx, proper in enumerate(interchange.topology.propers):
        indices = tuple(interchange.topology.atom_index(a) for a in proper)

        for top_key, pot_key in proper_handler.key_map.items():
            if indices == top_key.atom_indices:
                proper_type_idx = proper_type_map_inv[pot_key]

                lmp_file.write(
                    "{:d}\t{:d}\t{:d}\t{:d}\t{:d}\t{:d}\n".format(
                        proper_idx + 1,
                        proper_type_idx + 1,
                        indices[0] + 1,
                        indices[1] + 1,
                        indices[2] + 1,
                        indices[3] + 1,
                    ),

So, I rewrote this and tested on a system with 17669 dihedrals using the following and found that the time required to write the dihedrals section in the lammps datafile is much shorter (Wall time: 212 ms vs 25.1 s)

def _write_propers(lmp_file: IO, interchange: Interchange):
    """Write the Dihedrals section of a LAMMPS data file."""
    lmp_file.write("\nDihedrals\n\n")

    proper_handler = interchange["ProperTorsions"]
    proper_type_map = dict(enumerate(proper_handler.potentials))
    proper_type_map_inv = dict({v: k for k, v in proper_type_map.items()})

    top_keys = list(proper_handler.slot_map.keys())
    top_atom_indices = [top_key.atom_indices for top_key in top_keys]

    dihedral_index = {}
    dihedral_indices = []
    for atom_indices in top_atom_indices:
        if atom_indices not in dihedral_index:
            dihedral_index[atom_indices] = len(dihedral_index)
        dihedral_indices.append(dihedral_index[atom_indices])

    pot_keys = [proper_handler.slot_map[top_key] for top_key in top_keys]

    proper_type_idxs = [proper_type_map_inv[pot_key] for pot_key in pot_keys]

    lmp_file.writelines([
        "{:d}\t{:d}\t{:d}\t{:d}\t{:d}\t{:d}\n".format(
            proper_idx + 1, 
            proper_type_idx + 1,
            index[0] + 1, 
            index[1] + 1,
            index[2] + 1, 
            index[3] + 1)
        for proper_idx, proper_type_idx, index in zip(dihedral_indices, proper_type_idxs, top_atom_indices)
    ])

I believe we can optimise the rest of the lammps.py code with a similar approach.

Thanks

mattwthompson commented 1 year ago

Thanks again for the feedback and thorough reporting! I'm not surprised to see this; writing out a medium-to-large system with LAMMPS hits a trifecta of weakpoints of Interchange today

All this being said, these are priorities and I do want to improve the LAMMPS writer and get Interchange working better and quicker with larger systems. It's just that each has been a lower priority than the other functionality that has kept me busy lately. I'd be happy to review a PR that implements these changes - otherwise I'll yoink them and add them myself at some point. The ugly loops-within-loops strategy you're running up against is an early implementation that I just haven't gotten around to changing in each writer (I think the GROMACS writer has these changes).