openforcefield / openff-interchange

A project (and object) for storing, manipulating, and converting molecular mechanics data.
https://docs.openforcefield.org/projects/interchange
MIT License
64 stars 22 forks source link

PACKMOL wrapper with >100,000 atoms #984

Open mattwthompson opened 1 month ago

mattwthompson commented 1 month ago

I heard that the PACKMOL wrapper chokes when more than 99,999 atoms are present in the system. I can't reproduce it:

In [1]: from openff.toolkit import Molecule, unit

In [2]: import numpy

In [3]: from openff.interchange.components._packmol import pack_box

In [4]: molecule = Molecule.from_smiles("c1cc(Cl)ccc1C(=O)CS[C]1=CO[C](F)(F)CC1")

In [5]: molecule.generate_conformers(n_conformers=1)

In [6]: packed = pack_box(
   ...:     molecules=[molecule],
   ...:     box_vectors=100 * numpy.eye(3) * unit.nanometer,
   ...:     number_of_copies=[3334],
   ...: )
[13:51:07] Non-integer PDB serial number 186A0

In [7]: packed.n_atoms
Out[7]: 100020

I suspect the issue is with MDTraj or RDKit's loading of the PDB file. There is already some code that attempts to deal with this: https://github.com/openforcefield/openff-interchange/blob/v0.3.26/openff/interchange/components/_packmol.py#L727-L751

mattwthompson commented 1 month ago

@hannaomi

hannaomi commented 1 month ago

Hi Matt,

Thanks for raising the issue! I've found that any highly solvated systems with >100000 atoms both the RDkit PDB parser and the MDTraj fallback return the same error. I've managed to write a workaround for this by using openbabel to convert the packmol pdb output to xyz format, and then extracting the coordinates. See below function (uses the output file name as function input).

def get_coords_packmol(output_file_path):
    """
    This function reads a PDB file, converts it to XYZ format using Open Babel, 
    and extracts the atomic symbols and coordinates from the XYZ file.

    Parameters:
    output_file_path (str): The path to the input PDB file.

    Returns:
    numpy.ndarray: A 2D numpy array where each row represents an atom and the columns represent the x, y, and z coordinates.
    """
    from openbabel import openbabel
    obConversion = openbabel.OBConversion()
    obConversion.SetInAndOutFormats("pdb", "xyz") #Openbabel conversion to xyz

    mol = openbabel.OBMol()
    obConversion.ReadFile(mol, output_file_path) 
    obConversion.WriteFile(mol, 'packbox.xyz')
    xyz = open('packbox.xyz')
    N = int(xyz.readline())
    header = xyz.readline()
    atom_symbol, coords = ([] for i in range (2))
    for line in xyz:
        atom,x,y,z = line.split()
        atom_symbol.append(atom)
        coords.append([float(x),float(y),float(z)])
    coords_arr = np.array(coords) #Extract coords

    return coords_arr

# Apply to highly solvated packmol output
# Takes a while to run because of openbabel conversion
arr = get_coords_packmol(output_file_path)

print(arr.shape)

# Construct the output topology - picking back up from pack_box()
added_molecules = []
molecules = [Molecule.from_file('PEG.pdb', file_format='pdb'), water, na, cl] #Must be in the same order as the packmol file
number_of_copies_top = [1] + number_of_copies
solute = topology
for mol, n in zip(molecules, number_of_copies_top):
    added_molecules.extend([mol] * n)
topology = Topology.from_molecules(added_molecules)

# Set the positions
topology.set_positions(arr * unit.angstrom)

I have a full notebook that reproduces the error using a PEG topology. Attempting to attach here: packmol_sol.ipynb.zip

Is this extension of the pack_box function a feasible addition to the wrapper? Let me know if you need any more info

Thanks!

mattwthompson commented 1 month ago

In #985 I went with an OpenMM-based solution (if/after RDKit and MDTraj fail) which should work the same in general

Testing with a larger system, though, makes me think it's better to just parse the coordinate columns directly. Each parser spends a lot of time looking as stuff we don't care about here (atom/element information, residue/chain metadata, CONECT records, etc.)

hannaomi commented 1 month ago

Many thanks, Matt. I agree, that approach has worked well for me so far

mattwthompson commented 1 month ago

Great, I'm happy with how #987 has looked and I'm aiming for that to be in the next release (later this week or early next)