Open mattwthompson opened 1 month ago
@hannaomi
Hi Matt,
Thanks for raising the issue! I've found that any highly solvated systems with >100000 atoms both the RDkit PDB parser and the MDTraj fallback return the same error. I've managed to write a workaround for this by using openbabel to convert the packmol pdb output to xyz format, and then extracting the coordinates. See below function (uses the output file name as function input).
def get_coords_packmol(output_file_path):
"""
This function reads a PDB file, converts it to XYZ format using Open Babel,
and extracts the atomic symbols and coordinates from the XYZ file.
Parameters:
output_file_path (str): The path to the input PDB file.
Returns:
numpy.ndarray: A 2D numpy array where each row represents an atom and the columns represent the x, y, and z coordinates.
"""
from openbabel import openbabel
obConversion = openbabel.OBConversion()
obConversion.SetInAndOutFormats("pdb", "xyz") #Openbabel conversion to xyz
mol = openbabel.OBMol()
obConversion.ReadFile(mol, output_file_path)
obConversion.WriteFile(mol, 'packbox.xyz')
xyz = open('packbox.xyz')
N = int(xyz.readline())
header = xyz.readline()
atom_symbol, coords = ([] for i in range (2))
for line in xyz:
atom,x,y,z = line.split()
atom_symbol.append(atom)
coords.append([float(x),float(y),float(z)])
coords_arr = np.array(coords) #Extract coords
return coords_arr
# Apply to highly solvated packmol output
# Takes a while to run because of openbabel conversion
arr = get_coords_packmol(output_file_path)
print(arr.shape)
# Construct the output topology - picking back up from pack_box()
added_molecules = []
molecules = [Molecule.from_file('PEG.pdb', file_format='pdb'), water, na, cl] #Must be in the same order as the packmol file
number_of_copies_top = [1] + number_of_copies
solute = topology
for mol, n in zip(molecules, number_of_copies_top):
added_molecules.extend([mol] * n)
topology = Topology.from_molecules(added_molecules)
# Set the positions
topology.set_positions(arr * unit.angstrom)
I have a full notebook that reproduces the error using a PEG topology. Attempting to attach here: packmol_sol.ipynb.zip
Is this extension of the pack_box function a feasible addition to the wrapper? Let me know if you need any more info
Thanks!
In #985 I went with an OpenMM-based solution (if/after RDKit and MDTraj fail) which should work the same in general
Testing with a larger system, though, makes me think it's better to just parse the coordinate columns directly. Each parser spends a lot of time looking as stuff we don't care about here (atom/element information, residue/chain metadata, CONECT records, etc.)
Many thanks, Matt. I agree, that approach has worked well for me so far
Great, I'm happy with how #987 has looked and I'm aiming for that to be in the next release (later this week or early next)
I heard that the PACKMOL wrapper chokes when more than 99,999 atoms are present in the system. I can't reproduce it:
I suspect the issue is with MDTraj or RDKit's loading of the PDB file. There is already some code that attempts to deal with this: https://github.com/openforcefield/openff-interchange/blob/v0.3.26/openff/interchange/components/_packmol.py#L727-L751