openforcefield / openff-toolkit

The Open Forcefield Toolkit provides implementations of the SMIRNOFF format, parameterization engine, and other tools. Documentation available at http://open-forcefield-toolkit.readthedocs.io
http://openforcefield.org
MIT License
309 stars 90 forks source link

Problem creating topology for POPC using Interchange.from_smirnoff() #1929

Open dkchalmers opened 3 weeks ago

dkchalmers commented 3 weeks ago

Describe the bug

I am trying to create gromacs toppologies of lipid bilayers. However, using the function Interchange.from_smirnoff() fails when trying to make a topology for popc. This fails whether using a mol2 file, SMILES or IUPAC name as the input structure.

This process works correctly using cholesterol.

Error is: Warning: popc: Failed due to unspecified stereochemistry

I suspect the problem might be something to do with the phosphate group

To Reproduce

Run the attached script interchange2.ipynb I have attached popc.mol2 and cholesterol.mol2

Output

Warning: popc: Failed due to unspecified stereochemistry

Computing environment (please complete the following information): Linux opensuse 15.5

Additional context

Remove .txt to use...

cholesterol.mol2.txt popc.mol2.txt interchange2.ipynb.txt

j-wags commented 3 weeks ago

Hi @dkchalmers, I think this is actually working with both of your molecules, the issue with popc is just emitting a warning (not an error). We might see about squelching the warning, but I think your notebook is running successfully. Could you try using the outputs and let me know if there's an issue?

dkchalmers commented 2 weeks ago

Hi @j-wags. Ahh - thanks very much. I can see that it does indeed work. I was mislead by the warning and also that generating the topology for POPC takes 10 times as long as for cholesterol. I timed the two - cholesterol takes 35 seconds, popc takes 392 s.

j-wags commented 2 weeks ago

Hah, I saw a similar comment on related repo yesterday. Looks like it's already been deleted.

Thanks for the report. The long charge assignment time kinda makes sense since it's a large flexible molecule. We have a new AM1BCC charge assignment tool (NAGL) in the works that should be much faster, so hopefully the long parameterization time won't be a problem for much longer.

I'll close this issue since it seems like the POPC is being parameterized correctly, but please feel free to reopen it if you find further problems.

dkchalmers commented 2 weeks ago

Thanks for your reply - although I can generate a topology for a single POPC molecule, more than one seems to make things much slower - to the point that I cant generate a topology for a file containing 2 POPC molecules and 1 water.

The attached parameterise.py script works for a file containing 2 x cholesterol and 1 water

./paramaterise.py bilayer_c_solv.pdb Making Topology Reading Force Field Assigning Force Field Warning: Cannot perform Hydrogen sampling with GPU-Omega: GPU-Omega disabled. Writing output files Run time: 109.9 seconds

However the same thing with 2 x POPC and 1 water does not complete in > 2 hours

./paramaterise.py bilayer_p_solv.pdb Making Topology ....

cholesterol.mol2.txt bilayer_p_solv.pdb.txt bilayer_c_solv.pdb.txt popc.mol2.txt paramaterise.py.txt

j-wags commented 2 weeks ago

Thanks for the example, I've reopened the issue and reproduced what you're seeing. The problem here appears to actually be in the Topology.from_pdb line. I think the long alkane chains make the template matching take forever because of all the symmetries.

Whole we work on fixing that, a quick workaround to your issue is to use the openmm.app.PDBFile object and the Topology.from_openmm API point.

import openmm.app
pdb = openmm.app.PDBFile("bilayer_p_solv.pdb")
top = Topology.from_openmm(pdb.topology, unique_molecules=[CHL, POPC, TIP3])

The downside to the Topology.from_openmm workaround is that it won't recognize proteins (that is, it can recognize whole unique molecules, but not proteins made of amino acids). So if your final topology was going to have a mix of proteins and lipids, you'd need to load the proteins using Topology.from_pdb, the lipids using Topology.from_openmm, and then combine them all using Topology.add_molecules.

(And again, I consider this failure of Topology.from_pdb to be a bug, and we'll work to resolve that so you don't need to use Topology.from_openmm moving forward)

dkchalmers commented 2 weeks ago

Thanks - a fix to this problem would be very helpful!

I tried to read my bilayer model using openmm.app.PDBFile - however, this does not work (error below). Although the error says 'attribute bonds', it seems to me that the 'pdb' object is missing bond-order information.

Script: parameterise.py.txt

./parameterise.py bilayer_p_solv.pdb Reading components Making Topology Traceback (most recent call last): File "parameterise.py", line 36, in top = Topology.from_openmm(pdb, unique_molecules=[CHL, POPC, TIP3]) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/david/bin/miniforge3/envs/openff/lib/python3.11/site-packages/openff/utilities/utilities.py", line 80, in wrapper return function(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^ File "david/bin/miniforge3/envs/openff/lib/python3.11/site-packages/openff/toolkit/topology/topology.py", line 1410, in from_openmm for omm_bond in openmm_topology.bonds(): ^^^^^^^^^^^^^^^^^^^^^ AttributeError: 'PDBFile' object has no attribute 'bonds'

j-wags commented 2 weeks ago

Change

Topology.from_openmm(pdb, unique_molecules=[CHL, POPC, TIP3])

to

Topology.from_openmm(pdb.topology, unique_molecules=[CHL, POPC, TIP3])
dkchalmers commented 2 weeks ago

Thanks for your help - that is working.

I am also very interested in working with peptides and proteins with lipid bilayers - so a fix for the original issue would be great.

PEFrankel commented 2 weeks ago

The stereochemistry warning is common, but you likely don't need to worry about it here. Although I cannot see, but as you suspect, the issue arises from the phosphate, which we've run into as well. allow_undefined_stereo=True quenches the warning. In my experience making molecules and interchanges, SDFs work well with from_file(). from_pdb() can be a little finicky (with a few lipids at least iirc).

In case you find it useful, this notebook parametrizes both a single lipid and/or a full PDB system (128 lipids fully hydrated here) in 10-15 min (note: this example implements HMR). Adjusting the molecule count and box size don't change the parametrization duration. Changing the charge assignment to something other than AM1BCC would make it faster as Jeff noted.