michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Parameterising molecules with chains #93

Closed lohedges closed 3 years ago

lohedges commented 5 years ago

Consider the following protein. This has a single molecule with 13932 atoms and two chains. The BioSimSpace PDB parser can read it just fine:

import BioSimSpace as BSS

# Load the system.
s = BSS.IO.readMolecules("protein.pdb")
s.nMolecules()
1

# Get the only molecule.
m = s.getMolecules()[0]
m.nAtoms()
13932
m.nChains()
2

# Check that we preserve the chain info on write.
BSS.IO.saveMolecules("tmp", s, "pdb")
m = BSS.IO.readMolecules("tmp.pdb").getMolecules()[0]
m.nAtoms()
13932
m.nChains()
2

Now let's try to parameterise the molecule:

import BioSImSpace as BSS

# Get the only molecule in the system.
m = BSS.IO.readMolecules("protein.pdb").getMolecules()[0]

# Create a background process to parameterise the molecule.
p = BSS.Parameters.ff14SB(m, work_dir="tmp")

# Try to get the parameterised molecule. This fails with an IncompatibleError
mm = p.getMolecule()

# Load the tLEaP output directly into a new system.
s = BSS.IO.readMolecules(["tmp/leap.top", "tmp/leap.crd"])

# How many molecules do we have? The chains have been split into separate
# molecules.
s.nMolecules()
2

# Get the two molecules.
m0 = s.getMolecules()[0]
m1 = s.getMolecules()[1]

# How many atoms? Looks like there are four more in total,
# but it's basically the two chains.
m0.nAtoms() + m1.nAtoms()
13936

How do we preserve the chain information? I've looked at the code for Sire.IO.AmberPrm and I only see mention of treechains. Does the AMBER format have a concept of PDB style chains in its topology? (I've found some discussion here regarding mapping back to a PDB file from an AMBER trajectory.)

Unfortunately it's not as simple as stripping the TER records from the original PDB file, since the parser can detect separate chains by name too. I could probably modify the code to by able to run in a "no-chain" mode if needed, where it ignores the chain identifiers. I've tried doing this manually, but tLEaP still breaks the parameterised system into multiple molecules (albeit with different atom numbers) so it's obviously accounting for the missing information somehow with its template matching.

Instead I could try to modify the algorithm that matches atoms from the tLEaP generated output to those in the original molecule. However, this expects that you are comparing two molecules, rather than a molecule (the original) and a system of molecules (what tLEaP is producing). Would this even work when we are trying to map across properties that assume a single molecule, e.g. bonds, angles, etc. I'm not sure that these can easily be split, or recombined if we need to write back to a different format.

lohedges commented 5 years ago

I've been contacted again by the user who is experiencing this issue. Any input would be much appreciated.

ppxasjsm commented 5 years ago

This may be a silly questions, but this isn't just tleap adding two atoms on each of the chains?

lohedges commented 5 years ago

I think you're correct in stating that tLEaP is capping the chains. The issue is that the single molecule two-chain system has been converted into a two molecule system, where each molecule is one of the chains from the original molecule, which makes it incompatible with the original molecule that was passed in.

lohedges commented 5 years ago

Does anyone have any input on this? I've been contacted by the user multiple times and I'm still not able to provide them with an answer.

Although I'm not sure that the original input is properly prepped, i.e. hydrogens missing, the main problem still holds. How do we go from a two-chain single molecule representation in a PDB file to a no chain two-molecule representation by tLEaP. This breaks our assumption of the original molecular representation being the true one, i.e. what's generated by tLEaP is incompatible with it (not just in terms of atom names, but topology).

jmichel80 commented 5 years ago

Hi Lester,

Essentially the problem is that we assume that the input contains one molecule but it contains actually two molecules. It is difficult to rely on chain identifiers as a lot of different molecules also have the same chain label (often X for solvent etc...) and this is not done consistently in pdb files.

A general solution could inspect the connectivity of every atom to figure out in which molecule they belong but that would depend on being able to correctly infer connectivity from the information available.

A fix for this use case may be to get the user to load each chain separately and combining parameterised systems later.


Dr. Julien Michel, Senior Lecturer Room 263, School of Chemistry University of Edinburgh David Brewster road Edinburgh, EH9 3FJ United Kingdom phone: +44 (0)131 650 4797 http://www.julienmichel.net/

On Fri, Sep 6, 2019 at 10:40 AM Lester Hedges notifications@github.com<mailto:notifications@github.com> wrote:

Does anyone have any input on this? I've been contacted by the user multiple times and I'm still not able to provide them with an answer.

Although I'm not sure that the original input is properly prepped, i.e. hydrogens missing, the main problem still holds. How do we go from a two-chain single molecule representation in a PDB file to a no chain two-molecule representation by tLEaP. This breaks our assumption of the original molecular representation being the true one, i.e. what's generated by tLEaP is incompatible with it (not just in terms of atom names, but topology).

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHubhttps://github.com/michellab/BioSimSpace/issues/93?email_source=notifications&email_token=ACZN3ZC7H2D4JMU3MIV7AILQIIQRPA5CNFSM4H5DQGM2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6CKQEY#issuecomment-528787475, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACZN3ZF47DL7XCVAEBNNABTQIIQRPANCNFSM4H5DQGMQ.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.