Specify ions during system setup

cespos commented 2 years ago

Is your feature request related to a problem? Please describe. Hi! I could not find the option to specify the ions to use during system preparation, what in tleap would be addIons2 or pname/nname for gmx genion... Is there already a feature or would it be possible to add it?

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

jmichel80 commented 2 years ago

Currently BioSimSpace.Solvent.solvate will solvate water box with monoatomic (Na+ , Cl-) ions using parameter sets consistent with the water model chosen during parameterisation. See https://biosimspace.org/api/index_Solvent.html Does your use case require a different background concentration of ions? You could also try parameterising ions as 'structural ions' during the parameterisation state (https://biosimspace.org/api/index_Parameters.html).

cespos commented 2 years ago

Thanks for your reply.

Yeah, sometimes I would need to use Mg2+... But it's okay if it is not possible to have this feature. One can always prepare the system "manually" and use BSS starting from coordinate and parameter files.

About the structural ions, I have tried to prepare a system containing a structural calcium ion but I would always get an error. I will try to reproduce the error and post it in a separate issue.

lohedges commented 2 years ago

Hi there, as @jmichel80 says we currently use monatomic ions for solvation and allow the user to specify consistent ion parameters when parameterising in the presence of structural ions. Please let us know the issue you are seeing with a structural calcium. I'd be happy to debug.

cespos commented 2 years ago

Hi @lohedges, Hi @jmichel80

I was able to reproduce the error with the calcium ion. See the figure below. I am also attaching here the test pdb file so that you are able to reproduce the error.

Instead, if I use tleap directly with the following configuration file, I get no error and the calcium ion is correctly parametrized.

source leaprc.protein.ff14SB
set default PBRadii mbondi3
source leaprc.water.tip3p

x = loadpdb 1oxr_prot_prep3.pdb

saveamberparm x tmp.prmtop tmp.inpcrd
quit

I thought there was a problem when loading the water model, so I tried loading the water manually with leap_commands but I also got an error

PS. In the current amber parameter functions, are PBRadii adjusted for the different Amber force fields, e.g. set default PBRadii mbondi3 for ff14SB or set default PBRadii mbondi2 for ff99SB?

1oxr_prot_prep3.txt

lohedges commented 2 years ago

Hi there,

It appears that it works if you use the original PDB file, but not the one written by BioSimSpace (using its internal PDB parser) during the parameterisation. Looking at the files, the difference is in the TER record for ATOM 1731 before the Ca ion. Although this is being parsed correctly on read, it isn't preserved properly on write, leading to the error by LEaP. To see this you can do:

import BioSimSpace as BSS

# Load the molecule.
m = BSS.IO.readMolecules("1oxr_prot_prep3.pdb")[0]

# Create a parameterisation process.
p = BSS.Parameters.ff14SB(m, water_model="tip3p", work_dir="tmp")

# Try to get the parameterised moelcule.
par_m = p.getMolecule()

...
ParameterisationError: Parameterisation failed! Last error: 'tLEaP failed!'

Now, if you move the the tmp directory you can see all intermediate files and a README.txt with the commands that were run. Copying your original PDB file over the intermediate leap.pdb (written by BSS) and re-running works:

cd tmp
cp ../1oxr_prot_prep3.pdb
tleap -f leap.txt

I'll try to figure out why the TER record isn't being preserved and will post an update. (I haven't seen this before and have dealt with multi-chain PDBs previously without issue.)

PS. In the current amber parameter functions, are PBRadii adjusted for the different Amber force fields, e.g. set default PBRadii mbondi3 for ff14SB or set default PBRadii mbondi2 for ff99SB?

No, currently not, so assume that the default is used in all cases. Let me know what the preferable behaviour is and I can update accordingly. I'm not sure if this would need to be coupled appropriately to AMBER config parameters at runtime, although this is tricky to do in general if the AMBER files were generated elsewhere.

Cheers.

lohedges commented 2 years ago

Okay, I've figured it out. The issue is because the calcium ion has the same chain identifier, despite coming after the TER record in the PDB. When the PDB is parsed into an internal Sire molecule we rely on the chain identifiers to determine which atoms are part of a chain, and which atom is the terminal atom. In this case, since the CA atom (number 1732) is the last atom with chain identifier A, then it is assumed to be the terminal atom in the chain and the TER is placed after it. Simply deleting the A identifier for the CA makes things work correctly.

I imagine that this is a common issue, and also assume that ions of this type will likely be labeled as HETATM rather than ATOM. I'll try to figure out what is standard so that we can come up with a workaround. (If I recall, HETATM are usually placed at the end, rather than the "position" at which they occur in the residue. I've previously had issues where the numbering is messed up by the presence of HETATM records, i.e. it is non sequential.)

lohedges commented 2 years ago

I've made a few tweaks to preserve HETATM records and make sure that the TER entry has the correct number. (This doesn't solve this issue, but was an error that I spotted.) Looking at the PDB standard for TER records here, it sounds (to my reading) like the TER record should come after the chain (any ATOM/HETATM block in a chain), so should be after the final HETATM, as we write. Given the level of abuse to the PDB format, I would not be at all surprised if this was seldom upheld. I guess the important thing is what LEaP expects.

I'll take a look at some files from the PDB to see how variable the formatting is.

cespos commented 2 years ago

Hi @lohedges, sorry for taking so long to reply. Usually the TER indicates the terminal protein residue. So, if you have multiple chains, you will have a TER at the end of each protein chain. These TER will instruct simulation software to not expect the terminal residue to be connected with what comes afterwards. In PDB files for simulations, all HETATM (ligands, ions, water molecules) are usually written after the protein atoms, still in the chain order HETATM A, HETATM B,.... You can have a TER after HETATM blocks but I do not think that will be relevant/read at all by the simulation software.

cespos commented 2 years ago

About the PBRadii parameter, it was more of a curiosity. This is only relevant when a Generalized Born model is used (igb != 0 in amber). I think that expert users will manually set this up anyway and they should be able to do it with leap_commands.

lohedges commented 2 years ago

Thanks for the comments. With regards to the TER / HETATM, I've looked at some of our example files and a few from the PDB and annoyingly the formatting is fairly inconsistent. I'll try to follow the guideline that you suggest, i.e. placing HETATM records after the TER. At present we take a Sire molecular structure and insert TERs after the last atom in a chain. I'll just need to update the logic to walk backwards up the chain whenever the last atom is a HETATM record.

Cheers.

lohedges commented 1 year ago

I've looked more at how HETATM appear in files from the PDB and there seems to be no standard, so I don't think our modified approach is valid, since it will fail in many cases. In some cases the records come after the terminal atom and TER record for a protein chain, with the HETATM entries having the chain identifier, in other cases they come after the ATOM records, but before the TER, in other cases they are completely interspersed with the ATOM records within a chain. I think these differences will be impossible to handle in a consistent way.

I guess the important thing is what LEaP expects, but I don't know how to find this information (without looking at the source code).

lohedges commented 1 year ago

For example, in this there a HETAMs with the same chain identifier before and after the TER. Some examples of the different formatting:

HETATM in chain B before and after TER, followed by HETATMs from chain A.

...
ATOM   2097 HD11 ILE B  36      -7.894   6.751 -22.957  1.00 52.74           H
ATOM   2098 HD12 ILE B  36      -8.945   7.001 -24.122  1.00 52.74           H
ATOM   2099 HD13 ILE B  36      -8.598   5.518 -23.670  1.00 52.74           H
HETATM 2100  N   NH2 B  37      -7.355   7.417 -29.288  1.00 58.31           N
TER    2101      NH2 B  37
HETATM 2102 ZN    ZN B 101       0.000   0.000  -9.201  0.33 15.72          ZN
HETATM 2103  O  AHOH A 201     -30.782  29.811 -17.433  0.50 20.93           O
HETATM 2104  O  BHOH A 201     -30.377  31.224 -16.358  0.50 18.33           O
HETATM 2105  O   HOH A 202     -10.750  28.703 -23.497  1.00 39.82           O
...

ATOM and HETATM interspersed within the same chain.


...
HETATM 2006  HEABXCP B  31      -6.322  12.783 -15.760  0.37 16.94           H
HETATM 2007  HA AXCP B  31      -6.311  10.105 -16.572  0.63 23.43           H
HETATM 2008  HA BXCP B  31      -5.758  10.612 -16.628  0.37 19.64           H
ATOM   2009  N  AHIS B  32      -5.542  10.707 -18.873  0.63 18.98           N
ANISOU 2009  N  AHIS B  32     1967   2279   2965    480   -109    265       N
ATOM   2010  N  BHIS B  32      -5.238  10.930 -18.956  0.37 18.62           N
ANISOU 2010  N  BHIS B  32     1887   2264   2926    494    -76    294       N
ATOM   2011  CA AHIS B  32      -4.988  11.199 -20.158  0.63 21.40           C
...

lohedges commented 1 year ago

For what it's worth, changing the chain identifier of the HETATM from A to B, i.e.

ATOM   1730 HD22 ASN A 120      27.964   7.481  -4.330  1.00 40.77           H
TER    1731      ASN A 120
HETATM 1732 CA    CA A 201      16.356  15.528   0.641  1.00  3.39          CA2+

to

ATOM   1730 HD22 ASN A 120      27.964   7.481  -4.330  1.00 40.77           H
TER    1731      ASN A 120
HETATM 1732 CA    CA B 201      16.356  15.528   0.641  1.00  3.39          CA2+

gives an indentical topology file output from tLEaP, and allows the file to be parsed correctly using the old Sire PDB parser.

I'll have a think about how best to proceed. I imagine a lot of other PDB parsers simply read/write the data as is. In our case, we are parsing into an internal molecular representation, where atoms are added to residues, which are reparented to chains, which are reparented to molecules. On write, we are reconverting this structure back to PDB records, so the chain identifiers are essential, and the TER can only be identified by the position of the last atom in the chain. I can't think of a way that we could round-trip such a funky ATOM/HETAM layout as the example shown above without using some additional data structures specific to PDB format, i.e. mapping the Sire molecule back to the PDB somehow. However, this would be lost on write/read to other formats, so isn't really a good solution.

cespos commented 1 year ago

Thanks @lohedges for looking into this. You are right, unfortunately there is not a unique way to write PDB files. With I colleague of mine we also saw that some system preparation tools assign the same residue number to crystallographic water molecules which are then followed by a TER. The parsing will remove the TER if the water molecules belong to the same chain and tleap will fail (see figures below). A solution here would be either to not remove the TER or to make sure that water molecules have consecutive residue IDs.

Anyway, is there a way to disactivate BSS PDB preprocessing?

lohedges commented 1 year ago

Hi @cespos, I have been working on some improvements to our PDB parser, particularly in terms of the handling of HETATM and TER records. I am currently testing my edits and a discussion thread is here.

Anyway, is there a way to disactivate BSS PDB preprocessing?

This is something that has come up in discussion, i.e. should we just assume a valid PDB file as input and pass this directly to the parameterisation engine, e.g. tLEaP. This would likely avoid a lot of these issues, but the user would no longer be able to cross-reference later simulation data with the original PDB, e.g. if they want to check some property of a chain with a specific identifier.

With regards to your specific issue, I think my most recent edits might solve this. Once I've finished my own testing, I'd be happy to try it using your file if you are able to share.

Cheers.

michellab / BioSimSpace

Specify ions during system setup #369