michellab / Sire

Sire Molecular Simulations Framework
http://siremol.org
GNU General Public License v3.0
95 stars 26 forks source link

Tracking down `invalid_key` error in Waterswap from `zmatrixmaker.applyTemplates` #428

Closed ljmartin closed 1 year ago

ljmartin commented 1 year ago

Hi all - thanks for making this great software available.

I'd like to try out waterswap, and have installed it in a container using the instructions here. It seems to run happily on example inputs given in a tutorial (here), but on my own inputs, generated via OpenMM and OpenMMForceFields, it can't generate a Z matrix for something.

I guess this is because my residues don't follow the naming scheme for the file amber.zmatrices. However, I can't figure out which residue is incorrect. The error is:

Traceback (most recent call last):
  File "/home/openbiosim/.conda/envs/openbiosim/share/Sire/scripts/waterswap.py", line 205, in <module>
    WSRC.run(params)
  File "/home/openbiosim/.conda/envs/openbiosim/lib/python3.9/site-packages/sire/legacy/Tools/__init__.py", line 175, in inner
    retval = func()
  File "/home/openbiosim/.conda/envs/openbiosim/lib/python3.9/site-packages/sire/legacy/Tools/WSRC.py", line 4167, in run
    (wsrc_system, wsrc_moves) = loadWSRC()
  File "/home/openbiosim/.conda/envs/openbiosim/lib/python3.9/site-packages/sire/legacy/Tools/WSRC.py", line 3905, in loadWSRC
    proteinsys = addFlexibility(
  File "/home/openbiosim/.conda/envs/openbiosim/lib/python3.9/site-packages/sire/legacy/Tools/AmberLoader.py", line 655, in addFlexibility
    protein_mol = zmat_maker.applyTemplates(protein_mol)
KeyError: 'SireError::invalid_key: H (call sire.error.get_last_error_details() for more info)'

I can guess is that there's an atom or a residue called H that shouldn't be. There aren't any residues called H (I checked with parmed), and atom H is, I think, a backbone amide hydrogen atom so shouldn't be a problem ordinarily.

I can reach this stage in an IPython shell, but I can't call sire.error.get_last_error_details() for more info because Waterswap runs within the old API and this: sire.error. must be in the new API, since it complains about trying to run modules from two different APIs.

Zmatrixmaker is in C++ so I can't 'debug' further. Is there any other way I can track down what the residue is? Also, what forcefield is amber.zmatrices? TY!

chryswoods commented 1 year ago

Hi - thanks for your kind words :-)

The amber.zmatrices file is a z-matrix file. This file doesn't specify a forcefield. Instead, it tells waterswap how to perform the Monte Carlo moves that are used to sample the system.

Specifically, it tells waterswap how to perform the Monte Carlo moves on the protein molecules. (it doesn't need to know how to move the waters, as these are only translated and rotated rigidly, and it uses an algorithm to automatically generate the Monte Carlo moves for small molecules like ligands or cofactors - these algorithms don't work for proteins).

Every protein residue is divided into a backbone (containing bbatoms) and sidechain (containing scatoms). The sidechains are the same regardless of the position of the residue in the protein (e.g. C-terminus, N-terminus or middle).

The backbone atoms - which are the C, N, O and connected hydrogens of the backbone - are specified as "chain" templates, e.g. chain aamiddle describes how general amino acid backbone atoms would move for residues in the middle of the protein, while chain aanterm does the same for backbone atoms in the residue at the N-terminus.

The residue templates, e.g. residue ALA say how to move the sidechain atoms of the residues. For example, residue ALA gives the names of the chain templates used for its backbone atoms at different chain positions, then it gives the z-matrix that defines how each of the sidechain atoms are built and moved from those backbone atoms.

The error message shows that an atom called H exists in one of your protein residues which isn't expected based on the atoms listed for that residues residue and chain (backbone) template.

Typically I have seen these errors if the protein residues don't use standard (PDB-style) atom names, or if the groups on the C- or N- terminals of the protein aren't specified in the amber.zmatrices file.

The question to answer are;

  1. What are the terminal residues of your protein? Only NME, ACE or standard amino acids are current specified in the amber.zmatrices file

  2. How are the hydrogen atoms named in your protein? You need to match the names in the templates, e.g.

residue ALA
rigidbody rotate 2.000 translate 0.050
bbatom N
bbatom CA
bbatom O
bbatom C
backbone first aanterm middle aamiddle last aacterm single aasingle
zmatrix CB CA N C
zmatrix HB1 CB CA N
zmatrix HB2 CB CA HB1
zmatrix HB3 CB CA HB1
angle CB CA N flex 0.50
angle HB1 CB CA flex 0.50
angle HB2 CB CA flex 0.50
angle HB3 CB CA flex 0.50
dihedral HB1 CB CA N flex 10.00

shows that the sidechain hydrogens in ALA need to be called HB1, HB2 and HB3.

For terminal residues, the naming also needs to be precise, e.g. here's the templates for NME

chain nme
zmatrix HH31 CH3 N H
angle HH31 CH3 N flex 0.50
dihedral HH31 CH3 N H flex 15.0

residue NME
rigidbody rotate 2.000 translate 0.050
bbatom H
bbatom N
bbatom CH3
bbatom HH31
backbone first nme middle nme last nme single nme
zmatrix HH32 CH3 N HH31
zmatrix HH33 CH3 N HH31
angle HH32 CH3 N flex 0.50
angle HH33 CH3 N flex 0.50

showing that the hydrogens need to be called H, HH31, HH32 and HH33.

I'm sorry that waterswap isn't more helpful in its error message. Unfortunately there won't be any more useful details, even if get_last_error_details() worked for the old API. This is because those details give a backtrace and thread information, which historically confused more people than they helped (hence why they are now hidden). The backtrace won't give any more info about which residue has the wrong atom.

The way to debug this is to look for the H atoms in your protein and make sure that they appear in the chain and residue template for the associated residue. Start with the terminal residues first, as these are normally the source of the problem.

I will say though that waterswap is quite an old program (2013). It is based on Monte Carlo and so can't be GPU-acclerated. It will likely be a lot slower than a GPU-accelerated MD code and free energy method, and the free energies will not be as high a quality as a more modern metadynamics or other biased dynamics method (e.g. you may have more success with alchemical transfer methods, e.g. https://github.com/Gallicchio-Lab/AToM-OpenMM).

ljmartin commented 1 year ago

Hi Chris, Thanks so much for that response! You read my mind a bit - watching the development of AToM-OpenMM reminded me of its Monte Carlo cousin, Waterswap. While WS is CPU-bound, I'm still very interested to compare.

I think you'll be right about the terminal residues. OpenMM can run this, apparently using ff14SB:

app.ForceField('amber/protein.ff14SB.xml', 'amber14/tip3p.xml')

which is why I asked about forcefields - perhaps the different AMBER forcefields have different naming conventions? But then again I think OMM uses some kind of graph matching to get around differences in naming conventions.

This isn't a bug so I'll close it shortly and post an update once I track down the issue. In the mean time: are you aware of an exemplary pipeline to prep protein+ligand complexes for waterswap using PDBFixer + GAFF? Otherwise, I will wrestle with how to use tleap :)

ljmartin commented 1 year ago

OK tleap got me over the line. In case it helps anyone, the code is below.

I think the issue was a chain gap - apparently still there's no robust way to handle this so it's usually done manually. I added a TER after the chain gap in my protein system - IN EMACS Then:

reduce -BUILD prot.pdb > prot.reduce.pdb
pdb4amber -i prot.reduce.pdb -a -o prot.amb.pdb

antechamber -i lig.pdb -fi pdb -o ligand.mol2 -fo mol2 -c bcc -nc 1

parmchk2 -i ligand.mol2 -f mol2 -o ligand.frcmod

tleap -f tleap.in

with tleap.in:

#load forcefields
source leaprc.protein.ff14SB
source leaprc.gaff
source leaprc.water.tip3p

#load solutes
loadamberparams ligand.frcmod
protein = loadpdb prot.amb.pdb
ligand = loadmol2 ligand.mol2

# Combine them
complex = combine {protein ligand}

solvateoct complex TIP3PBOX 12.0

addIons complex Cl- 0

# Check for issues and save files
check complex
saveamberparm complex tleapcomplex.prmtop tleapcomplex.inpcrd

quit

et voila:

import sire
sire.use_old_api()

from Sire.Tools import WSRC
from Sire.Tools import readParams

params = {}
params["protein crdfile"] = './tleapcomplex.inpcrd'
params['protein topfile'] = './tleapcomplex.prmtop'
params['ligand name'] = 'LIG'
params["fast simulation"] = True
params['nmoves'] = 100
WSRC.run(params)
chryswoods commented 1 year ago

Thanks - and thank for posting the solution for others. Feel free to get back in touch if you have any other problems :-)