michellab / BioSimSpace

Code and resources for the EPSRC BioSimSpace project.
https://biosimspace.org
GNU General Public License v3.0
77 stars 19 forks source link

Issues with using Amber files in a GROMACS simulation #279

Closed mattaq31 closed 2 years ago

mattaq31 commented 2 years ago

Hi,

I'm attempting to use BioSimSpace to simulate an amber system which I obtained from this paper (supplementary files -> ligand 3K, complex) using GROMACS. To do this, I ran the following set of commands:

import BioSimSpace as BSS

system = BSS.IO.readMolecules(BSS.IO.glob("../ligand_3k_start/SYSTEM*")) # reading all SYSTEM files from paper (.pdb, .crd, .top)

protocol = BSS.Protocol.Minimisation(steps=1000) # the type of process used doesn't seem to make a difference

process = BSS.Process.Gromacs(system, protocol) # errors here

When I attempt to create the process (last line above), the system errors out with the following output:

RuntimeError: Unable to generate GROMACS binary run input file.

'gmx grompp' reported the following errors:
ERROR 1 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-6) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 2 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-13) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 3 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-23) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 4 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-32) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 5 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-35) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 6 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-41) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 7 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-48) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 8 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-56) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 9 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-68) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 10 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-70) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 11 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-72) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 12 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-80) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 13 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-86) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 14 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-91) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 15 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-113) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 16 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-115) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 17 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-126) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 18 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-129) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 19 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-135) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 20 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-140) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 21 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-148) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 22 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-150) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 23 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-159) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 24 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-177) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 25 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-178) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 26 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-183) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 27 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-191) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 28 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-206) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 29 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-214) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 30 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-221) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 31 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-227) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 32 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-238) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 33 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-241) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 34 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-242) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 35 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-275) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 36 [file gromacs.top, line 46974]:
  virtual site HG1 (Res THR-280) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 37 [file gromacs.top, line 46974]:
  virtual site HG (Res SER-287) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 38 [file gromacs.top, line 46974]:
  virtual site HH (Res TYR-288) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 39 [file gromacs.top, line 46974]:
  virtual site H1 (Res WAT-1) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.
  ERROR 40 [file gromacs.top, line 46974]:
  virtual site H2 (Res WAT-1) has non-zero mass 1.008 (state A) / 1.008
  (state B)
  Check your topology.

'gmx grompp' reported the following warnings:
WARNING 1 [file gromacs.top, line 46974]:
  Molecule type 'WAT' has 3 constraints.
  For stability and efficiency there should not be more constraints than
  internal number of degrees of freedom: 0.

Use 'ignore_warnings' to ignore warnings.

I have also tried other ligands in the paper, which resulted in similar errors as above. Outputting the simulation files directly using BSS.IO.saveMolecules("thrombin_system", system, ["Gro87", "GroTop"]) and subsequently attempting to run grompp from command line also results in the same errors.

Is this a bug or is there something wrong with my procedure/setup? Thanks for your help!

Context

lohedges commented 2 years ago

Looking at the supplementary files in jp6b03296_si_002.zip I think this is because you are reading AMBER format topologies that are intended to be used for RBFE simulations using SOMD. As such, these will contain massed dummy atoms that are part of the alchemical transformation. Atoms of this type are not allowed by GROMACS for regular (non-FEP) simulations, hence the error.

If you want to reproduce the simulations using GROMACS, then you will need the original input topologies for the ligands and protein and the mapping that was used for the perturbation, i.e. the files that were used to create the input for SOMD, rather than the input to SOMD itself. With this you will be able to construct GROMACS format FEP topologies that should work as expected.

(The link that you provided referenced supporting information rather than supplementary files and I didn't see anything with a file path that matches the one used in your code snippet within the archive.)

Thanks for the clear issue and additional context information. Much appreciated.

jmichel80 commented 2 years ago

Although it is odd that the warnings are for protein atoms. We did this work several years ago, I would advise setting up the systems from scratch if you want to replicate the findings of the study. You could extract the protein & ligand coordinates so you have a full all-atom model and use BioSimSpace to parameterise, solvate, equilibrate and generate FEP inputs.

lohedges commented 2 years ago

Yes, I agree that it's confusing. This could be due to loading all the files in the directory, i.e. the PDB, CRD, and top, so information from the PDB file may have overwritten values from the TOP file once the system was set up. Sire simply works out which files can be used to generate the molecular system (lead parse) then adds any extra information from the other parsers (followers) as long as the topology is consistent.

mattaq31 commented 2 years ago

Thanks both for your replies.

I see, then it might be best to simply begin from the protein and ligand positions as you've said then solvate from scratch.

Re. file path, apologies for not being more specific, the set of files I was referring to was: URL: https://pubs.acs.org/doi/suppl/10.1021/acs.jpcb.6b03296/suppl_file/jp6b03296_si_002.zip Internal path: Systems/3Series/3K~3B/Complex/core

Thanks again!

ppxasjsm commented 2 years ago

All the above makes sense. Maybe using this Thrombin dataset might be easier: https://github.com/openforcefield/protein-ligand-benchmark/tree/master/data/2019-09-23_thrombin

jmichel80 commented 2 years ago

I think that dataset is different from the one we studied with Gaetano, but it wouldn't be too hard to add more ligands if you want to study that particular series. Note that there is a flexible loop with a Tryptophan residue near the binding site that was annoying to model at least when we looked at it.

mattaq31 commented 2 years ago

Hi all, not sure if I'm doing something wrong, but I tried to parametrise the thrombin files using BioSimSpace, with no luck:

Thanks for your help and apologies for the confusion! I've uploaded the thrombin + ligand file here: base_protein_ligand.zip

lohedges commented 2 years ago

Hi there,

Thanks for reporting. In order to extract the parameterisation errors you will need to execute the parameterisation in a specified working directory. All intermediate files will be written here, along with a README containing the list of commands that were run in order. For example:

import BioSimSpace as BSS

# Load your combined protein + ligand PDB file.
system =  BSS.IO.readMolecules("base_protein_ligand.pdb")

# Extract the constituents for convenience.
protein = system[0]
ligand = system[1]

# Try to parameterise the protein.
protein = BSS.Parameters.ff99SB(protein, work_dir="protein_param").getMolecule()

This errors with:

ParameterisationError: Parameterisation failed! Last error: 'tLEaP failed!'

We can now look in the protein_param directory:

ls protein_param
leap.err  leap.log  leap.out  leap.pdb  leap.top  leap.txt  README.txt

The README.txt has the command that was run:

cat protein_param/README.txt
# tLEaP was run with the following command:
/home/lester/sire.app/bin/tleap -f leap.txt

Looking at the tleap log file:

cat protein_param/leap.log
...
/home/lester/sire.app/bin/teLeap: Warning!
The unperturbed charge of the unit (6.000000) is not zero.
FATAL:  Atom .R<LIG 1>.A<C1 1> does not have a type.
FATAL:  Atom .R<LIG 1>.A<N1 2> does not have a type.
FATAL:  Atom .R<LIG 1>.A<O1 3> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C2 4> does not have a type.
FATAL:  Atom .R<LIG 1>.A<N2 5> does not have a type.
FATAL:  Atom .R<LIG 1>.A<O2 6> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C3 7> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C4 8> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C5 9> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C6 10> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C7 11> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C8 12> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C9 13> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C10 14> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C11 15> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C12 16> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C13 17> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C14 18> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C15 19> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C16 20> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C17 21> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C18 22> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C19 23> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C20 24> does not have a type.
FATAL:  Atom .R<LIG 1>.A<C21 25> does not have a type.
FATAL:  Atom .R<LIG 1>.A<CL1 26> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H1 27> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H2 28> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H3 29> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H4 30> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H5 31> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H6 32> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H7 33> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H8 34> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H9 35> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H10 36> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H11 37> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H12 38> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H13 39> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H14 40> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H15 41> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H16 42> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H17 43> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H18 44> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H19 45> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H20 46> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H21 47> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H22 48> does not have a type.
FATAL:  Atom .R<LIG 1>.A<H23 49> does not have a type.
FATAL:  Atom .R<ILE 2>.A<H1 20> does not have a type.
FATAL:  Atom .R<ILE 2>.A<H2 21> does not have a type.
FATAL:  Atom .R<ILE 2>.A<H3 22> does not have a type.

/home/lester/sire.app/bin/teLeap: Fatal Error!
Failed to generate parameters

Exiting LEaP: Errors = 1; Warnings = 4; Notes = 0.

This is because your PDB file contains atom names that don't match the required tleap templates. You will need to pre-process the file or rename them manually.

Similar errors occur when trying to parameterise the ligand.

We have crude support for pre-processing PDB files with the pd4amber tool prior to reading, but this still requires somewhat standard formatting. If you split your PDB into separate protein and ligand files you could try doing:

protein = BSS.IO.readPDB("protein.pdb", pdb4amber=True)

This might clean up some of the janky formatting. Alternatively, you could run the tool yourself from the command-line so you can access all of its options.

The reason we don't try too hard to clean files ourselves is that it requires a lot of system specific knowledge and it is very hard to generalise. We don't want to make invalid assumptions, so assume that the user best knows what their files should represent.

I hope this helps.

mattaq31 commented 2 years ago

Thanks @lohedges for the detailed reply!

That debugging step with 'workdir' worked great, I now have a much better view at what's going on. I'll try the pdb4amber tool that you've described and see how I get on.

There are a couple of things I am still not 100% on:

Thanks again for your help!

Screenshot 2022-02-15 at 15 54 54

lohedges commented 2 years ago

No problem. It's one of things that's hard to design perfectly with BioSImSpace, i.e. keeping things as simple as possible while still allowing users to access the full set of intermediate files and commands for debugging. It's also hard to give good generic high-level error messages since they are so varied and often require a lot of technical understanding to parse.

Typically, PDB files should only be used for single molecule topologies, i.e. as in the original spec. However, many people combine molecules with a loose standard being a single standalone TER separator between molecules. (This is often used for waters, for example.) This is how we split multi-molecule PDB files in BioSimSpace. Note that TER records normally are used to indicate the end of molecular chains, but these have slightly different formatting, i.e. not just TER by itself. BioSImSpace will still treat these records correctly. If you format your PDB file in this way, then BioSImSpace should split it as you intend. I generally recommend keeping separate PDB files for each molecule that you are interested in. It's usually more flexible and it's easier to share with coleagues, etc., since that's what the PDB is designed for.

If the parametrization works well, how would you recombine the protein and ligand together again to create a single system for solvation by BSS?

BioSimSpace has powerful functionality for combining molecules in a Pythonic way. For example, once you have parameterised protein and ligand molecules you can just do:

molecules = protein + ligand

This will give you a Molecules container, which can be converted to a System object by using the .toSystem() method, i.e.

complex = molecules.toSystem()

# Alternatively, in one go.
complex = (protein + ligand).toSystem()

(Most functions in BioSimSpace can handle Molecules in place of a System, with the conversion taking place internally.)

lohedges commented 2 years ago

For reference, PDB is one of the oldest and most abused file formats, so many input files are poorly formatted or have lots of quirks and inconsistencies. Despite it's relative simplicity, it's quite hard to parse reliably because of this. (Well, parsing is okay, but actually inferring what the user intended isn't.)

mattaq31 commented 2 years ago

Hi @lohedges just wanted to thank you again for your help here - your explanations have been very helpful for me to understand what's going on in BSS under the hood.

Given that the problems I've had are more related to issues with the input files rather than BSS itself, can this Github issue now be closed?

lohedges commented 2 years ago

No problem, I'm glad the information was of help. As you have found, a lot of stuff is going on behind the scenes. Once you get used to things you can normally dig down to get a better insight as to what might have gone wrong.

Yes, please feel free to close if you are satisfied that things are resolved.

Cheers.