marrink-lab / vermouth-martinize

Describe and apply transformation on molecular structures and topologies
Apache License 2.0
90 stars 40 forks source link

Missing Atom and PTM_Atom Assignment #383

Closed mbrosz closed 2 years ago

mbrosz commented 2 years ago

Dear martinize2-developer team,

my name is Matthias and I want to use the martinize2 python package to coarse-grain a protein based on the amber ff. For this purpose, I created two new folders in the "/vermouth/data/force_fields/" directory (1 for AA amber ff, 1 for the Martini 3 ff). Next I created a new folder "martini300C" in the "/vermouth/data/mappings/martini30/" directory which stores the mapping for each amino acid The amino acids are named sth like "xyz.amber.map". Within each text file I adjust the [molecule],[from],[to],[mapping].etc. sections Since in amber the protein is capped, I also adjusted the "modifications.mapping" file to introduce the caps ACE and NME.

After performing all these adjustments, I used the following command to martinize

martinize2 -f Pro.pdb -x Pro-CG.pdb -o system.top -merge A,B,C -sep -nter ACE -cter NME -from amber -ff martini300C From this, I obtain a nice coarse-grained structure for a protein. All good so far!

However, in the next step I wanted to scale-up and not only coarse-grain one single protein, but also coarse-grain a larger and more complex protein containing multiple small proteins (separated by TER in the PDB). On top, the larger protein contains some "special" amino acids (here I call them NAA), which I introduced like the caps before. (As a site note, I did not enter them in the "modifications.mapping" file.)

Next I used the command from above to martinize the larger protein and ran in the following issues:

1) Number of molecules after guessing bonds are not in agreement with molecules guessed from PDF file.

2) Two (I assume) interconnected errors/warnings.

2.1) INFO - step - Repairing the graph. INFO - general - Applying modification ACE to residue A-ACE182 INFO - general - Applying modification NME to residue A-NME415 INFO - missing-atom - Missing atom HYP184:N INFO - missing-atom - Missing atom HYP184:OD1 INFO - missing-atom - Missing atom GLY185:O INFO - missing-atom - Missing atom ALA186:O INFO - missing-atom - Missing atom ALA187:CB INFO - missing-atom - Missing atom GLY188:N INFO - missing-atom - Missing atom ALA189:CB ... this goes on for quiet a while.

The residues in C.pdb are "ACE-NAA-HYP-GLY-ALA-....-NME:. So may be this originates from the new amino acid I introduced: NAA ?

2.2) The next error I encounter may originat from 2.1:

INFO - general - Identified the modifications ['ACE'] on residues ['ACE182', 'ACE182', 'ACE182', 'NAA183'] WARNING - unknown-input - Could not identify the modifications for residues ['NAA183', 'HYP184'], involving atoms ['20-N']

KeyError: 'Could not identify PTM'

I debugged and realized that the error originates from the ISMAGS algorithm and the combination NAA / HYP.

Some background information first:

The pdb-file and ff HYP looks like: PDB: N,CA,C,O,CB,CG,CD2,OD1 // FF: N,CA,HA,C,O,CB,HB1,HB2,CG,HG,CD2,HD21,HD22,OD1,HD1

I aim to map 8 heavy atoms: The corresponding "to_be_mapped" array looks like {0,1,3,4,5,6,7} and the candidates dictonary like:

{5: frozenset({frozenset({5})}), N 1: frozenset({frozenset({0, 1, 2, 3, 4})}), CA,C,CB,CG,CD2 0: frozenset({frozenset({0, 1, 2, 3, 4})}), - " - 6: frozenset({frozenset({6, 7})}), O, OD1 2: frozenset({frozenset({0, 1, 2, 3, 4})}), CA,C,CB,CG,CD2 4: frozenset({frozenset({0, 1, 2, 3, 4})}), - " - 3: frozenset({frozenset({0, 1, 2, 3, 4})}), - " - 7: frozenset({frozenset({6, 7})}) O, OD1

If I understood correct: (idx from pdb-file : frozenset( idx from FF)) means Nitrogen is 5, Carbon atoms are 0-4 and Oxygens are 6-7.

Next the algorithm tries to find the isomorphism (between pdb-file and force field ?) in the _largest_common_subgraph / _map_nodes fct.

Here the algorithm gives me "StopsIteration" and the output mapping dictonary looks like {0:0,1:1,2:2,3:3,4:4,6:6}. As can be seen index 5 and 7 are not in the final mapping dictonary anymore. As a result, they are marked as "PTM-atoms" and since they are not mentioned in the "modifications.mapping" file, they can not be identified.

-> the algorithm kind of removes the subgraph nodes from the to_be_mapped array ?

Interestingly, the error described here happens only for the large protein system and not for the small one

Main differences between pdb file small system and fibril are: 1) small system has hydrogens in the pdb file and fibril pdb does not 2) small system contains HYP, but not NAA

My question would be: Do I have to treat the new amino acid also as a modification? Why does the ISMAGS algorithm not recognize nitrogen and the second oxygen? Probably they are deleted (see map_nodes-fct, line 610-635) Why is it working for the small system and not for the fibril.

I hope this is somehow clear, if not please let me know.

Any recommendation is highly appreciated.

Thanks a lot and best, Matthias

pckroon commented 2 years ago

Hey Matthias,

Nice work on the NME/ACE modifications! Feel free to open a PR for these once you're ready for them to be publicly available. I assume you found the documentation?

Do I have to treat the new amino acid also as a modification?

No, making it a "block" is fine/the preferred way. You can add them to aminoacids.rtp, create your own rtp files, or itp/ff files.

Why does the ISMAGS algorithm not recognize nitrogen and the second oxygen? Probably they are deleted (see map_nodes-fct, line 610-635)

ISMAGS determines (symmetrically unique) (sub)graph isomorphisms. This means that the connectivity of these atoms is taken into account in their identification (this also means that your atomnames don't have to match per se!).

Why is it working for the small system and not for the fibril.

Good question :) See below.

Number of molecules after guessing bonds are not in agreement with molecules guessed from PDF file.

This is a very bad sign. It probably means that some atoms are too close together, and then vermouth guesses a bond between them. The current algorithm adds bonds between atoms that should be bonded based on their atom names within blocks first, and then looks at distances. It takes the vdw radii of the atoms involved, and if they are not hydrogens, and close enough together (+-10%, see also --bonds-fudge), a bond will be added. This all can be tuned with the --bonds-from flag. Coincidentally this is the same algorithm that VMD uses.

I would start with investigating your structure to see /where/ the clashes are. These will probably also result in problems during your simulation later on, after all. Start your investigation around atom 20-N.

Try martinizing a small peptide including NAA. For example, take the first 5 amino acids. You can increase the verbosity with -v or even -vv. Try removing the NAA and martinizing the fibril you're left with.

Are the atoms that vermouth flags as 'missing' in your PDB? Atoms are identified based on connectivity: if they're too close (=bonded) to the wrong atoms they won't be picked up. Missing hydrogens on the other hand won't be an issue. You can also pass the '-ignh' flag to make martinize2 ignore all hydrogens in your input structure.

PS. the -mutate flag might also come in handy?

mbrosz commented 2 years ago

Thank you for your fast and rich-in-detail reply. I played around a little bit with the command I used to martinize my structure and finally got the large-fibril (without NAA) to work. As suggested I could solve the issue by using the "-bonds-from name" flag to force martinize2 to take the bonds from the ff (I assume). Now I am able to martinize my proteins (small system and also really large ones) (without NAA). This works pretty smoothly, thanks for this :)

As suggested, I set up a small capped chain with 4 amino acids (ACE-NAA-HYP-NME) and applied the martinize2 command with the "-bonds-from name" flag. Here everything worked perfectly: no error etc.

Now I wanted to scale-up and martinize the big system (the small chain with 4 amino acids equals the first chain of the fibril). However, here I run in the following error. (looks a bit like the error before, I just used the -v flag).

DEBUG - step - vermouth.processors.repair_graph - Repairing residue ACE182 DEBUG - step - vermouth.processors.repair_graph - Repairing residue NAA183 INFO - missing-atom - vermouth.processors.repair_graph - Missing atom NAA183:OE2 ERROR - missing-atom - vermouth.processors.repair_graph - Could not reconstruct atom NAA183:OE2 ERROR - missing-atom - vermouth.processors.repair_graph - Could not reconstruct atom NAA183:HH1 DEBUG - step - vermouth.processors.repair_graph - Repairing residue HYP184 INFO - missing-atom - vermouth.processors.repair_graph - Missing atom HYP184:N INFO - missing-atom - vermouth.processors.repair_graph - Missing atom HYP184:OD1 DEBUG - missing-atom - vermouth.processors.repair_graph - Adding 68A-HYP184:N DEBUG - missing-atom - vermouth.processors.repair_graph - Adding 75A-HYP184:OD1

Concerning the missing atoms, my pdb file has no hydrogens at all. However, there is still the OE2 atom. The OE2 atom is included in the force field rtp & pdb file. On the other hand, HH1 is only included in the rtp file.

Thank you again for your excellent and fast help regarding my issue. I will play around a bit and see if I can solve the issue by myself. However, I really appreciate any recommendation concerning my issue.

pckroon commented 2 years ago

the "-bonds-from name" flag.

Note that this means that you won't get bonds between residues! You'll need to supply those as CONECT records in this case.

ERROR - missing-atom - vermouth.processors.repair_graph - Could not reconstruct atom NAA183:OE2

This means you're missing an edge/bond in your NAA rtp file. Probably. This also explains why vermouth flags it as missing: connectivity matters.

Tsjerk commented 2 years ago

@pckroon Is this still an issue? The coordinate generation seems to fix part of it and what is left might be better off specified as a new one.

mbrosz commented 2 years ago

Sorry for my late reply, however I could solve my issue concerning the missing atom. So from my side, one can close this issue here.