pdb_deinsert - Githubissues

amjjbonvin commented 5 years ago

Insertions in the residue sequence, numbered 100A, 100B, ..., can be nasty for some programs. It would be nice to have a tool that detects them and renumbers the PDB file sequentially. pdb_reres will preserve the insertions.
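A minimal sketch of the detection step, assuming the fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). The function name `find_insertions` is hypothetical and is not part of pdb-tools.

```python
def find_insertions(lines):
    """Return the unique (chain, resseq, icode) triples that carry an insertion code."""
    found = []
    for line in lines:
        # Only coordinate records carry residue identifiers in these columns.
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        icode = line[26]  # column 27 of the record (0-based index 26)
        if icode != " ":
            key = (line[21], int(line[22:26]), icode)
            if key not in found:
                found.append(key)
    return found


sample = [
    "ATOM   2539 HD22 ASN B  52      -3.148 -41.154 -25.865  1.00 15.00           H",
    "ATOM   2540  N   PRO B  52A     -0.811 -39.882 -23.882  1.00  2.00           N",
    "ATOM   2541  CA  PRO B  52A      0.530 -39.717 -24.436  1.00  2.00           C",
]
print(find_insertions(sample))
```

For the three records above, only residue 52A on chain B should be reported.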

joaomcteixeira commented 5 years ago

If I understood correctly, the implementation would be:

input (from PDB 1IGY):

ATOM   2529  N   ASN B  52      -3.487 -37.610 -22.718  1.00  7.64           N  
ATOM   2530  CA  ASN B  52      -2.923 -38.940 -22.955  1.00  7.64           C  
ATOM   2531  C   ASN B  52      -1.518 -38.767 -23.550  1.00  7.64           C  
ATOM   2532  O   ASN B  52      -1.072 -37.623 -23.713  1.00 69.43           O  
ATOM   2533  CB  ASN B  52      -3.823 -39.738 -23.883  1.00 69.43           C  
ATOM   2534  CG  ASN B  52      -3.683 -39.305 -25.298  1.00 69.43           C  
ATOM   2535  OD1 ASN B  52      -3.846 -38.127 -25.605  1.00 69.43           O  
ATOM   2536  ND2 ASN B  52      -3.313 -40.237 -26.169  1.00 69.43           N  
ATOM   2537  H   ASN B  52      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B  52      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B  52      -3.148 -41.154 -25.865  1.00 15.00           H  
ATOM   2540  N   PRO B  52A     -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B  52A      0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B  52A      0.551 -38.668 -25.518  1.00  2.00           C  
ATOM   2543  O   PRO B  52A      0.751 -37.490 -25.226  1.00  2.00           O  
ATOM   2544  CB  PRO B  52A      0.852 -41.118 -24.971  1.00  2.00           C  
ATOM   2545  CG  PRO B  52A      0.183 -42.000 -24.023  1.00  2.00           C  
ATOM   2546  CD  PRO B  52A     -1.161 -41.317 -23.829  1.00  2.00           C

Running pdb_delinsert 1IGY.pdb would select the residues without an insertion code:

ATOM   2529  N   ASN B  52      -3.487 -37.610 -22.718  1.00  7.64           N  
ATOM   2530  CA  ASN B  52      -2.923 -38.940 -22.955  1.00  7.64           C  
ATOM   2531  C   ASN B  52      -1.518 -38.767 -23.550  1.00  7.64           C  
ATOM   2532  O   ASN B  52      -1.072 -37.623 -23.713  1.00 69.43           O  
ATOM   2533  CB  ASN B  52      -3.823 -39.738 -23.883  1.00 69.43           C  
ATOM   2534  CG  ASN B  52      -3.683 -39.305 -25.298  1.00 69.43           C  
ATOM   2535  OD1 ASN B  52      -3.846 -38.127 -25.605  1.00 69.43           O  
ATOM   2536  ND2 ASN B  52      -3.313 -40.237 -26.169  1.00 69.43           N  
ATOM   2537  H   ASN B  52      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B  52      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B  52      -3.148 -41.154 -25.865  1.00 15.00           H

while an option would select the given insertion, e.g. pdb_delinsert -A 1IGY.pdb:

ATOM   2540  N   PRO B  52A     -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B  52A      0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B  52A      0.551 -38.668 -25.518  1.00  2.00           C  
ATOM   2543  O   PRO B  52A      0.751 -37.490 -25.226  1.00  2.00           O  
ATOM   2544  CB  PRO B  52A      0.852 -41.118 -24.971  1.00  2.00           C  
ATOM   2545  CG  PRO B  52A      0.183 -42.000 -24.023  1.00  2.00           C  
ATOM   2546  CD  PRO B  52A     -1.161 -41.317 -23.829  1.00  2.00           C

is that so?

amjjbonvin commented 5 years ago

No! It's not about deleting insertions but about making them part of the regular numbering, i.e. shifting the numbering to accommodate the insertions. The title is deinsert and not delinsert :-)

joaomcteixeira commented 5 years ago

input:

[truncated]
ATOM   2537  H   ASN B  52      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B  52      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B  52      -3.148 -41.154 -25.865  1.00 15.00           H  
ATOM   2540  N   PRO B  52A     -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B  52A      0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B  52A      0.551 -38.668 -25.518  1.00  2.00           C  
[truncated]

output:

[truncated]
ATOM   2537  H   ASN B  52      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B  52      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B  52      -3.148 -41.154 -25.865  1.00 15.00           H  
ATOM   2540  N   PRO B  53      -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B  53       0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B  53       0.551 -38.668 -25.518  1.00  2.00           C  
[truncated]

amjjbonvin commented 5 years ago

That's it!

JoaoRodrigues commented 5 years ago

But insertions are there for a reason, I'd see 2 tools for this: delete insertions and select insertion, where the latter would just pick one and then the regular reres or shiftres would do the job?

amjjbonvin commented 5 years ago

They are there mainly because of an old numbering convention for antibodies. And this is messing things up in various software. These are not alternative conformations, but real residues. If you want to remove the A, B, C, ... after the residue number you need to renumber all subsequent residues. Since all inserted residues have the same residue number, pdb_reres and pdb_shiftres will not remove the residue numbering overlap.

Again - I don't want to delete the insertion, but rather integrate them smoothly in the sequential residue numbering hence the name deinsert and not delinsert
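The behavior requested here can be sketched as follows. This is only an illustration under stated assumptions, not the code that was eventually merged: the name `deinsert` is hypothetical, only ATOM/HETATM records are handled, inserted residues are assumed to follow their base residue in file order, and the running offset restarts whenever the chain ID changes.

```python
def deinsert(lines):
    """Fold insertion codes into the numbering, keeping existing gaps."""
    out = []
    offset = 0          # how far the current chain's numbering has been shifted
    prev_chain = None
    prev_res = None     # (resseq field, icode) of the previous record
    for line in lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 27:
            chain = line[21]
            res = (line[22:26], line[26])
            if chain != prev_chain:
                # Assumption: each chain is renumbered independently.
                offset, prev_chain, prev_res = 0, chain, None
            if res != prev_res:
                if res[1] != " ":
                    offset += 1  # each inserted residue pushes the rest by one
                prev_res = res
            new_num = int(res[0]) + offset
            # Write the shifted number and blank out the insertion code.
            line = line[:22] + f"{new_num:>4d}" + " " + line[27:]
        out.append(line)
    return out
```

With residues 52, 52A, 53 and 60 on one chain, this sketch would yield 52, 53, 54 and 61: the insertion is absorbed and the gap between 53 and 60 keeps its width.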

joaomcteixeira commented 5 years ago

Following the first message of @amjjbonvin, just to clarify the behavior of pdb_reres:

On the above example, pdb_reres currently outputs:

ATOM   2537  H   ASN B   1      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B   1      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B   1      -3.148 -41.154 -25.865  1.00 15.00           H  
ATOM   2540  N   PRO B   2A     -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B   2A      0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B   2A      0.551 -38.668 -25.518  1.00  2.00           C

should it be updated to:

ATOM   2537  H   ASN B   1      -3.256 -36.866 -23.312  1.00 15.00           H  
ATOM   2538 HD21 ASN B   1      -3.216 -39.968 -27.105  1.00 15.00           H  
ATOM   2539 HD22 ASN B   1      -3.148 -41.154 -25.865  1.00 15.00           H  
ATOM   2540  N   PRO B   1A     -0.811 -39.882 -23.882  1.00  2.00           N  
ATOM   2541  CA  PRO B   1A      0.530 -39.717 -24.436  1.00  2.00           C  
ATOM   2542  C   PRO B   1A      0.551 -38.668 -25.518  1.00  2.00           C

If so, I suggest opening a new issue for this matter.

amjjbonvin commented 5 years ago

Ok - so pdb_reres would do the trick, but at the cost of losing any gaps present in the structure, since all residues will be numbered sequentially.
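To make the trade-off concrete, here is a small sketch of purely sequential renumbering on a made-up list of residues. It is not the pdb_reres implementation; `renumber_sequential` and the example residue list are hypothetical.

```python
def renumber_sequential(residues, start=1):
    """Assign consecutive numbers, in file order, to a list of (resseq, icode) residues."""
    return [start + i for i, _ in enumerate(residues)]


# Hypothetical chain: residue 51 is missing, 52A is an insertion, 54-59 are missing.
original = [(50, " "), (52, " "), (52, "A"), (53, " "), (60, " ")]
print(renumber_sequential(original))
```

The insertion gets its own number, but both gaps in the original numbering disappear.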

JoaoRodrigues commented 5 years ago

I edited pdb_reres recently to identify residues and not follow the numbering blindly. This was causing all sorts of troubles when one chain ended in X and the next started with X.

Do you think selecting one and reres is a good idea?

amjjbonvin commented 5 years ago

Not sure what you mean here...

JoaoRodrigues commented 5 years ago

Sorry, I was in a rush yesterday writing these messages.

I meant that we should not duplicate functionality that already exists. The insertions are a bit of a problem because they can represent a ton of different things, including antibody loops.

I recently changed pdb_reres to identify residues taking into account not only their number but also chain and insertion code. What if we change the behavior of pdb_reres to preserve gaps (if it finds a gap in the original numbering it should add +1 to the new numbering to preserve it)? This, together with pdb_selicode or whatever we are calling it (I would advocate for keeping names consistent with the field names in the PDB format specification) could be a more general solution to this problem?
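One possible reading of the gap-preserving idea, sketched on a list of (resseq, icode) pairs for a single chain. This is an assumption-laden illustration, not the actual pdb_reres code: `renumber_keep_gaps` is a hypothetical name, and the sketch keeps the full width of each gap, which is only one way to interpret "add +1".

```python
def renumber_keep_gaps(residues, start=1):
    """Renumber residues from `start`, keeping gaps found in the original numbering."""
    new_numbers = []
    current = start
    previous = None
    for resseq, _icode in residues:
        if previous is not None:
            # A step larger than 1 is a gap and is carried over as-is;
            # consecutive or inserted residues (step 1 or 0) advance by one.
            current += max(resseq - previous, 1)
        new_numbers.append(current)
        previous = resseq
    return new_numbers


# Hypothetical chain: residue 51 is missing, 52A is an insertion, 54-59 are missing.
original = [(50, " "), (52, " "), (52, "A"), (53, " "), (60, " ")]
print(renumber_keep_gaps(original))
```

Here the insertion 52A receives its own number while the missing residues still show up as jumps in the new numbering.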

@joaomcteixeira comments?

mtrellet commented 5 years ago

I might be wrong, but wouldn't it be simpler to add an option to pdb_reres to either preserve the gaps in the numbering or not? You let the user decide what they want to do. And we could potentially raise a warning via stderr when we find gaps and the option has not been set? Another option could potentially apply to insertions. I'm not sure there is one and only one "standardized" way to approach the problem. In my opinion we should leave a bit of flexibility and allow the user to choose either of the options (or a combination of options).

JoaoRodrigues commented 5 years ago

That's too many options.. I really wanted to keep it one option per script :-/

mtrellet commented 5 years ago

I would argue that as long as you are doing "normal" things this is simple, just use the script without any argument. Then we should adopt the most generic and standardized way to handle things.

In parallel, you leave the possibility for advanced users to tune the behavior.

But I fully get your point and that's a decision to make. And anyway, if a combination of scripts can achieve the requested behavior then we should leave it that way but maybe document it (maybe a Recipes section in the documentation? For the tasks that could happen more than once and require some "tricky" pipelines?)

amjjbonvin commented 5 years ago

We do have pdb_shiftres, which preserves gaps, but only pdb_reres currently includes insertions in the renumbering, and it does not preserve gaps.

Those insertions are nasty. Which is why I was advertising a pdb_deinsert that should include the insertions in the numbering and preserve the gaps. In that scenario pdb_reres should maybe go back to the old behavior and keep the insertions.

Try finding an advanced scenario with the existing tools to "deinsert" the insertions and preserve the gaps. That's a challenge...

joaomcteixeira commented 5 years ago

Following everything said, I would suggest:

Both pdb_reres and pdb_shiftres should ONLY change residue numbers, without affecting the letter of the insertion code:
- pdb_reres should relabel residues and remove gaps, as its description says: "Renumbers the residues of the PDB file starting from a given number (default 1)."
- pdb_shiftres should do what it says it does: "Renumbers the residues of the PDB file by adding/subtracting a given number from the original numbering."
- Looking for inserts is the task of neither reres nor shiftres.
- So maybe reres should go back one version, as said.
pdb_deinsert should care about this issue. The question is how? I think that what was proposed before (https://github.com/haddocking/pdb-tools/issues/13#issuecomment-447587632) is the correct behavior. In that sense, pdb_deinsert should NOT reres before insertions and should just shift residue labels onward after insertions, and therefore should not allow for options. If this is so, PR #14 should be rejected.
I do not agree with @mtrellet about allowing users to choose options of that kind. When I was at Stanford with @JoaoRodrigues he embedded the 1tool1job philosophy in my brain, and now I really understand it and like it for pdb-tools. I will advocate for it to the maximum until the owners of the project decide otherwise :-P
- In your defense, @mtrellet, I have developed derivatives of pdb-tools that allow for several options, but those live on my fork :-P
Finally, as @mtrellet said, I think the project really needs a file briefly describing every tool, each followed by an example, so that situations like #11 and even this one do not repeat.
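As an illustration of "only change residue numbers, leave the insertion letter alone", here is a sketch of a shift applied to a single record. It assumes the fixed-column PDB layout (residue number in columns 23-26, insertion code in column 27). `shift_resseq` is a hypothetical helper, not the pdb_shiftres source.

```python
def shift_resseq(line, delta):
    """Add `delta` to the residue number of an ATOM/HETATM record, keeping the insertion code."""
    if not line.startswith(("ATOM", "HETATM")) or len(line) < 26:
        return line
    new_num = int(line[22:26]) + delta
    # Column 27 (index 26) onward, including the insertion code, is left untouched.
    return line[:22] + f"{new_num:>4d}" + line[26:]


record = "ATOM   2540  N   PRO B  52A     -0.811 -39.882 -23.882  1.00  2.00           N"
print(shift_resseq(record, 10))
```

Shifting by 10 turns 52A into 62A, so the overlap between a residue and its insertion is unchanged, which is the limitation discussed above.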

amjjbonvin commented 5 years ago

pdb_deinsert should care about this issue. The question is how? I think that what was proposed before (#13 (comment)) is the correct behavior. In that sense, pdb_deinsert should NOT reres before insertions and should just shift residue labels onward after insertions, therefore should not allow for options. If this is so, PR #14 should be rejected.

I agree with this strategy.

PS: And pdb_deinsert should then also remove the one letter label of the insertion otherwise pdb_wc will still think there is an insertion.

joaomcteixeira commented 5 years ago

PS: And pdb_deinsert should then also remove the one letter label of the insertion otherwise pdb_wc will still think there is an insertion.

Definitely

JoaoRodrigues commented 5 years ago

After thinking about this yesterday for a bit, here's my suggestion for a solution.

I wrote a pdb_delicode tool that allows the user to pass an option to specify which insertions to delete, e.g. pdb_delicode -A99,B12. This would remove insertions on chain A residue 99 and chain B residue 12. By default, it removes all. It also pads the numbering of the residues downstream of the removed insertions.
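A sketch of how an option string such as -A99,B12 could be turned into (chain, residue number) targets. This is a guess at the parsing step only: `parse_targets` is a hypothetical name and the merged tool may handle its option differently. An empty result stands for the "all insertions" default.

```python
def parse_targets(option):
    """Parse an option like '-A99,B12' into a set of (chain, resseq) pairs."""
    if option.startswith("-"):
        option = option[1:]  # drop the leading option dash
    targets = set()
    for item in option.split(","):
        item = item.strip()
        if not item:
            continue
        # First character is the chain ID, the rest is the residue number.
        targets.add((item[0], int(item[1:])))
    return targets


print(sorted(parse_targets("-A99,B12")))
```

A malformed item (for example a chain letter with no number) raises ValueError from `int`, which a real command-line tool would presumably report to the user.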

I think this is quite flexible and addresses most of the issues we had here. It does renumber, but I think that's something we have to include for simplicity.

Thumbs up for approval and I will merge the code.

amjjbonvin commented 5 years ago

Sounds good. So the default is that all insertions are removed, correct?

But I am not convinced by the name - it is confusing since it implies it deletes insertions, which is not the case. So I would still be in favour of pdb_deinsert or pdb_uninsert

amjjbonvin commented 5 years ago

Closed by commit 50ba6e1

haddocking / pdb-tools

pdb_deinsert #13