Closed amjjbonvin closed 5 years ago
If I understood correctly, the implementation would be:
input (from PDB 1IGY):
ATOM 2529 N ASN B 52 -3.487 -37.610 -22.718 1.00 7.64 N
ATOM 2530 CA ASN B 52 -2.923 -38.940 -22.955 1.00 7.64 C
ATOM 2531 C ASN B 52 -1.518 -38.767 -23.550 1.00 7.64 C
ATOM 2532 O ASN B 52 -1.072 -37.623 -23.713 1.00 69.43 O
ATOM 2533 CB ASN B 52 -3.823 -39.738 -23.883 1.00 69.43 C
ATOM 2534 CG ASN B 52 -3.683 -39.305 -25.298 1.00 69.43 C
ATOM 2535 OD1 ASN B 52 -3.846 -38.127 -25.605 1.00 69.43 O
ATOM 2536 ND2 ASN B 52 -3.313 -40.237 -26.169 1.00 69.43 N
ATOM 2537 H ASN B 52 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 52 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 52 -3.148 -41.154 -25.865 1.00 15.00 H
ATOM 2540 N PRO B 52A -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 52A 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 52A 0.551 -38.668 -25.518 1.00 2.00 C
ATOM 2543 O PRO B 52A 0.751 -37.490 -25.226 1.00 2.00 O
ATOM 2544 CB PRO B 52A 0.852 -41.118 -24.971 1.00 2.00 C
ATOM 2545 CG PRO B 52A 0.183 -42.000 -24.023 1.00 2.00 C
ATOM 2546 CD PRO B 52A -1.161 -41.317 -23.829 1.00 2.00 C
using: pdb_delinsert 1IGY.pdb
would select the insert without letter char
ATOM 2529 N ASN B 52 -3.487 -37.610 -22.718 1.00 7.64 N
ATOM 2530 CA ASN B 52 -2.923 -38.940 -22.955 1.00 7.64 C
ATOM 2531 C ASN B 52 -1.518 -38.767 -23.550 1.00 7.64 C
ATOM 2532 O ASN B 52 -1.072 -37.623 -23.713 1.00 69.43 O
ATOM 2533 CB ASN B 52 -3.823 -39.738 -23.883 1.00 69.43 C
ATOM 2534 CG ASN B 52 -3.683 -39.305 -25.298 1.00 69.43 C
ATOM 2535 OD1 ASN B 52 -3.846 -38.127 -25.605 1.00 69.43 O
ATOM 2536 ND2 ASN B 52 -3.313 -40.237 -26.169 1.00 69.43 N
ATOM 2537 H ASN B 52 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 52 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 52 -3.148 -41.154 -25.865 1.00 15.00 H
while an option would select the given insert, using: pdb_delinsert -A 1IGY.pdb
ATOM 2540 N PRO B 52A -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 52A 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 52A 0.551 -38.668 -25.518 1.00 2.00 C
ATOM 2543 O PRO B 52A 0.751 -37.490 -25.226 1.00 2.00 O
ATOM 2544 CB PRO B 52A 0.852 -41.118 -24.971 1.00 2.00 C
ATOM 2545 CG PRO B 52A 0.183 -42.000 -24.023 1.00 2.00 C
ATOM 2546 CD PRO B 52A -1.161 -41.317 -23.829 1.00 2.00 C
is that so?
No! It's not about deleting insertions but about making them part of the regular numbering, i.e. shifting the numbering to accommodate the insertions. The title is deinsert and not delinsert
:-)
input:
[truncated]
ATOM 2537 H ASN B 52 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 52 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 52 -3.148 -41.154 -25.865 1.00 15.00 H
ATOM 2540 N PRO B 52A -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 52A 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 52A 0.551 -38.668 -25.518 1.00 2.00 C
[truncated]
output:
[truncated]
ATOM 2537 H ASN B 52 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 52 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 52 -3.148 -41.154 -25.865 1.00 15.00 H
ATOM 2540 N PRO B 53 -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 53 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 53 0.551 -38.668 -25.518 1.00 2.00 C
[truncated]
That's it!
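The input/output above can be sketched in Python roughly as follows. This is only an illustration, not the actual tool: `deinsert` and `atom_line` are hypothetical names, the records are assumed to follow the fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, insertion code in column 27), and an inserted residue is assumed to come right after the residue whose number it shares. The GLY and TYR records in the demo are made up to show the shift and a preserved gap.

```python
def atom_line(serial, name, resname, chain, resseq, icode=" "):
    """Build a fixed-column ATOM record with dummy coordinates (demo helper)."""
    head = f"ATOM  {serial:>5d} {name:<4s} {resname:>3s} {chain}{resseq:>4d}{icode}"
    return head + "   " + "   0.000" * 3


def deinsert(lines):
    """Fold insertion codes into the numbering and shift what follows."""
    out = []
    offset = 0          # insertions seen so far in the current chain
    prev_chain = None
    prev_resid = None   # residue number + insertion code of the previous atom
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")):
            out.append(line)  # other records are passed through untouched
            continue
        chain, resid, icode = line[21], line[22:27], line[26]
        if chain != prev_chain:
            offset, prev_chain, prev_resid = 0, chain, None
        if resid != prev_resid:
            if icode != " ":
                offset += 1   # each inserted residue pushes the rest by one
            prev_resid = resid
        new_number = int(line[22:26]) + offset
        # Write the shifted number and blank out the insertion code.
        out.append(f"{line[:22]}{new_number:>4d} {line[27:]}")
    return out


demo = [
    atom_line(2539, "HD22", "ASN", "B", 52),
    atom_line(2540, "N", "PRO", "B", 52, "A"),
    atom_line(2541, "CA", "PRO", "B", 52, "A"),
    atom_line(2547, "N", "GLY", "B", 53),   # made-up record
    atom_line(2551, "N", "TYR", "B", 60),   # made-up record after a gap
]
for line in deinsert(demo):
    print(line)
```

Because only an offset is added, gaps that were already in the numbering keep their size.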
But insertions are there for a reason, I'd see 2 tools for this: delete insertions and select insertion, where the latter would just pick one and then the regular reres or shiftres would do the job?
They are there mainly because of an old numbering convention for antibodies, and this messes things up in various software. These are not alternative conformations, but real residues. If you want to remove the A, B, C, ... after the residue number you need to renumber all subsequent residues. Since all inserted residues have the same residue number, pdb_reres and pdb_shift will not remove the residue numbering overlap.
Again - I don't want to delete the insertions, but rather integrate them smoothly into the sequential residue numbering, hence the name deinsert and not delinsert.
Following the first message of @amjjbonvin, just to clarify the behavior of pdb_reres.
On the above example, pdb_reres currently outputs:
ATOM 2537 H ASN B 1 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 1 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 1 -3.148 -41.154 -25.865 1.00 15.00 H
ATOM 2540 N PRO B 2A -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 2A 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 2A 0.551 -38.668 -25.518 1.00 2.00 C
should it be updated to:
ATOM 2537 H ASN B 1 -3.256 -36.866 -23.312 1.00 15.00 H
ATOM 2538 HD21 ASN B 1 -3.216 -39.968 -27.105 1.00 15.00 H
ATOM 2539 HD22 ASN B 1 -3.148 -41.154 -25.865 1.00 15.00 H
ATOM 2540 N PRO B 1A -0.811 -39.882 -23.882 1.00 2.00 N
ATOM 2541 CA PRO B 1A 0.530 -39.717 -24.436 1.00 2.00 C
ATOM 2542 C PRO B 1A 0.551 -38.668 -25.518 1.00 2.00 C
If so, I suggest opening a new issue for this matter.
Ok - so pdb_reres would do the trick, but at the cost of losing any gaps present in the structure, since all residues will be sequentially numbered.
I edited pdb_reres recently to identify residues and not follow the numbering blindly. This was causing all sorts of trouble when one chain ended in X and the next started with X.
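To make the chain-boundary problem concrete, here is a small sketch that groups atoms into residues either by number alone or by the full (chain, number, insertion code) key. The function name `split_residues` and the tuple representation are assumptions for illustration only, not the code that went into pdb_reres.

```python
def split_residues(atoms, use_full_key=True):
    """Group consecutive atoms into residues.

    atoms: iterable of (chain, resseq, icode) tuples, one per atom.
    Returns one (chain, resseq, icode) tuple per detected residue.
    """
    residues = []
    prev = None
    for chain, resseq, icode in atoms:
        # Following the number "blindly" means comparing resseq only.
        key = (chain, resseq, icode) if use_full_key else resseq
        if key != prev:
            residues.append((chain, resseq, icode))
            prev = key
    return residues


# Chain A ends at residue 10 and chain B starts at residue 10.
atoms = [("A", 10, " "), ("A", 10, " "), ("B", 10, " "), ("B", 11, " ")]
print(len(split_residues(atoms, use_full_key=False)))  # number only
print(len(split_residues(atoms, use_full_key=True)))   # chain + number + icode
```

With the number-only key, the last residue of chain A and the first of chain B are merged into one, which is the kind of trouble described above.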
Do you think selecting one and reres is a good idea?
Not sure what you mean here...
Sorry, I was in a rush yesterday writing these messages.
I meant that we should not duplicate functionality that already exists. The insertions are a bit of a problem because they can represent a ton of different things, including antibody loops.
I recently changed pdb_reres to identify residues taking into account not only their number but also chain and insertion code. What if we change the behavior of pdb_reres to preserve gaps (if it finds a gap in the original numbering it should add +1 to the new numbering to preserve it)? This, together with pdb_selicode or whatever we are calling it (I would advocate for keeping names consistent with the field names in the PDB format specification) could be a more general solution to this problem?
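One possible reading of the gap-preserving renumbering proposed here, as a sketch: residues are identified by (chain, number, insertion code), an inserted residue advances the new numbering by one, and a gap keeps its original size. Keeping the full gap size is my interpretation of the proposal, and `renumber_preserving_gaps` is a hypothetical name. A single chain is assumed.

```python
def renumber_preserving_gaps(residues, start=1):
    """Map each residue of one chain to a new number.

    residues: ordered list of unique (chain, resseq, icode) tuples.
    Returns a dict {(chain, resseq, icode): new_number}.
    """
    mapping = {}
    new_number = start
    prev_resseq = None
    for chain, resseq, icode in residues:
        if prev_resseq is not None:
            step = resseq - prev_resseq
            # An insertion has step 0 but still needs its own number;
            # a gap (step > 1) is carried over unchanged.
            new_number += max(step, 1)
        mapping[(chain, resseq, icode)] = new_number
        prev_resseq = resseq
    return mapping


residues = [("B", 52, " "), ("B", 52, "A"), ("B", 53, " "), ("B", 60, " ")]
for key, value in renumber_preserving_gaps(residues).items():
    print(key, "->", value)
```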
@joaomcteixeira comments?
I might be wrong, but wouldn't it be simpler to add an option to pdb_reres to either preserve or not preserve the gaps in the numbering? You let the user decide what they want to do. And we could potentially raise a warning via stderr when we find gaps and the option has not been given?
And another extra option could potentially be applied to insertions.
I'm not sure there is one and only one "standardized" way to approach the problem. In my opinion we should leave a bit of flexibility and allow the user to choose either of the options (or a combination of options).
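A rough sketch of what such an option could look like, working on plain residue numbers for brevity. The `keep_gaps` flag and both function names are hypothetical; nothing in this thread confirms that pdb_reres has such an option.

```python
import sys


def find_gaps(numbers):
    """Return (before, after) pairs where the numbering jumps by more than 1."""
    return [(a, b) for a, b in zip(numbers, numbers[1:]) if b - a > 1]


def renumber(numbers, start=1, keep_gaps=False, stream=sys.stderr):
    """Renumber from `start`, optionally keeping gaps; warn if gaps are closed."""
    if not numbers:
        return []
    gaps = find_gaps(numbers)
    if gaps and not keep_gaps:
        print(f"WARNING: {len(gaps)} gap(s) in the numbering will be closed",
              file=stream)
    new_numbers = [start]
    for a, b in zip(numbers, numbers[1:]):
        step = b - a if (keep_gaps and b - a > 1) else 1
        new_numbers.append(new_numbers[-1] + step)
    return new_numbers


print(renumber([5, 6, 10, 11]))                  # gaps closed, warning on stderr
print(renumber([5, 6, 10, 11], keep_gaps=True))  # gaps kept, no warning
```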
That's too many options.. I really wanted to keep it one option per script :-/
I would argue that as long as you are doing "normal" things this is simple, just use the script without any argument. Then we should adopt the most generic and standardized way to handle things.
In parallel, you leave advanced users the possibility to tune the behavior.
But I fully get your point and that's a decision to make. And anyway, if a combination of scripts can achieve the requested behavior then we should leave it that way but maybe document it (maybe a Recipes section in the documentation? For the tasks that could happen more than once and require some "tricky" pipelines?)
We do have pdb_shiftres, which preserves the gaps, but only pdb_reres will currently include insertions in the renumbering, and it will not preserve the gaps.
Those insertions are nasty. Which is why I was advertising a pdb_deinsert that should include the insertions in the numbering and preserve the gaps. In that scenario pdb_reres should maybe go back to the old behavior and keep the insertions.
Try finding an advanced scenario with the existing tools to "deinsert" the insertions and preserve the gaps. That's a challenge...
Following everything said, I would suggest:
pdb_reres and pdb_shiftres should ONLY change residue numbers, without affecting the letter of the insertion label, in this way:
pdb_reres should relabel residues and remove gaps, as its description says: "Renumbers the residues of the PDB file starting from a given number (default 1)."
pdb_shiftres should do what it says it does: "Renumbers the residues of the PDB file by adding/subtracting a given number from the original numbering."
reres not shiftres. reres should go back one version, as said.
pdb_deinsert should care about this issue. The question is how? I think that what was proposed before (https://github.com/haddocking/pdb-tools/issues/13#issuecomment-447587632) is the correct behavior. In that sense, pdb_deinsert should NOT reres before insertions and should just shift residue labels onward after insertions, and therefore should not allow for options. If this is so, PR #14 should be rejected.
I agree with this strategy.
PS: And pdb_deinsert should then also remove the one-letter label of the insertion, otherwise pdb_wc will still think there is an insertion.
Definitely
After thinking about this yesterday for a bit, here's my suggestion for a solution.
I wrote a pdb_delicode tool that allows the user to pass an option to specify which insertions to delete, e.g. pdb_delicode -A99,B12. This would remove insertions on chain A residue 99 and chain B residue 12. By default, it removes all. It also pads the numbering of the residues downstream of the removed insertions.
I think this is quite flexible and addresses most of the issues we had here. It does renumber, but I think that's something we have to include for simplicity.
Thumbs up for approval and I will merge the code.
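For readers wondering how an option like -A99,B12 could be interpreted, here is a hedged sketch of the parsing step only. The function `parse_selection` is hypothetical and assumes single-character chain IDs; how the merged tool actually parses its option is not shown in this thread.

```python
def parse_selection(option):
    """Turn an option such as '-A99,B12' into {('A', 99), ('B', 12)}."""
    if not option.startswith("-") or len(option) < 2:
        raise ValueError(f"not a selection option: {option!r}")
    selection = set()
    for token in option[1:].split(","):
        if len(token) < 2:
            raise ValueError(f"expected <chain><number>, got {token!r}")
        chain, number = token[0], token[1:]
        selection.add((chain, int(number)))  # int() rejects non-numeric input
    return selection


print(sorted(parse_selection("-A99,B12")))
```

An empty selection (no option at all) would then stand for "all insertions", matching the default described above.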
Sounds good. So the default is that all insertions are removed, correct?
But I am not convinced by the name - it is confusing since it implies it deletes insertions, which is not the case. So I would still be in favour of pdb_deinsert or pdb_uninsert.
Closed by commit 50ba6e1
Insertions in the residue sequence with a numbering such as 100A, 100B, ... can be nasty for some programs. It would be nice to have a tool that will detect those and renumber the PDB file sequentially. pdb_reres will preserve the insertions.
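The detection part could look something like the sketch below. It assumes fixed-column PDB records, with the insertion code in column 27, and `find_insertions` is a hypothetical helper, not part of pdb-tools. The demo records are synthetic.

```python
def find_insertions(lines):
    """Return (chain, resseq, icode) for each residue carrying an insertion code."""
    found = []
    for line in lines:
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
            continue
        icode = line[26]  # column 27 in the PDB format
        if icode != " ":
            key = (line[21], int(line[22:26]), icode)
            if key not in found:  # report each residue once, in file order
                found.append(key)
    return found


# Synthetic CA-only records: 100, 100A, 100B, 101 on chain A.
numbering = [(100, " "), (100, "A"), (100, "B"), (101, " ")]
records = [
    f"ATOM  {serial:>5d}  CA  GLY A{resseq:>4d}{icode}"
    for serial, (resseq, icode) in enumerate(numbering, start=1)
]
print(find_insertions(records))
```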