Open mgiulini opened 1 year ago
Thanks for this @mgiulini I will try to address it ASAP. Cheers,
thank you @joaomcteixeira I just linked the corresponding issue in the artic3d repo
Hi, sorry for the delay in addressing this issue. I have been a bit overwhelmed on other fronts. I will address it in the next few weeks. It seems something that needs to be inspected carefully.
hi Joao! no worries, I could actually try to give a look into it by myself, as I would like to use pdb_selaltloc
in another application, what do you say?
This is a serious issue that should be solved.
I'll look into it!
Go for it. Look carefully at the way @JoaoRodrigues and I designed the tests. I suggest you first write the tests for this new case and then try to correct the code for that. Be sure first that the PDB is formatted well. Some issues with PDBs are unsolvable because the PDB itself is wrongly created. Best,
I understood the problem, but I think it would take ages for me to solve it without breaking the code, as the algorithm is very complicated.
the problem occurs when flush_resloc_occ
is called https://github.com/haddocking/pdb-tools/blob/2a070bbacee9d6608b44bb6d2f749beefd6a7690/pdbtools/pdb_selaltloc.py#L302-L318
You can clearly see from the code from the code that if the input dictionary altloc_lines
has a key with highest possible occupancy (e.g., an atom with no alternate location, occupancy = 1.0), then the function will ignore all the other lines, no matter their occupancy. As an example, residues LYS84 and SER85 of the pdb 4xoj give rise to the following lines
{' ': ['ATOM 536 N LYS A 84 -4.827 -2.055 19.735 1.00 9.68 N \n', ... some other coordinates],
'A': ['ATOM 537 CA ALYS A 84 -4.644 -3.431 19.279 0.70 10.05 C \n', ... some other coordinates],
'B': ['ATOM 538 CA BLYS A 84 -4.643 -3.451 19.329 0.30 10.14 C \n', ... some other coordinates]}
according to the function, the first key (' ') is selected and the others are simply ignored.
Possible solutions could be:
flush_func_multi_residues
logic and read the residues sequentially (according to their number)I tried the second option, but it broke the tests..
I would say the first option sounds better. No problem accommodating the if
clause.
Also, feel free to adventure yourself rewriting the inner code if you feel so :smiling_imp: . As long as the already implemented test cases pass and the API itself is kept. But first, i suggest trying the if-clause to see what happens. Good luck and many thanks
Thanks for all this input Marco. Addressing it in steps:
1)
according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)
I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25
is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:
pdb_selaltloc -' ' tests/data/dummy_altloc.pdb
In this way, ILE 25
is selected instead of LEU 25
. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).
2)
There are several errors associated with this PDB when using pdb_selaltloc
. On the one hand, SER 85
disappears. On the other hand, several atoms from other residues also disappear if we give -A
or -B
as options.
This PDB has an altloc
formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' '
(space) for single location atom, while altlocs use directly to A
and B
and no ' '
. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc
.
My question: Is this a correct altloc formatting? @JoaoRodrigues @amjjbonvin need your help with this.
If yes, I will need to rewrite the script considerably. It's not a problem, but I want to double-check with you before doing it.
Cheers!
This is a weird one… But if it comes from the PDB then we should ideally handle it.
My question: Is this a correct altloc formatting? @JoaoRodrigues https://github.com/JoaoRodrigues @amjjbonvin https://github.com/amjjbonvin need your help with this.
As usual, I went to bed and woke up thinking about it. I am tempted to say it is complicated (or even impossible) to cover this scenario in the current pdb_selaltloc
implementation without twisting the algorithm. The current algorithm is not complex. It's an engine that relies on identifying when a new altloc group appears code. But yesterday I couldn't figure out a way to cover this case cleanly and without adding patch code here and there.
I am tempted to rewrite the whole script and return it to something similar to the original implementation: first collect, then flush; contrarily, to flush-on-the-fly as it is now. Obviously, considering all the new tests and edge cases encountered in the last years that contributed to this change in the algorithm.
I will keep you posted. I will work on this today.
Thanks for all this input Marco. Addressing it in steps:
1)
according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)
I revisited the tests, and they are correct to my knowledge on altlocs.
ILE 25
is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:
pdb_selaltloc -' ' tests/data/dummy_altloc.pdb
In this way,
ILE 25
is selected instead ofLEU 25
. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).
Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc
, since in this case there's probably only a numbering issue..there're examples where two alternate locations correspond to different amino acids (such as residue 11 of https://www.ebi.ac.uk/pdbe/entry-files/pdb1alx.ent), but in that case you have different location labels (ATYR and BTRP). I would not remove full residues just because at the same residue ID there's another amino acid with some alternate locations. But this is up to your design choices of course:)
2)
There are several errors associated with this PDB when using
pdb_selaltloc
. On the one hand,SER 85
disappears. On the other hand, several atoms from other residues also disappear if we give-A
or-B
as options.This PDB has an
altloc
formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is' '
(space) for single location atom, while altlocs use directly toA
andB
and no' '
. I had never seen this before and completely overlooked this possibility when I developed the current version ofselaltloc
.
have you checked my solution in https://github.com/haddocking/pdb-tools/pull/154? probably not super elegant, but it might be a good starting point
Is there? Or are both LEU25 and ILE25 at the same position in space? This calls for some visual inspection
Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two
It is there indeed (ILE25 and LEU25 are almost superimposed), but in the original file https://files.rcsb.org/view/3U7T.pdb ILE25 has the alternate location label A, so basically there are AILE and BLEU at position 25. It is in this case that it makes sense to select only one of them.
We are talking about the test case
The way I see it, ILE 25
and LEU 25
are two options of the same alt loc group
It is the same as for ASER 22
and BPRO 22
at the beginning of the file, but in this case, the first alternative location is
.
have you checked my solution in #154? Probably not super elegant, but it might be a good starting point
Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group
function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.
The way I see it,
ILE 25
andLEU 25
are two options of the same alt loc group
here is the question: should we consider the simple space " "
an alternative location? I would not do that, but it's a design choice.
have you checked my solution in #154? Probably not super elegant, but it might be a good starting point
Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the
is_another_altloc_group
function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.
sure, as you wish
I think there's some confusion about the heterogeneity possible in PDB files. Altlocs, specifically and by design, are meant to represent multiple instances of the same atom (same chain, resnum, inscode, atom name) with different occupancies. Indeed, the spec says that if there are multiple altlocs, they all should have non-blank altloc letters, but this is clearly not the case in several files in the wild.
In this case specifically that you mention Marco, I would consider these two instances of the same residue. Basically, my guess is that the authors couldn't figure out it this is a ILE or a LEU and decided to put both possibilities in.
My solution for this is to treat those ILE and LEU as different conformations of the same residue (they do have partial occupancies that add to 1) and simply pick the one with highest occupancy. If you abuse the format, well, tough luck.. We should identify residues as (chain, resnum, inscode), which is what the spec says.
I'll chat with João offline about a good implementation.
Hi Joao, yes, your solution makes sense..if the user submits a pdb with two amino acids having same chain-resnum..well, there's nothing you can do about it
the important thing is that the code should never delete a full residue (such as SER A 85
in the example) but rather pick one of the available altlocs
We are having some discussions on slack, but this one is relevant to register here:
In the case PDB @mgiulini sent. For example, for SER 85
, altlocs A
and B
have the same occupancy of 0.50
. What should we do for these cases when users run pdb_selaltloc
without specifying a selection option, that is, select by occupancy?
selaltloc
)Pick the first.
Message ID: @.***>
Simply select the 1st one
Describe the bug pdb_selaltloc removes residues in which all atoms show alternate locations
To Reproduce
Expected behavior
pdb_selaltloc
should not delete the residue but rather pick one of the two available alternate locationsDesktop (please complete the following information):
pdb4xoj.pdb.zip