haddocking / pdb-tools

A dependency-free cross-platform swiss army knife for PDB files.
https://haddocking.github.io/pdb-tools/
Apache License 2.0
369 stars 112 forks source link

pdb_selaltloc removes residues with (only) alternate locations #153

Open mgiulini opened 1 year ago

mgiulini commented 1 year ago

Describe the bug pdb_selaltloc removes residues in which all atoms show alternate locations

To Reproduce

grep -rn pdb4xoj.pdb -e "SER A  85"
pdb_selaltloc pdb4xoj.pdb > pdb4xoj-occ.pdb
grep -rn pdb4xoj-occ.pdb -e "SER A  85"

Expected behavior pdb_selaltloc should not delete the residue but rather pick one of the two available alternate locations

Desktop (please complete the following information):

pdb4xoj.pdb.zip

joaomcteixeira commented 1 year ago

Thanks for this @mgiulini I will try to address it ASAP. Cheers,

mgiulini commented 1 year ago

thank you @joaomcteixeira I just linked the corresponding issue in the artic3d repo

joaomcteixeira commented 1 year ago

Hi, sorry for the delay in addressing this issue. I have been a bit overwhelmed on other fronts. I will address it in the next few weeks. It seems something that needs to be inspected carefully.

mgiulini commented 1 year ago

hi Joao! no worries, I could actually try to give a look into it by myself, as I would like to use pdb_selaltloc in another application, what do you say?

amjjbonvin commented 1 year ago

This is a serious issue that should be solved.

mgiulini commented 1 year ago

I'll look into it!

joaomcteixeira commented 1 year ago

Go for it. Look carefully at the way @JoaoRodrigues and I designed the tests. I suggest you first write the tests for this new case and then try to correct the code for that. Be sure first that the PDB is formatted well. Some issues with PDBs are unsolvable because the PDB itself is wrongly created. Best,

mgiulini commented 1 year ago

I understood the problem, but I think it would take ages for me to solve it without breaking the code, as the algorithm is very complicated.

the problem occurs when flush_resloc_occ is called https://github.com/haddocking/pdb-tools/blob/2a070bbacee9d6608b44bb6d2f749beefd6a7690/pdbtools/pdb_selaltloc.py#L302-L318

You can clearly see from the code from the code that if the input dictionary altloc_lines has a key with highest possible occupancy (e.g., an atom with no alternate location, occupancy = 1.0), then the function will ignore all the other lines, no matter their occupancy. As an example, residues LYS84 and SER85 of the pdb 4xoj give rise to the following lines

{' ': ['ATOM    536  N   LYS A  84      -4.827  -2.055  19.735  1.00  9.68           N  \n', ... some other coordinates], 
'A': ['ATOM    537  CA ALYS A  84      -4.644  -3.431  19.279  0.70 10.05           C  \n', ... some other coordinates],
'B': ['ATOM    538  CA BLYS A  84      -4.643  -3.451  19.329  0.30 10.14           C  \n', ... some other coordinates]}

according to the function, the first key (' ') is selected and the others are simply ignored.

Possible solutions could be:

I tried the second option, but it broke the tests..

joaomcteixeira commented 1 year ago

I would say the first option sounds better. No problem accommodating the if clause.

Also, feel free to adventure yourself rewriting the inner code if you feel so :smiling_imp: . As long as the already implemented test cases pass and the API itself is kept. But first, i suggest trying the if-clause to see what happens. Good luck and many thanks

joaomcteixeira commented 1 year ago

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

My question: Is this a correct altloc formatting? @JoaoRodrigues @amjjbonvin need your help with this.

If yes, I will need to rewrite the script considerably. It's not a problem, but I want to double-check with you before doing it.

Cheers!

amjjbonvin commented 1 year ago

This is a weird one… But if it comes from the PDB then we should ideally handle it.

My question: Is this a correct altloc formatting? @JoaoRodrigues https://github.com/JoaoRodrigues @amjjbonvin https://github.com/amjjbonvin need your help with this.

joaomcteixeira commented 1 year ago

As usual, I went to bed and woke up thinking about it. I am tempted to say it is complicated (or even impossible) to cover this scenario in the current pdb_selaltloc implementation without twisting the algorithm. The current algorithm is not complex. It's an engine that relies on identifying when a new altloc group appears code. But yesterday I couldn't figure out a way to cover this case cleanly and without adding patch code here and there.

I am tempted to rewrite the whole script and return it to something similar to the original implementation: first collect, then flush; contrarily, to flush-on-the-fly as it is now. Obviously, considering all the new tests and edge cases encountered in the last years that contributed to this change in the algorithm.

I will keep you posted. I will work on this today.

mgiulini commented 1 year ago

Thanks for all this input Marco. Addressing it in steps:

1)

according to the old implementation the residue ILE 25 should be excluded, but in my opinion that's wrong. (in #154 coment)

I revisited the tests, and they are correct to my knowledge on altlocs. ILE 25 is excluded if users don't give any option because it has the lowest occupancy. To select the ILE, the command has a trick; users have to specify the space:

pdb_selaltloc -' ' tests/data/dummy_altloc.pdb

In this way, ILE 25 is selected instead of LEU 25. Btw, occupancy should add up to one, and in the example, dummy numbers are used (0.61 and 0.90).

Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two alternate locations correspond to different amino acids (such as residue 11 of https://www.ebi.ac.uk/pdbe/entry-files/pdb1alx.ent), but in that case you have different location labels (ATYR and BTRP). I would not remove full residues just because at the same residue ID there's another amino acid with some alternate locations. But this is up to your design choices of course:)

2)

There are several errors associated with this PDB when using pdb_selaltloc. On the one hand, SER 85 disappears. On the other hand, several atoms from other residues also disappear if we give -A or -B as options.

This PDB has an altloc formatting I have never seen before, that is: Atoms with alternate locations do NOT have the altloc identifier of the default. For example, the default is ' ' (space) for single location atom, while altlocs use directly to A and B and no ' '. I had never seen this before and completely overlooked this possibility when I developed the current version of selaltloc.

have you checked my solution in https://github.com/haddocking/pdb-tools/pull/154? probably not super elegant, but it might be a good starting point

amjjbonvin commented 1 year ago

Is there? Or are both LEU25 and ILE25 at the same position in space? This calls for some visual inspection

Hi there! not sure I agree about this: ILE 25 does not show any alternate locations, right? in my opinion it should be kept in the pdb by pdb_selaltloc, since in this case there's probably only a numbering issue..there're examples where two

mgiulini commented 1 year ago

It is there indeed (ILE25 and LEU25 are almost superimposed), but in the original file https://files.rcsb.org/view/3U7T.pdb ILE25 has the alternate location label A, so basically there are AILE and BLEU at position 25. It is in this case that it makes sense to select only one of them.

joaomcteixeira commented 1 year ago

We are talking about the test case

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

https://github.com/haddocking/pdb-tools/blob/2a070bbacee9d6608b44bb6d2f749beefd6a7690/tests/data/dummy_altloc.pdb#L34-L49

It is the same as for ASER 22 and BPRO 22 at the beginning of the file, but in this case, the first alternative location is .

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

mgiulini commented 1 year ago

The way I see it, ILE 25 and LEU 25 are two options of the same alt loc group

here is the question: should we consider the simple space " " an alternative location? I would not do that, but it's a design choice.

have you checked my solution in #154? Probably not super elegant, but it might be a good starting point

Yes, I saw. Thanks. But it's not clean in the sense of the implemented algorithm because it is a patch code trying to overcome a problem at the is_another_altloc_group function. And, it breaks two tests. We need something that is clean from scratch. I will try to do it today.

sure, as you wish

JoaoRodrigues commented 1 year ago

I think there's some confusion about the heterogeneity possible in PDB files. Altlocs, specifically and by design, are meant to represent multiple instances of the same atom (same chain, resnum, inscode, atom name) with different occupancies. Indeed, the spec says that if there are multiple altlocs, they all should have non-blank altloc letters, but this is clearly not the case in several files in the wild.

In this case specifically that you mention Marco, I would consider these two instances of the same residue. Basically, my guess is that the authors couldn't figure out it this is a ILE or a LEU and decided to put both possibilities in.

My solution for this is to treat those ILE and LEU as different conformations of the same residue (they do have partial occupancies that add to 1) and simply pick the one with highest occupancy. If you abuse the format, well, tough luck.. We should identify residues as (chain, resnum, inscode), which is what the spec says.

I'll chat with João offline about a good implementation.

mgiulini commented 1 year ago

Hi Joao, yes, your solution makes sense..if the user submits a pdb with two amino acids having same chain-resnum..well, there's nothing you can do about it the important thing is that the code should never delete a full residue (such as SER A 85 in the example) but rather pick one of the available altlocs

joaomcteixeira commented 1 year ago

We are having some discussions on slack, but this one is relevant to register here:

In the case PDB @mgiulini sent. For example, for SER 85, altlocs A and B have the same occupancy of 0.50. What should we do for these cases when users run pdb_selaltloc without specifying a selection option, that is, select by occupancy?

  1. Should we select the first conformation?
  2. Should we select one of the conformations randomly?
  3. Should we output both and maintain the altloc character? (note altloc characters are deleted after selaltloc)
JoaoRodrigues commented 1 year ago

Pick the first.

Message ID: @.***>

amjjbonvin commented 1 year ago

Simply select the 1st one