Closed multimeric closed 2 years ago
Also I think there is an argument for only using auth_seq_id
if it has a non empty (?
or .
) value, not just if the field exists at all.
It is actually very surprising that a HETATM does not have a proper residue number…
On 24 Mar 2022, at 16:06, Michael Milton @.***> wrote:
It is not surprising that a HETATM should not have a residue number, and so this is causing the issue in question. The dilemma though is how to actually choose a number for a HETATM that isn't related to one particular residue.
I know this is a common comment from tool developers, but I don't think this is a bug on our end.
The structure you're trying to parse does not really abide to the standard. If you look at a deposited RCSB structure of P04406 (e.g. https://files.rcsb.org/view/1U8F.cif) you'll see that the NAD residue (and even the waters) have auth_seq_id
numbers. Your structure is coming from alphafill, that adds cofactors to alphafold models, so I'm wondering if they blank the ids of the grafted ligands to avoid any clashes with existing residues. There's very little we can do on our end here, short of writing bad PDB files. I'd raise this error upstream with the developers of alphafill.
Fair enough. I can see in the spec that _atom_site.label_seq_id
is required (although auth_seq_id
is not, so I don't think it should crash if auth_seq_id
is a key that is present but it has no value).
What's weird is that label_seq_id
is supposed to reference an entry in the _entity_poly_seq
table, but in your example, the HETATM
s have a label_seq_id
, but it doesn't point to anything in the _entity_poly_seq
, which makes sense since it's not part of the polymer. So I guess I can just put in any unique integer in that column and it should make pdb-tools
happy.
I was reviewing this issue, and I agree with @JoaoRodrigues answer.
@multimeric could you solve your issue by adding the residue numbers manually?
So I did investigate this further. I went back to the AlphaFill people, but they argue that their structures are actually correct, because, as mentioned, it makes no sense for a non-peptide molecule to have a peptide index. In addition, auth_seq_id
is not mandatory. I'm inclined to agree with this reasoning.
So considering this, I found two solutions. The first, and best solution, is to do the CIF → PDB conversion using https://github.com/PDB-REDO/cif-tools (incidentally made by the same group that made AlphaFill). It does a remarkable job at preserving metadata and ensuring all the required headers are present.
The other method, which uses pdb-tools
, is indeed to "fix" the CIF file, which is something I wrote a little CLI tool to do (before realising the above solution): https://github.com/multimeric/mmcifix. You can pip install
it and then run mmcifix --fixer label_seq_id --fixer auth_seq_id
. The logic implemented by that tool is to just find the highest ID already existing in the file, and set the IDs of the HETATM
to higher IDs which are incrementing.
I am still not convinced :) The structure is not valid. It should have at least _atom_site.label_seq_id
that we would default on if the auth_
wasn't readable.
I downloaded the file you posted in your first message and ran it through pdb_fromcif
and it crashes, but because it finds a chain id that is incompatible with the PDB format (AA
, two characters). We could add an option to force reading label_seq_id
and friends (not *auth*
) but that's as far as I think we can go with such a rudimentary tool.
Our intention was never to support mmCIF files properly as they are a pain and not very amenable to the type of serial processing PDB files lend themselves very nicely to. I'll keep an eye on your mmcifix tools!
I'll close this issue but feel free to re-open if you think it's worth discussing further!
Describe the bug When
pdb_fromcif
encounters anyHETATM
that doesn't have alabel_seq_id
(which I would have thought would be none of them), we get:ValueError: invalid literal for int() with base 10: '?'
when handlingHETATM
To Reproduce Steps to reproduce the behavior. Preferrably, a code snippet.
Expected behavior Should output a PDB file without crashing
Desktop (please complete the following information): n/a
Discussion
Okay so what's happening here is that firstly we're treating
ATOM
andHETATM
the same: https://github.com/haddocking/pdb-tools/blob/3ffb016b501f9e42fdc7a5b1021ddccaa5ddd024/pdbtools/pdb_fromcif.py#L118Then, the issue is here, where we try to find
resnum
, the residue number: https://github.com/haddocking/pdb-tools/blob/3ffb016b501f9e42fdc7a5b1021ddccaa5ddd024/pdbtools/pdb_fromcif.py#L164-L167It is not surprising that a
HETATM
should not have a residue number, and so this is causing the issue in question. The dilemma though is how to actually choose a number for aHETATM
that isn't related to one particular residue. All of theHETATM
examples in the PDB docs do actually have an integer residue number, but they are only small molecules: https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#HETATM