`ValueError: invalid literal for int() with base 10: '?'` when handling `HETATM`

multimeric commented 2 years ago

Describe the bug When pdb_fromcif encounters any HETATM that doesn't have a label_seq_id (which I would have thought would be none of them), we get: ValueError: invalid literal for int() with base 10: '?' when handling HETATM

To Reproduce Steps to reproduce the behavior. Preferrably, a code snippet.

curl https://alphafill.eu/v1/aff/P04406 | pdb_fromcif

Expected behavior Should output a PDB file without crashing

Desktop (please complete the following information): n/a

Discussion

Okay so what's happening here is that firstly we're treating ATOM and HETATM the same: https://github.com/haddocking/pdb-tools/blob/3ffb016b501f9e42fdc7a5b1021ddccaa5ddd024/pdbtools/pdb_fromcif.py#L118

Then, the issue is here, where we try to find resnum, the residue number: https://github.com/haddocking/pdb-tools/blob/3ffb016b501f9e42fdc7a5b1021ddccaa5ddd024/pdbtools/pdb_fromcif.py#L164-L167

It is not surprising that a HETATM should not have a residue number, and so this is causing the issue in question. The dilemma though is how to actually choose a number for a HETATM that isn't related to one particular residue. All of the HETATM examples in the PDB docs do actually have an integer residue number, but they are only small molecules: https://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#HETATM

multimeric commented 2 years ago

Also I think there is an argument for only using auth_seq_id if it has a non empty (? or .) value, not just if the field exists at all.

amjjbonvin commented 2 years ago

It is actually very surprising that a HETATM does not have a proper residue number…

On 24 Mar 2022, at 16:06, Michael Milton @.***> wrote:

It is not surprising that a HETATM should not have a residue number, and so this is causing the issue in question. The dilemma though is how to actually choose a number for a HETATM that isn't related to one particular residue.

JoaoRodrigues commented 2 years ago

I know this is a common comment from tool developers, but I don't think this is a bug on our end.

The structure you're trying to parse does not really abide to the standard. If you look at a deposited RCSB structure of P04406 (e.g. https://files.rcsb.org/view/1U8F.cif) you'll see that the NAD residue (and even the waters) have auth_seq_id numbers. Your structure is coming from alphafill, that adds cofactors to alphafold models, so I'm wondering if they blank the ids of the grafted ligands to avoid any clashes with existing residues. There's very little we can do on our end here, short of writing bad PDB files. I'd raise this error upstream with the developers of alphafill.

multimeric commented 2 years ago

Fair enough. I can see in the spec that _atom_site.label_seq_id is required (although auth_seq_id is not, so I don't think it should crash if auth_seq_id is a key that is present but it has no value).

What's weird is that label_seq_id is supposed to reference an entry in the _entity_poly_seq table, but in your example, the HETATMs have a label_seq_id, but it doesn't point to anything in the _entity_poly_seq, which makes sense since it's not part of the polymer. So I guess I can just put in any unique integer in that column and it should make pdb-tools happy.

joaomcteixeira commented 2 years ago

I was reviewing this issue, and I agree with @JoaoRodrigues answer.

@multimeric could you solve your issue by adding the residue numbers manually?

multimeric commented 2 years ago

So I did investigate this further. I went back to the AlphaFill people, but they argue that their structures are actually correct, because, as mentioned, it makes no sense for a non-peptide molecule to have a peptide index. In addition, auth_seq_id is not mandatory. I'm inclined to agree with this reasoning.

So considering this, I found two solutions. The first, and best solution, is to do the CIF → PDB conversion using https://github.com/PDB-REDO/cif-tools (incidentally made by the same group that made AlphaFill). It does a remarkable job at preserving metadata and ensuring all the required headers are present.

The other method, which uses pdb-tools, is indeed to "fix" the CIF file, which is something I wrote a little CLI tool to do (before realising the above solution): https://github.com/multimeric/mmcifix. You can pip install it and then run mmcifix --fixer label_seq_id --fixer auth_seq_id. The logic implemented by that tool is to just find the highest ID already existing in the file, and set the IDs of the HETATM to higher IDs which are incrementing.

JoaoRodrigues commented 2 years ago

I am still not convinced :) The structure is not valid. It should have at least _atom_site.label_seq_id that we would default on if the auth_ wasn't readable.

I downloaded the file you posted in your first message and ran it through pdb_fromcif and it crashes, but because it finds a chain id that is incompatible with the PDB format (AA, two characters). We could add an option to force reading label_seq_id and friends (not *auth*) but that's as far as I think we can go with such a rudimentary tool.

Our intention was never to support mmCIF files properly as they are a pain and not very amenable to the type of serial processing PDB files lend themselves very nicely to. I'll keep an eye on your mmcifix tools!

I'll close this issue but feel free to re-open if you think it's worth discussing further!

haddocking / pdb-tools

`ValueError: invalid literal for int() with base 10: '?'` when handling `HETATM` #126