project-gemmi / gemmi

macromolecular crystallography library and utilities
https://project-gemmi.github.io/
Mozilla Public License 2.0
205 stars 42 forks source link

[Bug Report] read_pdb mishandle the columns of extreme wild pdb files #313

Closed minhuanli closed 1 month ago

minhuanli commented 1 month ago

Hi,

This is a bit of an unusual situation. Lately, we've been tinkering with AlphaFold, and it's producing some pretty out-there results. Sometimes the output models are just plain wild and nonsensical, like what I'll show you below:

CRYST1   74.391   95.432  115.502  90.00  90.00  90.00 P 21 21 21               
ATOM      1  N   THR A  11    31256.96532685.06460403.930  1.00999.99           
ATOM      2  CA  THR A  11    31257.48832684.45960405.148  1.00999.99           
ATOM      3  C   THR A  11    31257.42032682.93860405.070  1.00999.99           
...

The columns are all squished up, so when I used gemmi.read_pdb to parse, it might get confused and mess up the positions, occupancies, and isotropic B factors.

>>>  import gemmi
>>>  gemmi.__version__
'0.6.2'
>>> test = gemmi.read_pdb("gemmi_bug_report_wild_protein.pdb"))
>>> test[0][0][0][0].pos.x, test[0][0][0][0].pos.y, test[0][0][0][0].pos.z
(31256.96, 532685.0, 6460403.0)
# Should be (31256.965, 32685.064, 60403.930), each with 3 digits precision
>>> test[0][0][0][0].occ
930.0
# should be 1.00
>>> test[0][0][0][0].b_iso
0.009990000165998936
# should be 999.99

For our purposes, and I reckon for others dealing with predicted models too, it's not a big deal if we get some wonky atom positions in these extreme cases. They're already so far off and don't mean much. But having incorrectly parsed B factors is a real issue because they usually reflect the model's confidence (usually people convert plddt into a pseudoBs). So here getting a 0.009 instead of 999.99 is not ideal.

I think this could be fixed by assuming 3 digits precision during the parsing, but not quite sure if this assumption will hold generally.

Also I attached the wild pdb file for your tests and development (sorry have to change ext to be txt or the upload was prohibited.)

gemmi_bug_report_wild_protein.txt

wojdyr commented 1 month ago

I think it's safer to stick to the columns in given in the PDB spec. For example, B-factor is expected in columns 61-66, so if a file has .00999 there – that's what is read. Perhaps something could be done about writing such files. Was this file converted from a publicly available mmCIF?

minhuanli commented 1 month ago

Thanks for the reply. Yeah, I see what you mean.

Actually the pdb file was written out with gemmi. I wrote my own util function to convert arrays into a gemmi structure object and save out using write_pdb. The positions and B factors in those arrays and the gemmi.structure object were correct before saving them out.

wojdyr commented 1 month ago

Yes, there is no check for overflow of the columns when writing coordinates. I'll add it. But I wonder what to when the numbers don't fit.

One options is to trim last digits (for instance the last digit in 31256.965). It wouldn't be fully spec-conforming, because the number should be Real(8.3), but as close as possible.

Another option is to stop with error. In your example, and I suppose in all such cases, it's possible to shift the model toward zero. Having smaller numbers could avoid other problems down the road, so it could be better to stop and inform the user about it.

Was it AF that produced such coordinates?

minhuanli commented 1 month ago

Yes, the protein structure was predicted using AlphaFold, but we implemented some hacky procedures during its forward pass to guide the prediction. Given the widespread experimentation with AlphaFold methodologies, it's foreseeable that more instances of such non-standard models will arise in the future.

In our particular application, we aim to avoid computational interruptions due to errors. However, it would be beneficial to implement warning mechanisms for detecting these scenarios. Either rounding off the last significant digits or diminishing the model coordinates sounds good to me. As long as B factors are maintained at 999.99, future analysis pipelines will know this is an unreliable predictions.

wojdyr commented 1 month ago

In the end I did this:

One options is to trim last digits (for instance the last digit in 31256.965). It wouldn't be fully spec-conforming, because the number should be Real(8.3), but as close as possible.