Closed minhuanli closed 6 months ago
I think it's safer to stick to the columns in given in the PDB spec. For example, B-factor is expected in columns 61-66, so if a file has .00999
there – that's what is read.
Perhaps something could be done about writing such files.
Was this file converted from a publicly available mmCIF?
Thanks for the reply. Yeah, I see what you mean.
Actually the pdb file was written out with gemmi
. I wrote my own util function to convert arrays into a gemmi structure object and save out using write_pdb
. The positions and B factors in those arrays and the gemmi.structure object were correct before saving them out.
Yes, there is no check for overflow of the columns when writing coordinates. I'll add it. But I wonder what to when the numbers don't fit.
One options is to trim last digits (for instance the last digit in 31256.965
). It wouldn't be fully spec-conforming, because the number should be Real(8.3), but as close as possible.
Another option is to stop with error. In your example, and I suppose in all such cases, it's possible to shift the model toward zero. Having smaller numbers could avoid other problems down the road, so it could be better to stop and inform the user about it.
Was it AF that produced such coordinates?
Yes, the protein structure was predicted using AlphaFold, but we implemented some hacky procedures during its forward pass to guide the prediction. Given the widespread experimentation with AlphaFold methodologies, it's foreseeable that more instances of such non-standard models will arise in the future.
In our particular application, we aim to avoid computational interruptions due to errors. However, it would be beneficial to implement warning mechanisms for detecting these scenarios. Either rounding off the last significant digits or diminishing the model coordinates sounds good to me. As long as B factors are maintained at 999.99, future analysis pipelines will know this is an unreliable predictions.
In the end I did this:
One options is to trim last digits (for instance the last digit in 31256.965). It wouldn't be fully spec-conforming, because the number should be Real(8.3), but as close as possible.
Hi,
This is a bit of an unusual situation. Lately, we've been tinkering with AlphaFold, and it's producing some pretty out-there results. Sometimes the output models are just plain wild and nonsensical, like what I'll show you below:
The columns are all squished up, so when I used
gemmi.read_pdb
to parse, it might get confused and mess up the positions, occupancies, and isotropic B factors.For our purposes, and I reckon for others dealing with predicted models too, it's not a big deal if we get some wonky atom positions in these extreme cases. They're already so far off and don't mean much. But having incorrectly parsed B factors is a real issue because they usually reflect the model's confidence (usually people convert
plddt
into a pseudoBs). So here getting a 0.009 instead of 999.99 is not ideal.I think this could be fixed by assuming 3 digits precision during the parsing, but not quite sure if this assumption will hold generally.
Also I attached the wild pdb file for your tests and development (sorry have to change ext to be txt or the upload was prohibited.)
gemmi_bug_report_wild_protein.txt