Open Croydon-Brixton opened 1 year ago
EDIT: Looking further, it seems the .cif files provide two chain IDs (_atom_site.auth_asym_id and _atom_site.label_asym_id), and the PDBFixer parser has a different preference than the biotite parser.
Is there a way to change the default entry from which PDBFixer reads the chain_id such that it will match with that provided in PDB files?
From the code, it seems that the parser chooses between the label_asym_id and auth_asym_id columns based on which one specifies the larger number of distinct chains.
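To make the selection rule described above concrete, here is a minimal sketch of a "most distinct chains wins" heuristic. It is not PDBFixer's or OpenMM's actual code: the function name `pick_chain_column`, the dict-per-atom input, and the tie-breaking rule are all assumptions made for illustration, and the atom rows are invented.

```python
def pick_chain_column(rows):
    """Return the name of the column that distinguishes more chains.

    `rows` is a list of dicts, one per atom, each holding both
    'auth_asym_id' and 'label_asym_id' (hypothetical input format).
    """
    auth_ids = {row["auth_asym_id"] for row in rows}
    label_ids = {row["label_asym_id"] for row in rows}
    # Assumption: on a tie, fall back to label_asym_id. The thread does
    # not say how the real parser breaks ties.
    if len(auth_ids) > len(label_ids):
        return "auth_asym_id"
    return "label_asym_id"


# Invented example: one protein chain plus a water that shares the
# author chain ID "A" but gets its own label_asym_id "B".
atoms = [
    {"auth_asym_id": "A", "label_asym_id": "A"},
    {"auth_asym_id": "A", "label_asym_id": "A"},
    {"auth_asym_id": "A", "label_asym_id": "B"},
]
chosen = pick_chain_column(atoms)
```

In this invented case the label column wins because it separates the water into its own chain, which is one way the chain names can end up differing from those in the corresponding PDB file.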
However, the auth_asym_id column is the one intended to align with what's found in the published literature (and, in my experience, the corresponding PDB file). It is also a mandatory data item, so it is guaranteed to always be in a (valid) PDBx/mmCIF file.
By contrast, I've found from recent investigation that the _atom_site.label_* fields are used primarily as internal relational keys between the different sections of the PDBx/mmCIF file (e.g., to map anisotropic temperature factors to atoms). For many "unimportant" atoms, like solvent and ions, nobody bothers to assign meaningful values to these fields.
I personally think the OpenMM PDBx/mmCIF parser should stick to the auth_* fields where possible.
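As a rough illustration of "stick to the auth_* fields where possible", the sketch below prefers auth_asym_id and falls back to label_asym_id only when the author value is absent or a placeholder. The helper name `chain_id` and the dict-per-atom input are hypothetical; this is not how the OpenMM parser is written.

```python
# "." and "?" are the mmCIF placeholders for inapplicable and unknown values.
MISSING = {"", ".", "?"}


def chain_id(row):
    """Prefer the author chain ID; fall back to the label chain ID.

    `row` is a dict for one atom (hypothetical input format) that always
    has 'label_asym_id' and may have 'auth_asym_id'.
    """
    auth = row.get("auth_asym_id", "")
    if auth not in MISSING:
        return auth
    return row["label_asym_id"]
```

For example, `chain_id({"auth_asym_id": "A", "label_asym_id": "B"})` gives `"A"`, while a row whose auth value is `"?"` yields its label value instead.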
Thank you for the clarification @swails, this is very helpful.
I agree with you that it would make sense for the default behaviour to stick to the auth_* fields when possible.
In this case, it seems we would simply need to delete the following lines.
See #194 and #195. We had to make it work that way because neither field consistently identifies chains in all files.
The presence of duplicate auth_ and label_ fields in PDBx/mmCIF is a mess that causes lots of problems. They don't get used consistently in all files. The documentation on them is ambiguous and sometimes contradictory. It also sometimes conflicts with how they're used in files from RCSB.
Interesting... that's too bad. :(
Thank you for providing pdbfixer.
I was using it to fix various protein structures and noticed the following unexpected behaviour (including the code to reproduce):
.cif or a .pdb file.
biotite parser.
=> Does this suggest that .cif files might not be parsed correctly with respect to chain names?
Code to reproduce: