[Unexpected behaviour] Unexpected behavoiur when parsing mmCIF files

Croydon-Brixton commented 1 year ago

Thank you for providing pdbfixer.

I was using it to fix various protein structures and noticed the following unexpected behaviour (including the code to reproduce):

The PDBFixer reports a different number of chains and chain names whether I load the same structure from a .cif or a .pdb file.
This doesn't happen for other parsers, for example the biotite parser.

=> Does this suggest that .cif files might not parsed correctly wrt. to chain names?

Code to reproduce:

from pdbfixer import PDBFixer

print("=== As loaded from CIF file ===")
fixer = PDBFixer(filename="2rb2.cif")
print("Chains:", [chain.id for chain in fixer.topology.chains()])
print(fixer.topology.atoms)

print("=== As loaded from PDB file ===")
fixer = PDBFixer(filename="2rb2.pdb")
print("Chains:", [chain.id for chain in fixer.topology.chains()])
print(fixer.topology.atoms)

print("==================================")
# As comparison parse with biotite:
import biotite.structure.io as bsio
import biotite.structure as bs

print(" === COMPARISON: Biotite from CIF ===")
biotite_from_cif = bsio.load_structure("2rb2.cif")
print(biotite_from_cif[:3], f"\nAtoms: {len(biotite_from_cif)}", f"\nChains: {bs.get_chains(biotite_from_cif)}")

print(" === COMPARISON: Biotite from PDB ===")
biotite_from_pdb = bsio.load_structure("2rb2.pdb")
print(biotite_from_pdb[:3], f"\nAtoms: {len(biotite_from_pdb)}", f"\nChains: {bs.get_chains(biotite_from_pdb)}")

Croydon-Brixton commented 1 year ago

EDIT: Looking further it seems the .cif files provide two chain IDs (_atom_site.auth_asym_id and _atom_site.label_asym_id) and the PDBFixer parser has a different preference than the biotite parser.

Is there a way to change the default entry from which PDBFixer reads the chain_id such that it will match with that provided in PDB files?

swails commented 1 year ago

From the code, it seems that the label_asym_id and auth_asym_id columns are chosen based on which one specifies the "most" different chains.

However, the auth_asym_id column is the one intended to align with what's found in the published literature (and, in my experience, the corresponding PDB file). It is also a mandatory data item, so is guaranteed to always be in a (valid) PDBx/mmCIF file.

By contrast, I've found from recent investigation that the _atom_site.label_* fields are used primarily as internal relational keys between the different sections of the PDBx/mmCIF file (e.g., to map data between different sections, say anisotropic temperature factors to atoms). For many "unimportant" atoms, like solvent and ions, it is not bothered to assign meaningful values to these atoms.

I personally think the OpenMM PDBx/mmCIF parser should stick to the auth_* fields where possible.

Croydon-Brixton commented 1 year ago

Thank you for the clarification @swails, this is very helpful.

I agree with you that it would make sense for the default behaviour to stick to the auth_* fields when possible. In this case, it seems we would simply need to delete the following lines.

peastman commented 1 year ago

See #194 and #195. We had to make it work that way because neither field consistently identifies chains in all files.

The presence of duplicate auth_ and label_ fields in PDBx/mmCIF is a mess that causes lots of problems. They don't get used consistently in all files. The documentation on them is ambiguous and sometimes contradictory. It also sometimes conflicts with how they're used in files from RCSB.

swails commented 1 year ago

Interesting... that's too bad. :(

openmm / pdbfixer

[Unexpected behaviour] Unexpected behavoiur when parsing mmCIF files #269