ihmwg / IHMCIF

📖 mmCIF support for hybrid/integrative models
https://pdb-dev.wwpdb.org
Creative Commons Zero v1.0 Universal

Atomic multi-state structures break some of the struct_group categories in the PDBx/mmCIF dictionary #70

Open brindakv opened 6 years ago

brindakv commented 6 years ago

Although the atom_site category has been extended in the IHM-dictionary to accommodate compositionally different multi-state structures, the struct_group categories still assume uniform composition across models (e.g. struct_conf, struct_conn). Data categories that are derived from the coordinates in the atom_site category and assume uniform composition will therefore break in case of atomic multi-state structures. This could be addressed either in the PDBx/mmCIF dictionary or in the IHM-dictionary extension.

benmwebb commented 6 years ago

Perhaps I don't understand the issue here, but as far as I can see both struct_conf and struct_conn point to one or more pairs of asym_id/seq_id. I haven't seen any multi-state models (atomic or coarse-grained) where, for a given asym_id, the composition is so different that a given seq_id refers to a different part of the structure in two different states. Why would it? If the sequence is different it would have to be a different entity, and thus a distinct asym_id.
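
To make that assumption testable, here is a minimal Python sketch that scans atom_site-like rows and reports any asym_id/seq_id pair that names more than one residue type across models. The row layout (model number, asym_id, seq_id, comp_id) and the function name `find_residue_conflicts` are hypothetical simplifications for illustration, not part of any mmCIF library, and the sketch ignores legitimate microheterogeneity.

```python
from collections import defaultdict

def find_residue_conflicts(atom_rows):
    """Return {(asym_id, seq_id): {comp_id, ...}} for pairs that map to
    more than one residue type across all models.

    atom_rows: iterable of (model_num, asym_id, seq_id, comp_id) tuples,
    a hypothetical, simplified view of the atom_site category.
    """
    seen = defaultdict(set)
    for _model_num, asym_id, seq_id, comp_id in atom_rows:
        seen[(asym_id, seq_id)].add(comp_id)
    return {key: comps for key, comps in seen.items() if len(comps) > 1}

# Two models with different composition: model 1 has chain A only,
# model 2 has part of chain A plus chain B.
rows = [
    (1, "A", 1, "MET"),
    (1, "A", 2, "GLY"),
    (2, "A", 1, "MET"),
    (2, "B", 1, "SER"),
]
conflicts = find_residue_conflicts(rows)
print(conflicts)
```

An empty result means every asym_id/seq_id pair refers to the same residue type wherever it appears, which is the condition under which model-independent struct_conf and struct_conn rows remain unambiguous.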

tomgoddard commented 6 years ago

The question is whether mmCIF atom_site allows multiple models that have different sets of atoms. I have never seen such a file. I agree that as long as every model using a specific asym_id refers to the same entity, there may be no problem. But I think no mmCIF reader in use today is likely to handle that correctly. If code cannot assume that models contain identical atoms, it needs to check whether the different models have identical sets of atoms, since connectivity would then have to be determined separately for each model.
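
The check described above could look roughly like the following Python sketch. It assumes the atom_site rows have already been parsed into (model number, asym_id, seq_id, atom_id) tuples; that layout and the function names are hypothetical and not taken from any existing reader.

```python
from collections import defaultdict

def atom_sets_by_model(atom_rows):
    """Group atom identities by model number.

    atom_rows: iterable of (model_num, asym_id, seq_id, atom_id) tuples,
    a hypothetical, simplified view of the atom_site category.
    """
    sets = defaultdict(set)
    for model_num, asym_id, seq_id, atom_id in atom_rows:
        sets[model_num].add((asym_id, seq_id, atom_id))
    return dict(sets)

def models_have_identical_atoms(atom_rows):
    """True if every model contains exactly the same set of atoms."""
    sets = list(atom_sets_by_model(atom_rows).values())
    return all(s == sets[0] for s in sets[1:])

uniform = [(1, "A", 1, "CA"), (2, "A", 1, "CA")]
mixed = [(1, "A", 1, "CA"), (2, "B", 1, "CA")]
print(models_have_identical_atoms(uniform))
print(models_have_identical_atoms(mixed))
```

A reader could compute connectivity once when the check passes and fall back to per-model connectivity only when it fails.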

In the old PDB format the specification explicitly says that multiple models must have identical atoms:

"each model should have the exact same atoms (hydrogen and heavy atoms), sequence and chemistry."

http://www.wwpdb.org/documentation/file-format-content/format33/sect9.html#MODEL

For the mmCIF format there is not even a field for handling multiple models in atom_site; this is only added by PDBx as the _atom_site.pdbx_PDB_model_num field, and the documentation merely says “PDB model number”, so there is no telling what this means.

http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx.dic/Items/_atom_site.pdbx_PDB_model_num.html

brindakv commented 6 years ago

The problem is that struct_conf, struct_conn and some of the other struct_group categories do not have a data item pointing to _atom_site.pdbx_PDB_model_num. Therefore, they are only populated for the first model in an ensemble. The assumption of a homogeneous ensemble is therefore implicit.

benmwebb commented 6 years ago

They don't need to reference a model number, because a given seq_id/asym_id pair should be valid for all models. I'd assume if you have one model with only chain A in it, and another model with only chain B, your struct_conf would contain entries for both asym_id=A and asym_id=B.
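
As a rough illustration of that reading, the Python sketch below keeps one model-independent list of struct_conf rows and resolves which rows apply to a given model by checking that every residue in the range is present in that model. The dictionary keys (`id`, `asym_id`, `beg_seq_id`, `end_seq_id`) are simplified, hypothetical stand-ins for the real struct_conf data items, and `struct_conf_for_model` is an illustrative helper, not an existing API.

```python
def struct_conf_for_model(conf_rows, atom_rows, model_num):
    """Return the struct_conf rows whose whole residue range exists in one model.

    conf_rows: list of dicts with hypothetical keys
               'id', 'asym_id', 'beg_seq_id', 'end_seq_id'.
    atom_rows: iterable of (model_num, asym_id, seq_id) tuples.
    """
    present = {(asym, seq) for m, asym, seq in atom_rows if m == model_num}
    return [
        row for row in conf_rows
        if all((row["asym_id"], seq) in present
               for seq in range(row["beg_seq_id"], row["end_seq_id"] + 1))
    ]

# Model 1 contains only chain A, model 2 only chain B.
atoms = [(1, "A", 1), (1, "A", 2), (1, "A", 3), (2, "B", 1), (2, "B", 2)]
conf = [
    {"id": "HELX1", "asym_id": "A", "beg_seq_id": 1, "end_seq_id": 3},
    {"id": "HELX2", "asym_id": "B", "beg_seq_id": 1, "end_seq_id": 2},
]
print([row["id"] for row in struct_conf_for_model(conf, atoms, 1)])
print([row["id"] for row in struct_conf_for_model(conf, atoms, 2)])
```

This only covers the case where the rows are valid for every model that contains the residues; it does not address a structure whose secondary structure differs between states for the same residues.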