rcsb / mmtf

The specification of the MMTF format for biological structures
http://mmtf.rcsb.org/

Chemical component identifier lost for unobserved non-standard residues #43

Open josemduarte opened 5 years ago

josemduarte commented 5 years ago

Since MMTF stores the SEQRES groups as 1-letter code strings, the chemical component ID of any residue that is both non-standard and unobserved is lost. For example, chain E of 2X3T (a glycopeptide) contains several unobserved non-standard amino acids, which are represented as "KXXXXXXEX". For observed groups, the chemical component identifier can be recovered from the ATOM information, but for unobserved groups it cannot.
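To make the loss concrete, here is a minimal Python sketch, not taken from any MMTF library. The mapping table is deliberately abbreviated, and "NS1" and "NS2" are placeholder IDs standing in for two different non-standard residues (they are not the actual components of 2X3T). It assumes that anything without a standard 1-letter code becomes "X".

```python
# Abbreviated, illustrative mapping; any ID not listed collapses to "X".
ONE_LETTER = {"LYS": "K", "GLU": "E"}


def to_one_letter(comp_ids):
    """Collapse a list of chemical component IDs into a 1-letter string."""
    return "".join(ONE_LETTER.get(comp_id, "X") for comp_id in comp_ids)


# "NS1" and "NS2" are placeholder IDs for two different non-standard residues.
seq_a = ["LYS", "NS1", "GLU"]
seq_b = ["LYS", "NS2", "GLU"]

encoded_a = to_one_letter(seq_a)
encoded_b = to_one_letter(seq_b)

# Two different sequences produce the same string, so the original
# component IDs cannot be recovered from the 1-letter sequence alone.
print(encoded_a, encoded_b)
```

Both sequences encode to the same string, which is why the identifier has to come from somewhere else (the ATOM records for observed groups, nothing at all for unobserved ones).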

josemduarte commented 5 years ago

A possible solution proposed by @pwrose is to store the full chemical component ID with the group data for unobserved residues here: https://github.com/rcsb/mmtf/blob/master/spec.md#group-data

However, that requires either a new flag (observed: yes/no) or making formalChargeList, elementList and atomNameList optional fields (they are currently required).

gtauriello commented 5 years ago

Given that the group data in MMTF only lists the observed atoms, I would say that an unobserved residue could be represented with a group which has 0-length arrays formalChargeList, atomNameList and elementList. I don't see a problem with those arrays being empty. At least the C++ decoder/encoder shouldn't have any issues with it.

Given that the fields are required, the arrays should always be written in the MMTF file, but there is no problem with writing 0-length arrays in msgpack.
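As a rough sketch of this suggestion, an unobserved residue could be written as a group entry whose per-atom arrays are all present but empty. The helpers `make_unobserved_group` and `is_observed` below are hypothetical names, not part of any MMTF API; the field names are assumed to follow the group data section of the spec, and "NS1" is a placeholder component ID with an illustrative component type.

```python
def make_unobserved_group(comp_id, chem_comp_type):
    """Build a group entry that keeps the component ID but lists no atoms.

    Hypothetical helper: all array fields are present but zero-length.
    """
    return {
        "groupName": comp_id,          # full chemical component ID is kept
        "singleLetterCode": "X",       # assumed code for a non-standard residue
        "chemCompType": chem_comp_type,
        "atomNameList": [],            # no observed atoms
        "elementList": [],
        "formalChargeList": [],
        "bondAtomList": [],            # no atoms, hence no bonds
        "bondOrderList": [],
    }


def is_observed(group):
    """Treat a group as observed if it lists at least one atom."""
    return len(group["atomNameList"]) > 0


# "NS1" is a placeholder ID; the component type is illustrative.
group = make_unobserved_group("NS1", "L-PEPTIDE LINKING")
print(group["groupName"], is_observed(group))
```

With this layout a separate observed flag would not be needed, since a reader can infer it from the length of `atomNameList`. Whether existing decoders in every language accept zero-length arrays would still need to be checked.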