rcsb / mmtf

The specification of the MMTF format for biological structures
http://mmtf.rcsb.org/

chainNameList, chainIdList are limited to 4 characters #37

Open pwrose opened 6 years ago

pwrose commented 6 years ago

For some use cases longer chain names/Ids are required, e.g., to encode the symmetry operator when creating biological assemblies.

It would be best if the chain names/Ids could have a flexible length.

gtauriello commented 6 years ago

One idea there: for decoders it should be fairly simple to accept Array and Binary types interchangeably (at least rcsb/mmtf-cpp doesn't have a problem with it, and I consider C++ to be rather strict about types). As such, one could relax the typing for lists so that they don't necessarily have to be of Binary type, which would give the encoders more flexibility. Even now, the encoding strategy for binary fields is not strictly required to be fixed for decoders to work.
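To make the decoder side concrete, here is a minimal Python sketch of a tolerant reader. It is an illustration under stated assumptions, not code from any MMTF library: `decode_string_list` is a hypothetical helper, and the Binary layout is assumed to be the spec's fixed-length string codec (type 5) with a 12-byte big-endian header (codec, count, string width) and null-byte padding.

```python
import struct

FIXED_STRING_CODEC = 5  # assumed: MMTF codec type for fixed-length strings


def decode_string_list(value):
    """Return a list of str from either a Binary or an Array field value."""
    if isinstance(value, (bytes, bytearray)):
        # Binary: 12-byte header of three big-endian int32 values
        codec, count, width = struct.unpack('>iii', bytes(value[:12]))
        if codec != FIXED_STRING_CODEC:
            raise ValueError('unexpected codec type: %d' % codec)
        body = bytes(value[12:])
        if len(body) != count * width:
            raise ValueError('payload size does not match header')
        # Strip the null padding from each fixed-width slot
        return [
            body[i * width:(i + 1) * width].rstrip(b'\x00').decode('ascii')
            for i in range(count)
        ]
    # Array: already a sequence of strings of arbitrary length
    return [str(name) for name in value]
```

The only type-dependent step is the `isinstance` check, so supporting both representations costs a decoder very little.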

Now for your proposed change this would mean that the encoders would have to become "smarter" and choose an appropriate encoding for the chain names. If there is a reasonable max. chain name length (e.g. <= 4), the binary encoding can be used, and otherwise an Array of String can be used instead.
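The encoder-side choice could look roughly like the following Python sketch. `encode_string_list` and its `max_width` parameter are hypothetical names introduced for illustration, and the Binary layout is again assumed to be the fixed-length string codec (type 5) with a 12-byte big-endian header and null-byte padding.

```python
import struct

FIXED_STRING_CODEC = 5  # assumed: MMTF codec type for fixed-length strings


def encode_string_list(names, max_width=4):
    """Pick Binary for short names, fall back to an Array of String."""
    raw = [name.encode('ascii') for name in names]
    if all(len(r) <= max_width for r in raw):
        # Every name fits: emit header + null-padded fixed-width slots
        header = struct.pack('>iii', FIXED_STRING_CODEC, len(raw), max_width)
        return header + b''.join(r.ljust(max_width, b'\x00') for r in raw)
    # At least one name is too long: keep the names as a plain list,
    # which a MessagePack packer would serialize as an Array of String
    return list(names)
```

Files whose chain names all fit in 4 characters would stay byte-compatible with the current spec, and only files that actually need longer names would use the Array form.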

The alternative, of course, is to change the spec so that these fields are always an Array of String, but this would break compatibility with the current spec.

All of this assumes that no one is currently relying on chain names being fixed at length 4.

In terms of implementing it, I can only speak for the rcsb/mmtf-cpp library where I don't see any problem with using chain names/ids of variable length.

danpf commented 5 years ago

I support having long chain names... but just for the record, http://mmcif.wwpdb.org/docs/large-pdbx-examples/ suggests that

"Chain identifiers of up to 4 characters are permitted. The PDB chain identifier corresponds to the "_atom_site.auth_asym_id" data item."

which is sad.

speleo3 commented 5 years ago

"for decoders it should be fairly simple to accept Array and Binary types interchangeably"

mmtf-c and simplemmtf-python already support this. Example:


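import simplemmtf

d = simplemmtf.fetch('1rx1')
# Overwrite the chain names with a plain list of variable-length strings
d._data['chainNameList'] = ['ABCD', 'EFGHIJKL', 'MNOPQRSTUVWXY', 'Z']
with open('foo.mmtf', 'wb') as f:
    f.write(d.encode())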

The file can be loaded into PyMOL, which uses mmtf-c.

For the record, no length limitations mentioned here: http://mmcif.wwpdb.org/dictionaries/mmcif_mdb.dic/Items/_atom_site.auth_asym_id.html