Open pwrose opened 6 years ago
One idea there: for decoders it should be fairly simple to accept Array and Binary types interchangeably (at least rcsb/mmtf-cpp doesn't have a problem with it, and I consider C++ to be rather strict about types). As such, one could relax the typing for lists so that they don't necessarily have to be of Binary type, which gives the encoders more flexibility. Even now, the encoding strategy for binary formats is not strictly required to be fixed for decoders to work.
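To illustrate the decoder side, here is a minimal sketch of a function that accepts either representation. It assumes my reading of the MMTF spec: a Binary field starts with a 12-byte header of three big-endian int32 values (codec type, element count, codec parameter), and codec 5 stores null-padded fixed-length strings. The name decode_chain_names is hypothetical and not part of any MMTF library.

```python
import struct


def decode_chain_names(field):
    """Return a list of chain names from a Binary or an Array value.

    Hypothetical helper; the header layout and codec number are assumptions
    based on the MMTF spec, not code taken from an existing decoder.
    """
    if isinstance(field, (bytes, bytearray)):
        # Binary: 12-byte header = codec type, element count, string length.
        codec, count, strlen = struct.unpack(">iii", bytes(field[:12]))
        if codec != 5:
            raise ValueError(f"unexpected codec {codec} for a string field")
        payload = bytes(field[12:])
        # Split into fixed-size chunks and strip the null padding.
        return [
            payload[i * strlen:(i + 1) * strlen].rstrip(b"\x00").decode("ascii")
            for i in range(count)
        ]
    # Array: the MessagePack layer already produced a list of strings.
    return [n.decode("ascii") if isinstance(n, bytes) else n for n in field]
```

With a dispatch like this, the decoder does not care which of the two encodings the encoder picked.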
Now for your proposed change, this would mean that the encoders would have to become "smarter" and choose an appropriate encoding for the chain names. If there is a reasonable maximum chain name length (e.g. <= 4), the binary encoding can be used; otherwise an Array of String can be used instead.
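A "smarter" encoder along these lines could be sketched as follows. This is only an illustration under assumptions: the 12-byte header layout and codec 5 (fixed-length string array) follow my reading of the MMTF spec, the threshold of 4 is the example value from above, and encode_chain_names is a hypothetical name.

```python
import struct

MAX_FIXED_LEN = 4  # assumed threshold, taken from the "<= 4" example above


def encode_chain_names(names, max_fixed_len=MAX_FIXED_LEN):
    """Pick Binary (codec 5) for short names, else a plain Array of String.

    Hypothetical helper; not the API of any existing MMTF encoder.
    """
    longest = max((len(n.encode("ascii")) for n in names), default=0)
    if longest > max_fixed_len:
        # Too long for the fixed-width encoding: fall back to Array of String.
        return list(names)
    # Header: codec type 5, number of strings, fixed string length.
    header = struct.pack(">iii", 5, len(names), max_fixed_len)
    # Pad every name with null bytes up to the fixed length.
    body = b"".join(
        n.encode("ascii").ljust(max_fixed_len, b"\x00") for n in names
    )
    return header + body
```

Files whose chain names all fit the limit would stay byte-compatible with the current spec, and only files with longer names would need a decoder that accepts an Array.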
The alternative, of course, is to change the spec to be fixed to Array of String, but this would break compatibility with the current spec.
All of this assumes that no one currently relies on chain names being fixed at length 4.
In terms of implementing it, I can only speak for the rcsb/mmtf-cpp library where I don't see any problem with using chain names/ids of variable length.
I support having long chain names, but just for the record, http://mmcif.wwpdb.org/docs/large-pdbx-examples/ states:
Chain identifiers of up to 4 characters are permitted. The PDB chain identifier corresponds to the "_atom_site.auth_asym_id" data item.
which is sad.
"for decoders it should be fairly simple to accept Array and Binary types interchangeably"
mmtf-c and simplemmtf-python already support this. Example:
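import simplemmtf
d = simplemmtf.fetch('1rx1')
# Overwrite the chain names with strings of varying length, some longer than 4
d._data['chainNameList'] = ['ABCD', 'EFGHIJKL', 'MNOPQRSTUVWXY', 'Z']
with open('foo.mmtf', 'wb') as f:
    f.write(d.encode())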
The file can be loaded into PyMOL, which uses mmtf-c.
For the record, no length limitations mentioned here: http://mmcif.wwpdb.org/dictionaries/mmcif_mdb.dic/Items/_atom_site.auth_asym_id.html
For some use cases longer chain names/Ids are required, e.g., to encode the symmetry operator when creating biological assemblies.
It would be best if the chain names/Ids can have a flexible length.