ihmwg / ModelCIF

mmCIF-based extension dictionary for computed structure models
Creative Commons Zero v1.0 Universal

How should UNK residues be handled in ma_alignment.sequence? #2

Open benmwebb opened 3 years ago

benmwebb commented 3 years ago

ma_alignment.sequence is described as "The target / template sequence in the multiple sequence alignment". But what should this look like if the target or template contains non-standard residues such as UNK? For example, we have a model built using 4buj chain E as the template, which contains a number of UNK residues. Should ma_alignment.sequence here contain X (to match entity_poly.pdbx_seq_one_letter_code_can in 4buj.cif) or (UNK) (as in entity_poly.pdbx_seq_one_letter_code)? The latter seems more flexible but would require reader software to be a little more intelligent (since it can't assume one character = one alignment position). But since the sequence is already uniquely defined elsewhere, it seems like it doesn't matter either way, just as long as the convention is defined.
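To illustrate the "one character = one alignment position" problem, here is a minimal sketch of what a reader would need to do if parenthesized codes such as (UNK) were allowed. The function names `tokenize` and `to_one_letter` are hypothetical and not part of any ModelCIF or mmCIF library, and treating `-` as a gap character is an assumption. Collapsing every parenthesized code to X is correct for UNK but only an approximation of pdbx_seq_one_letter_code_can, which can map a modified residue to its parent one-letter code instead.

```python
import re

# One alignment position is either a parenthesized residue code such as
# "(UNK)" or a single character (a one-letter code, or "-" assumed as a gap).
_TOKEN = re.compile(r"\([^()]+\)|[A-Za-z\-]")


def tokenize(seq):
    """Split a sequence string into one token per alignment position."""
    seq = "".join(seq.split())  # drop whitespace and line breaks
    tokens = []
    pos = 0
    while pos < len(seq):
        match = _TOKEN.match(seq, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}: {seq[pos]!r}")
        tokens.append(match.group(0))
        pos = match.end()
    return tokens


def to_one_letter(seq, unknown="X"):
    """Collapse every parenthesized code to `unknown` (approximation, see above)."""
    return "".join(unknown if tok.startswith("(") else tok for tok in tokenize(seq))


if __name__ == "__main__":
    example = "AC(UNK)(UNK)DE"  # made-up fragment, not taken from 4buj
    print(len(example), len(tokenize(example)))
    print(to_one_letter(example))
```

With X in the file, none of this is needed: the string length equals the number of alignment positions, and plain indexing works.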

brindakv commented 3 years ago

I suggest using X instead of (UNK) so that the sequence is a string of one-letter codes as defined in the dictionary.

benmwebb commented 3 years ago

> I suggest using X instead of (UNK) so that the sequence is a string of one-letter codes as defined in the dictionary.

Works for me - so, the canonical sequence. Can this be stated in the docs then? That should reduce the possibility of people producing files with (UNK) and friends instead.
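Until the docs say so explicitly, software that writes these files could guard against (UNK)-style entries with a simple check. This is only a sketch: `is_one_letter_sequence` is a hypothetical helper, and the allowed alphabet (uppercase letters plus `-` as a gap) is an assumption rather than something the dictionary states.

```python
import string

# Assumed alphabet: uppercase one-letter codes plus "-" for alignment gaps.
_ALLOWED = set(string.ascii_uppercase) | {"-"}


def is_one_letter_sequence(seq):
    """Return True if every character of seq is a single allowed code."""
    return len(seq) > 0 and all(ch in _ALLOWED for ch in seq)


if __name__ == "__main__":
    for candidate in ("ACXXDE", "AC(UNK)(UNK)DE"):
        print(candidate, is_one_letter_sequence(candidate))
```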