refresh-bio / FAMSA

Algorithm for ultra-scale multiple sequence alignments (3M protein sequences in 5 minutes and 24 GB of RAM)
GNU General Public License v3.0

Fix * char emitted for unknown residues, emits X instead #15

Closed: milot-mirdita closed this 4 years ago

milot-mirdita commented 4 years ago

We are currently evaluating replacing clustalo with famsa in the Uniclust/Uniref HHblits database workflow. We are aligning nearly 7 million non-singleton clusters of a Uniref clustered to 30% sequence identity with famsa. About 800 MSAs were failing in later stages. After manually inspecting a few of them, I found that they contained the stop codon symbol * in positions where the input originally had Selenocysteine (U) or Pyrrolysine (O). This change emits the unknown residue X instead.
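For anyone who wants to check their own output for the same symptom, here is a minimal Python sketch that scans an aligned FASTA file for the `*` symbol. It is not part of famsa or of this patch; the function names `read_fasta` and `find_stop_symbols` are hypothetical, and it assumes plain FASTA with `>` header lines.

```python
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_stop_symbols(lines: Iterable[str]) -> Dict[str, List[int]]:
    """Map each header to the 0-based columns where '*' occurs."""
    hits: Dict[str, List[int]] = {}
    for header, seq in read_fasta(lines):
        cols = [i for i, ch in enumerate(seq) if ch == "*"]
        if cols:
            hits[header] = cols
    return hits


# Made-up records, only to show the call shape.
example = [">seq1", "MKV-LA", ">seq2", "MK*-L*"]
print(find_stop_symbols(example))
```

To scan a real file, pass an open file handle instead of the list, since the functions only need an iterable of lines.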

The gpu branch of the code also defines this constant. However, since I do not have a GPU to test my changes, I did not touch that code.

Alternatively the code could be reworked to also support the three missing residues O, U and J. However, for my purposes, I would prefer to emit X.
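To illustrate the behaviour I am after, here is a rough Python sketch of the fallback, not the actual C++ change. The names `SUPPORTED`, `UNKNOWN`, `GAP` and `sanitize_residues` are hypothetical, and the supported alphabet is an assumption (the 20 standard residues plus B, Z and X), based only on O, U and J being the ones reported as missing.

```python
# Assumed alphabet: 20 standard amino acids plus the ambiguity codes B, Z, X.
# This set is an illustration, not taken from the famsa source.
SUPPORTED = frozenset("ACDEFGHIKLMNPQRSTVWYBZX")
UNKNOWN = "X"
GAP = "-"


def sanitize_residues(seq: str) -> str:
    """Replace every symbol outside the assumed alphabet with X.

    Gap characters are kept so that aligned sequences keep their length.
    """
    out = []
    for ch in seq.upper():
        if ch in SUPPORTED or ch == GAP:
            out.append(ch)
        else:
            # Covers O, U, J, '*' and anything else unexpected.
            out.append(UNKNOWN)
    return "".join(out)


print(sanitize_residues("MKUOJ*A-"))  # MKXXXXA-
```

The point of the fallback is that downstream tools only ever see symbols they already accept, at the cost of losing the identity of the rare residues.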