rega-cev / virulign

VIRULIGN: fast codon-correct alignment and annotation of viral genomes
GNU General Public License v2.0

Segmentation fault when > 40 Ns in a sequence #12

Open tseemann opened 5 years ago

tseemann commented 5 years ago

How many N chars in input sequence can virulign tolerate?

plibin-vub commented 5 years ago

Good question.

When I use virulign in my own work, I always remove the n-chars from the input sequence. The idea is that such characters carry only positional information, which is exactly what you are trying to find by aligning the sequences, so they can only disrupt the procedure (e.g., when the n-chars don't make sense). When blocks of n-chars are removed, they will be detected as an amino acid gap in the alignment, and single removed n-chars will be fixed by virulign's frameshift detection step.
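For illustration, a minimal sketch of that kind of pre-processing, assuming plain FASTA input. The helper names `strip_n_chars` and `strip_fasta` are hypothetical and are not part of virulign; this only cleans the input before it is handed to the aligner.

```python
def strip_n_chars(sequence: str) -> str:
    """Return the sequence with every 'N' or 'n' removed."""
    return "".join(base for base in sequence if base not in "Nn")


def strip_fasta(lines):
    """Yield FASTA lines with n-chars removed from sequence lines only.

    Header lines (starting with '>') are passed through untouched, and
    sequence lines that become empty after stripping are dropped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield line
        else:
            cleaned = strip_n_chars(line)
            if cleaned:
                yield cleaned


if __name__ == "__main__":
    example = [">sample1\n", "ACGTNNNNACGT\n", "nnAC\n"]
    for out_line in strip_fasta(example):
        print(out_line)
```

Note that this discards the positions of the removed characters, so it only fits the workflow where the aligner is expected to recover them as gaps or frameshifts.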

I would argue that we should either give an error message when the sequence contains n-chars or remove these symbols by default.

tseemann commented 5 years ago

You have a good point - but in this case the alignment DOES have a base, we just don't know which one it is due to poor capillary sequence quality at that part of the virus amplicon. Putting a gap would be incorrect phylogenomically; there is no deletion.

That said, I think some warning or message is critical, and an option to clean the sequences to the state you need them is important.
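As a rough sketch of what such a check could look like outside virulign itself: the function names, the `max_n` threshold (taken from the 40 in this issue's title, not from any documented limit), and the `clean` switch are all assumptions for illustration.

```python
import re
import sys


def n_report(sequence):
    """Return (total number of N/n characters, length of the longest run)."""
    runs = [len(match.group()) for match in re.finditer(r"[Nn]+", sequence)]
    return sum(runs), max(runs, default=0)


def check_sequence(name, sequence, max_n=40, clean=False):
    """Warn on stderr when a sequence has more than max_n N characters.

    max_n=40 is an assumed threshold based on the issue title.
    With clean=True the N characters are removed; otherwise the
    sequence is returned unchanged so no base position is lost.
    """
    total, longest = n_report(sequence)
    if total > max_n:
        print(
            f"warning: {name} contains {total} N characters "
            f"(longest run: {longest})",
            file=sys.stderr,
        )
    if clean:
        return re.sub(r"[Nn]", "", sequence)
    return sequence


if __name__ == "__main__":
    seq = "ACGT" + "N" * 45 + "ACGT"
    check_sequence("sample1", seq)              # warns, keeps the Ns
    check_sequence("sample1", seq, clean=True)  # warns, strips the Ns
```

Leaving `clean` off by default keeps the unknown bases in place, which matches the concern that replacing them with a gap would imply a deletion that is not there.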

CC: @schultzm