Open tseemann opened 5 years ago
Good question.
When I use virulign in my own work, I always remove the n-chars from the input sequence. The idea is that such characters carry only positional information, which is exactly what you are trying to find by aligning the sequences, so they can only disrupt the procedure (e.g., in case the n-chars don't make sense). When blocks of n-chars are removed, they will be detected as amino acid gaps in the alignment, and single n-chars will be fixed by virulign's frameshift detection step.
I would argue that we should either give an error message when the sequence contains n-chars or remove these symbols by default.
You have a good point - but in this case the alignment DOES have a base, we just don't know which one it is due to poor capillary sequence quality at that part of the virus amplicon. Putting a gap would be incorrect phylogenomically; there is no deletion.
That said, I think some warning or message is critical, and an option to clean the sequences into the state you need is important.
CC: @schultzm
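To make the cleaning-plus-warning idea discussed above concrete, here is a minimal sketch of a pre-processing step run outside virulign, before alignment. It is not part of virulign; the function names `strip_n` and `clean_fasta` are hypothetical, and the sketch assumes plain FASTA text where only `N`/`n` needs handling (other IUPAC ambiguity codes are left untouched).

```python
import sys


def strip_n(seq):
    """Return (sequence without N/n, number of characters removed)."""
    cleaned = seq.replace("N", "").replace("n", "")
    return cleaned, len(seq) - len(cleaned)


def clean_fasta(text, log=sys.stderr):
    """Remove N/n from every sequence line of a FASTA string.

    Returns (cleaned FASTA text, {header: removed count}) and writes one
    warning per affected record to `log`. Hypothetical helper, not a
    virulign API.
    """
    out = []
    counts = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">"):
            # Header line: start counting for a new record.
            current = line[1:].strip()
            counts[current] = 0
            out.append(line)
        elif line.strip():
            cleaned, removed = strip_n(line.strip())
            if current is not None:
                counts[current] += removed
            if cleaned:  # skip lines that consisted only of N
                out.append(cleaned)
    for header, removed in counts.items():
        if removed:
            print(f"warning: removed {removed} N character(s) from {header}",
                  file=log)
    return "\n".join(out) + "\n", counts
```

Note the trade-off raised earlier in the thread: after this step, a run of unknown bases will show up as a gap in the alignment even though no deletion occurred, so the per-record counts are worth keeping alongside the results.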
How many N chars in input sequence can virulign tolerate?