Open fgvieira opened 7 years ago
Thanks for reporting this issue.
VSEARCH includes masked regions in the alignment score, as does USEARCH:
http://drive5.com/usearch/manual/masking.html
When searching or clustering, masking is only used to exclude masked words or kmers from the heuristic search procedure. It saves time because lots of repetitive sequences may be excluded.
To make VSEARCH ignore the masked region in the alignment score, you could use the VSEARCH option --hardmask
, which will replace the masked symbols with N's and give them a zero alignment score.
In the example shown almost the entire second sequence is masked and there are actually no unmasked 8-mer words left. In that special situation the kmer-based heuristic is actually bypassed and the query sequence is fully aligned to all the other sequences and the optimal alignment shown is found and reported.
This case shows that VSEARCH should probably just ignore sequences that are almost completely masked (no remaining 8-mers) (or shorter than 8 nucleotides).
Isn't it a bit counter-intuitive that masked regions count for the alignment score? I mean, the main reason people mask them is because they are difficult to align and can give spurious hits; if so, should we trust these regions to compute an alignment score (and assess an alignment)?
cheers,
I agree, but in this case VSEARCH will treat the masked regions similarly to USEARCH for compatibility reasons.
ok... but right now VSEARCH has options for:
--match INT score for match (2)
--mismatch INT score for mismatch (-4)
what about adding an option to set the score of a masked position? By default it would be the same as --match
but the user could decide to ignore them (set it to 0) or treat them as mismatches (set it to -4).
How does VSEARCH scores alignments on masked regions? I would expect it to discard hits between masked regions but, when clustering the sequences:
with the command:
vsearch --cluster_fast test.fas --id 0.95 --msaout -
VSEARCH actually clusters them even though one of them is almost completely masked:
Does this make sense? Or is it a bug? thanks,