torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
666 stars 123 forks source link

VSEARCH clustering sequences with masked regions #265

Open fgvieira opened 7 years ago

fgvieira commented 7 years ago

How does VSEARCH scores alignments on masked regions? I would expect it to discard hits between masked regions but, when clustering the sequences:

>HWI-ST397:349:C2YYMACXX:3:1102:20033:76341
TGCATAATAAGGAAAAGGCAGAACTCTTACTGTATTAGAATTTCAAATGAAAAAGTATATACATTTGTGTGCATATACATATATATATACA
>HWI-ST397:349:C2YYMACXX:3:2103:12330:15129
TGCATATGCATATATATATACACACACCCACGTGTATCTATACTATATAGAATATCTATCTAGTATGCA

with the command: vsearch --cluster_fast test.fas --id 0.95 --msaout -

VSEARCH actually clusters them even though one of them is almost completely masked:

>*HWI-ST397:349:C2YYMACXX:3:1102:20033:76341
TGCATAATAAGGAAAAGGCAGAACTCTTACTGTATTAGAATTTCAAATGAAAAAGTATATACATTTGTGTGCATATACATATATATATACA-----------------------------------------------
>HWI-ST397:349:C2YYMACXX:3:2103:12330:15129
---------------------------------------------------------------------TGCatatgcatatatatatacacacacccacgtgtatctatactatatagaatatctatctaGTATGCA
>consensus
TGCATAATAAGGAAAAGGCAGAACTCTTACTGTATTAGAATTTCAAATGAAAAAGTATATACATTTGTGTGCATATACATATATATATACA+++++++++++++++++++++++++++++++++++++++++++++++

Does this make sense? Or is it a bug? thanks,

torognes commented 7 years ago

Thanks for reporting this issue.

VSEARCH includes masked regions in the alignment score, as does USEARCH:

http://drive5.com/usearch/manual/masking.html

When searching or clustering, masking is only used to exclude masked words or kmers from the heuristic search procedure. It saves time because lots of repetitive sequences may be excluded.

To make VSEARCH ignore the masked region in the alignment score, you could use the VSEARCH option --hardmask, which will replace the masked symbols with N's and give them a zero alignment score.

In the example shown almost the entire second sequence is masked and there are actually no unmasked 8-mer words left. In that special situation the kmer-based heuristic is actually bypassed and the query sequence is fully aligned to all the other sequences and the optimal alignment shown is found and reported.

This case shows that VSEARCH should probably just ignore sequences that are almost completely masked (no remaining 8-mers) (or shorter than 8 nucleotides).

fgvieira commented 7 years ago

Isn't it a bit counter-intuitive that masked regions count for the alignment score? I mean, the main reason people mask them is because they are difficult to align and can give spurious hits; if so, should we trust these regions to compute an alignment score (and assess an alignment)?

cheers,

torognes commented 7 years ago

I agree, but in this case VSEARCH will treat the masked regions similarly to USEARCH for compatibility reasons.

fgvieira commented 7 years ago

ok... but right now VSEARCH has options for:

--match INT                 score for match (2)
--mismatch INT              score for mismatch (-4)

what about adding an option to set the score of a masked position? By default it would be the same as --match but the user could decide to ignore them (set it to 0) or treat them as mismatches (set it to -4).