Closed peterjc closed 1 year ago
Partly answering my own question: Looking at the code, it seems for its UNOISE implementation VSEARCH uses the Needleman-Wunsch aligner in https://github.com/torognes/vsearch/blob/master/src/align.cc and https://github.com/torognes/vsearch/blob/master/src/searchcore.cc as follows:
if (opt_cluster_unoise)
{
int d = hit->mismatches;
double skew = 1.0 * si->qsize / db_getabundance(hit->target);
double beta = 1.0 / pow(2, 1.0 * opt_unoise_alpha * d + 1);
if (skew <= beta || d == 0)
{
/* accepted */
...
Quoting the man-page https://github.com/torognes/vsearch/blob/master/man/vsearch.1 explains much of the scoring:
When comparing sequences during chimera detection, dereplication, searching and clustering, T and U are considered identical, regardless of their case. When aligning sequences, identical symbols will receive a positive match score (default +2). If two symbols are not identical, their alignment result in a negative mismatch score (default -4). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero. Alignment of two identical ambiguous symbols (for example, R vs R) also receives a score of zero. When computing the amount of similarity by counting matches and mismatches after alignment, ambiguous nucleotide symbols will count as matching to other symbols if they have at least one of the nucleotides (ACGTU) they may represent in common. For example: W will match A and T, but also any of MRVHDN. When showing alignments (for example with the --alnout option) matches involving ambiguous symbols will be shown with a plus character (+) between them while exact matches between non-ambiguous symbols will be shown with a vertical bar character (|).
This is consistent with https://github.com/torognes/vsearch/blob/master/src/align.cc which also gives the gap penalties:
default (interior) scores:
match: +2
mismatch: -4
gap open: 20
gap extend: 2
i.e. This looks like it does NW alignment, and then considers the mismatch count of the best alignment(s) for the UNOISE distance.
Summary: both USEARCH (assorted versions including UNOISE2 and UNOISE3 algorithms) and VSEARCH behave as if the distance between these sequences is 3. However, the Levenshtein distance is only 2.
Test case, centroid and variant behaving as if edit distance 3 (but Levenshtein distance is 2):
$ cat duo100.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=100
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ cat duo101.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ diff duo100.fasta duo101.fasta
3c3
< >59434599a79bdf1571eb4932d9b0d801;size=100
---
> >59434599a79bdf1571eb4932d9b0d801;size=101
UNOISE2 via USEARCH v9 shows critical ratio to be 128:
$ usearch9.1.13_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo100.fasta
00:00 148Mb 100.0% Size=100, 1 amplicons, 0 corrected reads
00:00 152Mb Sorting
00:00 149Mb 100.0% 1 good, 0 chimeras
00:00 149Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12900;
$ usearch9.1.13_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:01 208Mb 100.0% Build index
00:01 178Mb 100.0% Reading duo101.fasta
00:01 148Mb 100.0% Size=101, 1 amplicons, 0 corrected reads
00:01 152Mb Sorting
00:01 152Mb 100.0% 2 good, 0 chimeras
00:01 152Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;uniqsize=101;size=101;
$ usearch9.2.64_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo100.fasta
00:00 148Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 153Mb 100.0% 1 good, 0 chimeras
00:00 153Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
$ usearch9.2.64_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo101.fasta
00:00 148Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 156Mb 100.0% 2 good, 0 chimeras
00:00 156Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;size=101;
And again with UNOISE3 via USEARCH v10 and v11, critical ratio to be 128::
$ usearch10.0.259_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 40Mb 100.0% Reading duo100.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 15Mb 100.0% 1 good, 0 chimeras
00:00 15Mb 100.0% Writing zotus
>Zotu1
$ usearch10.0.259_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 40Mb 100.0% Reading duo101.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 18Mb 100.0% 2 good, 0 chimeras
00:00 18Mb 100.0% Writing zotus
>Zotu1
>Zotu2
$ usearch11.0.667_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 41Mb 100.0% Reading duo100.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 16Mb 100.0% 1 good, 0 chimeras
00:00 16Mb 100.0% Writing zotus
>Zotu1
$ usearch11.0.667_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 41Mb 100.0% Reading duo101.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 19Mb 100.0% 2 good, 0 chimeras
00:00 19Mb 100.0% Writing zotus
>Zotu1
>Zotu2
Again with VSEARCH,
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch
Reading file duo100.fasta 100%
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 1 Size min 12900, max 12900, avg 2.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12900
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo101.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch
Reading file duo101.fasta 100%
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 2 Size min 101, max 12800, avg 1.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
>59434599a79bdf1571eb4932d9b0d801;size=101
i.e. Ratio of exactly 1:128 merged (one centroid), but if the variant is more abundant and thus over this ratio, not merged (two centroids)
This would be fine if dist=3, since beta function uses 2^(alpha*dist+1) = 2^7 = 128
However, Levenshtein distance is 2, giving 2^(alpha*dist+1) = 2^5 = 32, meaning abundance threshold for not-merging is much higher at abundance 400 (since 12800 / 400 = 32).
I emailed Robert Edgar about the discrepancy between USEARCH and my implementation. He'd not looked at this code for a while but said it was using Levenstein distance, but with some heuristics on top of the basic algorithm. I've not tried comparing the USEARCH and VSEARCH implementations against each other to say how well it mimics those heuristics.
Hi
Sorry for the late reply. I actually wrote an answer two weeks ago, but it must have disappeared for some reason.
I think you are right that vsearch does not use Levenshtein distance, but rather the number of mismatches. I cannot answer for usearch, but we have tried to make vsearch as similar as possible in such respects.
You can actually view the alignment generated by vsearch using the option --alnout filename
. I've done that for your example, and as you see, in this case the alignment contains 3 mismatches instead of 2 indels:
$ more vsearch.alnout
vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder --alnout vsearch.alnout
vsearch v2.22.1_macos_x86_64, 16.0GB RAM, 8 cores
Query >59434599a79bdf1571eb4932d9b0d801;size=100
%Id TLen Target
99% 230 f1d4c0432c89f005ec2b4e467839f5cc;size=12800
Query 230nt >59434599a79bdf1571eb4932d9b0d801;size=100
Target 230nt >f1d4c0432c89f005ec2b4e467839f5cc;size=12800
Qry 1 + TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTA 64
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt 1 + TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTA 64
Qry 65 + TCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCT 128
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt 65 + TCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCT 128
Qry 129 + GCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAG 192
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt 129 + GCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAG 192
Qry 193 + CCCTACTTTTAAACCCATTCTGTAATACTGAACATACT 230
|| |||||||||||||||||||||||||||||||||
Tgt 193 + CCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT 230
230 cols, 227 ids (98.7%), 0 gaps (0.0%)
The difference relative to Needle is due to a somewhat different scoring system (compared to Needle, vsearch uses higher gap penalties relative to mismatch penalties). This should hopefully clarify at least part of your question.
Thank you.
Given how similarly the USEARCH and VSEARCH implementations of UNOISE seem to behave on this and my other examples (impressive reverse engineering there!), the simplest explanation is while Robert Edgar's initial implementation may have used Levenshtein distance as in Edgar 2016 manuscript https://www.biorxiv.org/content/10.1101/081257v1.full given his modifications since, it is now closer to a substitution count distance.
I think we can close this issue now (unless you want to add anything to the VSEARCH documentation about this point?).
Thanks. I'll add a note about this issue in the vsearch manual.
Added note in manual about cluster_unoise
and the calculations.
The Edgar 2016 paper says UNOISE2 used the Levenshtein distance: https://www.biorxiv.org/content/10.1101/081257v1.full
Comment https://github.com/torognes/vsearch/pull/283#issuecomment-360850868 on the original pull request adding UNOISE functionality to VSEARCH suggests from empirical testing that the USEARCH implementation of UNOISE2 and UNOISE3 was not using Levenshtein distance, but counting substitutions.
Do you recall how you came to that conclusion @davidealbanese?
I've been trying to validate my own UNOISE implementation by cross checking against VSEARCH (and USEARCH), from which I have this minimal test case.
Consider the following pair of test cases, where I am expecting in both cases denoising to merge the low abundance variant with the high abundance centroid.
In the above example with ratio at exactly 1:128, the variant gets merged into the more common sequence. However, if we increase the variant's abundance by one it does not:
This ratio of 1:128 is a threshold in the UNOISE beta function with dist=3, alpha=2.
This is behaving as though VSEARCH thinks the distance between these two sequences is 3, giving a beta function value 1 / 2^(alphadist+1) = 1 / 2^(23+1) = 1 / 2^7 = 1 / 128.
However while this could be viewed as a difference of three substitutions, I think the Levenshtein distance is only 2, two single base inserts - see for example this EMBOSS needle alignment:
This does match the UNOISE2 behaviour, ... <-- Edit, I need to double check this
Note that in general the two sequences need not be the same length, so I am unclear what exactly is being used as the distance rule.