torognes / vsearch

Versatile open-source tool for microbiome analysis
Other
643 stars 123 forks source link

UNOISE not using Levenshtein distance? #503

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

The Edgar 2016 paper says UNOISE2 used the Levenshtein distance: https://www.biorxiv.org/content/10.1101/081257v1.full

Comment https://github.com/torognes/vsearch/pull/283#issuecomment-360850868 on the original pull request adding UNOISE functionality to VSEARCH suggests from empirical testing that the USEARCH implementation of UNOISE2 and UNOISE3 was not using Levenshtein distance, but counting substitutions.

Do you recall how you came to that conclusion @davidealbanese?

I've been trying to validate my own UNOISE implementation by cross checking against VSEARCH (and USEARCH), from which I have this minimal test case.

Consider the following pair of test cases, where I am expecting in both cases denoising to merge the low abundance variant with the high abundance centroid.

$ cat duo100.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=100
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; cat vsearch.fasta 
vsearch v2.22.1_macos_x86_64, 8.0GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file duo100.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 1 Size min 12900, max 12900, avg 2.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12900
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTG
GGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGT
TTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT

In the above example with ratio at exactly 1:128, the variant gets merged into the more common sequence. However, if we increase the variant's abundance by one it does not:

$ cat duo101.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ diff duo100.fasta duo101.fasta 
3c3
< >59434599a79bdf1571eb4932d9b0d801;size=100
---
> >59434599a79bdf1571eb4932d9b0d801;size=101
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo101.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; cat vsearch.fasta 
vsearch v2.22.1_macos_x86_64, 8.0GB RAM, 8 cores
https://github.com/torognes/vsearch

Reading file duo101.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 2 Size min 101, max 12800, avg 1.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTG
GGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGT
TTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTG
GGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGT
TTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT

This ratio of 1:128 is a threshold in the UNOISE beta function with dist=3, alpha=2.

This is behaving as though VSEARCH thinks the distance between these two sequences is 3, giving a beta function value 1 / 2^(alphadist+1) = 1 / 2^(23+1) = 1 / 2^7 = 1 / 128.

However while this could be viewed as a difference of three substitutions, I think the Levenshtein distance is only 2, two single base inserts - see for example this EMBOSS needle alignment:

########################################
# Program: needle
# Rundate: Fri 13 Jan 2023 16:12:02
# Commandline: needle
#    -auto
#    -stdout
#    -asequence emboss_needle-I20230113-162416-0440-37047040-p2m.asequence
#    -bsequence emboss_needle-I20230113-162416-0440-37047040-p2m.bsequence
#    -datafile EDNAFULL
#    -gapopen 10.0
#    -gapextend 0.5
#    -endopen 10.0
#    -endextend 0.5
#    -aformat3 pair
#    -snucleotide1
#    -snucleotide2
# Align_format: pair
# Report_file: stdout
########################################

#=======================================
#
# Aligned_sequences: 2
# 1: f1d4c0432c89f005ec2b4e467839f5cc_5689583
# 2: 59434599a79bdf1571eb4932d9b0d801_44449
# Matrix: EDNAFULL
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 231
# Identity:     229/231 (99.1%)
# Similarity:   229/231 (99.1%)
# Gaps:           2/231 ( 0.9%)
# Score: 1125.0
# 
#
#=======================================

f1d4c0432c89f      1 TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTT     50
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
59434599a79bd      1 TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTT     50

f1d4c0432c89f     51 CCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCG    100
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
59434599a79bd     51 CCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCG    100

f1d4c0432c89f    101 GCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCAC    150
                     ||||||||||||||||||||||||||||||||||||||||||||||||||
59434599a79bd    101 GCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCAC    150

f1d4c0432c89f    151 TGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCT-CT    199
                     |||||||||||||||||||||||||||||||||||||||||||| || ||
59434599a79bd    151 TGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCC-CTACT    199

f1d4c0432c89f    200 TTTAAACCCATTCTGTAATACTGAACATACT    230
                     |||||||||||||||||||||||||||||||
59434599a79bd    200 TTTAAACCCATTCTGTAATACTGAACATACT    230

#---------------------------------------
#---------------------------------------

This does match the UNOISE2 behaviour, ... <-- Edit, I need to double check this

Note that in general the two sequences need not be the same length, so I am unclear what exactly is being used as the distance rule.

peterjc commented 1 year ago

Partly answering my own question: Looking at the code, it seems for its UNOISE implementation VSEARCH uses the Needleman-Wunsch aligner in https://github.com/torognes/vsearch/blob/master/src/align.cc and https://github.com/torognes/vsearch/blob/master/src/searchcore.cc as follows:

      if (opt_cluster_unoise)
        {
          int d = hit->mismatches;
          double skew = 1.0 * si->qsize / db_getabundance(hit->target);
          double beta = 1.0 / pow(2, 1.0 * opt_unoise_alpha * d + 1);

          if (skew <= beta || d == 0)
            {
              /* accepted */
              ...

Quoting the man-page https://github.com/torognes/vsearch/blob/master/man/vsearch.1 explains much of the scoring:

When comparing sequences during chimera detection, dereplication, searching and clustering, T and U are considered identical, regardless of their case. When aligning sequences, identical symbols will receive a positive match score (default +2). If two symbols are not identical, their alignment result in a negative mismatch score (default -4). Aligning a pair of symbols where at least one of them is an ambiguous symbol (BDHKMNRSVWY) will always result in a score of zero. Alignment of two identical ambiguous symbols (for example, R vs R) also receives a score of zero. When computing the amount of similarity by counting matches and mismatches after alignment, ambiguous nucleotide symbols will count as matching to other symbols if they have at least one of the nucleotides (ACGTU) they may represent in common. For example: W will match A and T, but also any of MRVHDN. When showing alignments (for example with the --alnout option) matches involving ambiguous symbols will be shown with a plus character (+) between them while exact matches between non-ambiguous symbols will be shown with a vertical bar character (|).

This is consistent with https://github.com/torognes/vsearch/blob/master/src/align.cc which also gives the gap penalties:

  default (interior) scores:
  match: +2
  mismatch: -4
  gap open: 20
  gap extend: 2

i.e. This looks like it does NW alignment, and then considers the mismatch count of the best alignment(s) for the UNOISE distance.

peterjc commented 1 year ago

Summary: both USEARCH (assorted versions including UNOISE2 and UNOISE3 algorithms) and VSEARCH behave as if the distance between these sequences is 3. However, the Levenshtein distance is only 2.


Test case, centroid and variant behaving as if edit distance 3 (but Levenshtein distance is 2):

$ cat duo100.fasta 
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=100
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ cat duo101.fasta 
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ diff duo100.fasta duo101.fasta 
3c3
< >59434599a79bdf1571eb4932d9b0d801;size=100
---
> >59434599a79bdf1571eb4932d9b0d801;size=101

UNOISE2 via USEARCH v9 shows critical ratio to be 128:

$ usearch9.1.13_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo100.fasta
00:00 148Mb   100.0% Size=100, 1 amplicons, 0 corrected reads  
00:00 152Mb  Sorting                                         
00:00 149Mb   100.0% 1 good, 0 chimeras
00:00 149Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12900;

$ usearch9.1.13_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:01 208Mb   100.0% Build index
00:01 178Mb   100.0% Reading duo101.fasta
00:01 148Mb   100.0% Size=101, 1 amplicons, 0 corrected reads  
00:01 152Mb  Sorting                                         
00:01 152Mb   100.0% 2 good, 0 chimeras
00:01 152Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;uniqsize=101;size=101;

$ usearch9.2.64_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo100.fasta
00:00 148Mb   100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 153Mb   100.0% 1 good, 0 chimeras              
00:00 153Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;

$ usearch9.2.64_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo101.fasta
00:00 148Mb   100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 156Mb   100.0% 2 good, 0 chimeras              
00:00 156Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;size=101;

And again with UNOISE3 via USEARCH v10 and v11, critical ratio to be 128::

$ usearch10.0.259_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 40Mb    100.0% Reading duo100.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 15Mb    100.0% 1 good, 0 chimeras              
00:00 15Mb    100.0% Writing zotus     
>Zotu1

$ usearch10.0.259_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 40Mb    100.0% Reading duo101.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 18Mb    100.0% 2 good, 0 chimeras              
00:00 18Mb    100.0% Writing zotus     
>Zotu1
>Zotu2

$ usearch11.0.667_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading duo100.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 16Mb    100.0% 1 good, 0 chimeras              
00:00 16Mb    100.0% Writing zotus     
>Zotu1

$ usearch11.0.667_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading duo101.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 19Mb    100.0% 2 good, 0 chimeras              
00:00 19Mb    100.0% Writing zotus     
>Zotu1
>Zotu2

Again with VSEARCH,

$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file duo100.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 1 Size min 12900, max 12900, avg 2.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12900

$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo101.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file duo101.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 2 Size min 101, max 12800, avg 1.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
>59434599a79bdf1571eb4932d9b0d801;size=101

i.e. Ratio of exactly 1:128 merged (one centroid), but if the variant is more abundant and thus over this ratio, not merged (two centroids)

This would be fine if dist=3, since beta function uses 2^(alpha*dist+1) = 2^7 = 128

However, Levenshtein distance is 2, giving 2^(alpha*dist+1) = 2^5 = 32, meaning abundance threshold for not-merging is much higher at abundance 400 (since 12800 / 400 = 32).

peterjc commented 1 year ago

I emailed Robert Edgar about the discrepancy between USEARCH and my implementation. He'd not looked at this code for a while but said it was using Levenstein distance, but with some heuristics on top of the basic algorithm. I've not tried comparing the USEARCH and VSEARCH implementations against each other to say how well it mimics those heuristics.

torognes commented 1 year ago

Hi

Sorry for the late reply. I actually wrote an answer two weeks ago, but it must have disappeared for some reason.

I think you are right that vsearch does not use Levenshtein distance, but rather the number of mismatches. I cannot answer for usearch, but we have tried to make vsearch as similar as possible in such respects.

You can actually view the alignment generated by vsearch using the option --alnout filename. I've done that for your example, and as you see, in this case the alignment contains 3 mismatches instead of 2 indels:

$ more vsearch.alnout 
vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder --alnout vsearch.alnout
vsearch v2.22.1_macos_x86_64, 16.0GB RAM, 8 cores

Query >59434599a79bdf1571eb4932d9b0d801;size=100
 %Id   TLen  Target
 99%    230  f1d4c0432c89f005ec2b4e467839f5cc;size=12800

 Query 230nt >59434599a79bdf1571eb4932d9b0d801;size=100
Target 230nt >f1d4c0432c89f005ec2b4e467839f5cc;size=12800

Qry   1 + TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTA 64
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt   1 + TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTA 64

Qry  65 + TCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCT 128
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt  65 + TCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCT 128

Qry 129 + GCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAG 192
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Tgt 129 + GCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAG 192

Qry 193 + CCCTACTTTTAAACCCATTCTGTAATACTGAACATACT 230
          ||   |||||||||||||||||||||||||||||||||
Tgt 193 + CCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT 230

230 cols, 227 ids (98.7%), 0 gaps (0.0%)

The difference relative to Needle is due to a somewhat different scoring system (compared to Needle, vsearch uses higher gap penalties relative to mismatch penalties). This should hopefully clarify at least part of your question.

peterjc commented 1 year ago

Thank you.

Given how similarly the USEARCH and VSEARCH implementations of UNOISE seem to behave on this and my other examples (impressive reverse engineering there!), the simplest explanation is while Robert Edgar's initial implementation may have used Levenshtein distance as in Edgar 2016 manuscript https://www.biorxiv.org/content/10.1101/081257v1.full given his modifications since, it is now closer to a substitution count distance.

I think we can close this issue now (unless you want to add anything to the VSEARCH documentation about this point?).

torognes commented 1 year ago

Thanks. I'll add a note about this issue in the vsearch manual.

torognes commented 1 year ago

Added note in manual about cluster_unoise and the calculations.