Closed peterjc closed 1 year ago
Applying this to the endangered_species/
has a noteworthy regression, it corrects one of the positive control sequences (Brassica napus) into another more common sequence (Brassica oleracea) which is only 1bp away.
From the raw summary/trnL-UAA.all_reads.fasta
and summary/trnL-UAA.all_reads.onebp.tsv
files:
>60f3f3d23c0340e2a0b7885f2bc7982c_233213 Brassica oleracea
GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGATAACTTTCAAATTCAGAGAAACCCTGGAATTAACAATGGGCAATCCTGAGCCAAATCCTGGGTTACGCGAACAAAACAGAGTTTAGAAAGCGGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCAATCCCTTGTGTTGAATCAAACGATTCACTTCATAGTCTGATAGATCCTTGGTGGAACTTATTAATCGGACGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACTGACAACAATGAAATTTATAGTAAGATGAAAATCCGTTGACTTTTAAAATCGTGAGG
>779d09e7efb1679bc5bbe29c561f8dcf_833 Brassica napus
GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGATAACTTTCAAATTCAGAGAAACCCTGGAATTAACAATGGGCAATCCTGAGCCAAATCCTGGGTTACGCGAACAAACCAGAGTTTAGAAAGCGGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCAATCCCTTGTGTTGAATCAAACGATTCACTTCATAGTCTGATAGATCCTTGGTGGAACTTATTAATCGGACGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACTGACAACAATGAAATTTATAGTAAGATGAAAATCCGTTGACTTTTAAAATCGTGAGG
Plugging variant abundance 833 into the centroid abundance 233213, noting the distance is 1bp and alpha is 2, this is well within the UNOISE beta function threshold:
>>> a = 833
>>> dist = 1
>>> unoise_alpha=2.0
>>> a * 2 ** (unoise_alpha * dist + 1)
6664.0
We would have to increase the UNOISE alpha parameter to over 7.13 to stop this unwanted correction. i.e. Impractical.
The Brassica were already problematic within this example.
It appears UNOISE2 and UNOISE3 as implemented in USEARCH do not actually use the Levenstein Distance as per the algorithm description in Edgar (2016).
Cross reference https://github.com/torognes/vsearch/issues/503 with a test case - the VSEARCH team also found this discrepancy, and aims to match USEARCH. Their implementation appears to use Needleman-Wunsch alignment with given scoring, and takes the mismatch count as the distance for UNOISE.
Possible options:
Speed wise, I expect finding the Levenstein Distance to be similar to doing NW (aside the overhead of back-tracing the matrix to extract the alignment which we don't really need/want except for the mismatch count).
Test case, centroid and variant behaving as if edit distance 3 (but Levenshtein distance is 2):
$ cat duo100.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=100
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ cat duo101.fasta
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ diff duo100.fasta duo101.fasta
3c3
< >59434599a79bdf1571eb4932d9b0d801;size=100
---
> >59434599a79bdf1571eb4932d9b0d801;size=101
UNOISE2 via USEARCH v9 shows critical ratio to be 128:
$ usearch9.1.13_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo100.fasta
00:00 148Mb 100.0% Size=100, 1 amplicons, 0 corrected reads
00:00 152Mb Sorting
00:00 149Mb 100.0% 1 good, 0 chimeras
00:00 149Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12900;
$ usearch9.1.13_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:01 208Mb 100.0% Build index
00:01 178Mb 100.0% Reading duo101.fasta
00:01 148Mb 100.0% Size=101, 1 amplicons, 0 corrected reads
00:01 152Mb Sorting
00:01 152Mb 100.0% 2 good, 0 chimeras
00:01 152Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;uniqsize=101;size=101;
$ usearch9.2.64_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo100.fasta
00:00 148Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 153Mb 100.0% 1 good, 0 chimeras
00:00 153Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
$ usearch9.2.64_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 208Mb 100.0% Word stats
00:00 208Mb 100.0% Alloc rows
00:00 208Mb 100.0% Build index
00:00 178Mb 100.0% Reading duo101.fasta
00:00 148Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 156Mb 100.0% 2 good, 0 chimeras
00:00 156Mb 100.0% Writing amplicons
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;size=101;
And again with UNOISE3 via USEARCH v10 and v11, critical ratio to be 128::
$ usearch10.0.259_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 40Mb 100.0% Reading duo100.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 15Mb 100.0% 1 good, 0 chimeras
00:00 15Mb 100.0% Writing zotus
>Zotu1
$ usearch10.0.259_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch
License: personal use, non-transferrable
00:00 40Mb 100.0% Reading duo101.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 18Mb 100.0% 2 good, 0 chimeras
00:00 18Mb 100.0% Writing zotus
>Zotu1
>Zotu2
$ usearch11.0.667_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 41Mb 100.0% Reading duo100.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 100)
00:00 16Mb 100.0% 1 good, 0 chimeras
00:00 16Mb 100.0% Writing zotus
>Zotu1
$ usearch11.0.667_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch
License: personal use only
00:00 41Mb 100.0% Reading duo101.fasta
00:00 11Mb 100.0% 1 amplicons, 0 bad (size >= 101)
00:00 19Mb 100.0% 2 good, 0 chimeras
00:00 19Mb 100.0% Writing zotus
>Zotu1
>Zotu2
Again with VSEARCH,
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch
Reading file duo100.fasta 100%
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 1 Size min 12900, max 12900, avg 2.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12900
$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo101.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch
Reading file duo101.fasta 100%
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100%
Sorting by abundance 100%
Counting k-mers 100%
Clustering 100%
Sorting clusters 100%
Writing clusters 100%
Clusters: 2 Size min 101, max 12800, avg 1.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
>59434599a79bdf1571eb4932d9b0d801;size=101
i.e. Ratio of exactly 1:128 merged (one centroid), but if the variant is more abundant and thus over this ratio, not merged (two centroids)
This would be fine if dist=3, since beta function uses 2^(alpha*dist+1) = 2^7 = 128
However, Levenshtein distance is 2, giving 2^(alpha*dist+1) = 2^5 = 32, meaning abundance threshold for not-merging is much higher at abundance 400 (since 12800 / 400 = 32).
It seems workable to take the reference sequences in our database as trusted, and special case a UNOISE correction when both the query and target are in the database - and instead of applying the correction, accept it as a new centroid. This avoids the Brassica regression noted above.
Using this on the drained_ponds example needs further examination,
=================================== === === === ==== ==== ===========
#Species TP FP FN TN F1 Ad-hoc-loss
=================================== === === === ==== ==== ===========
-OVERALL 433 388 331 5877 0.55 0.624
+OVERALL 431 361 333 5904 0.55 0.617
(Off-target) Anatidae (waterfowl) 0 70 0 29 0.00 1.000
(Off-target) Apodemus 0 4 0 95 0.00 1.000
(Off-target) Ardea cinerea 0 11 0 88 0.00 1.000
-(Off-target) Bos taurus 0 3 0 96 0.00 1.000
+(Off-target) Bos taurus 0 4 0 95 0.00 1.000
(Off-target) Canis lupus familiaris 0 7 0 92 0.00 1.000
-(Off-target) Capra hircus 0 1 0 98 0.00 1.000
(Off-target) Columba 0 47 0 52 0.00 1.000
(Off-target) Gallinula chloropus 0 50 0 49 0.00 1.000
(Off-target) Gallus gallus 0 13 0 86 0.00 1.000
(Off-target) Homo sapiens 0 83 0 16 0.00 1.000
-(Off-target) Ovis aries 0 17 0 82 0.00 1.000
-(Off-target) Ovis dalli 0 1 0 98 0.00 1.000
+(Off-target) Ovis aries 0 18 0 81 0.00 1.000
(Off-target) Phalacrocorax carbo 0 25 0 74 0.00 1.000
(Off-target) Sturnus 0 3 0 96 0.00 1.000
(Off-target) Sus scrofa 0 16 0 83 0.00 1.000
(Off-target) Turdus 0 7 0 92 0.00 1.000
Abramis brama 65 0 16 18 0.89 0.198
Acipenser spp. 0 0 9 90 0.00 1.000
-Alburnus mossulensis 0 1 0 98 0.00 1.000
Astatotilapia calliptera 4 0 0 95 1.00 0.000
Barbus barbus 46 0 35 18 0.72 0.432
Carassius carassius 64 0 17 18 0.88 0.210
-Ctenopharyngodon idella 3 15 6 75 0.22 0.875
+Ctenopharyngodon idella 1 0 8 90 0.20 0.889
Cyprinus carpio 61 0 20 18 0.86 0.247
Maylandia zebra 4 0 0 95 1.00 0.000
Perca fluviatilis 40 0 41 18 0.66 0.506
@@ -132,11 +129,10 @@ Pseudorasbora parva 0 2 0 97 0.00 1.000
Rutilus rutilus 63 0 18 18 0.88 0.222
Scardinius erythrophthalmus 6 0 75 18 0.14 0.926
Silurus glanis 9 0 0 90 1.00 0.000
-Spinibarbus denticulatus 0 11 0 88 0.00 1.000
Squalidus gracilis 0 1 0 98 0.00 1.000
Squalius cephalus 6 0 75 18 0.14 0.926
Tinca tinca 62 0 19 18 0.87 0.235
-OTHER 37 SPECIES IN DB 0 0 0 3663 0.00 0.000
+OTHER 41 SPECIES IN DB 0 0 0 4059 0.00 0.000
=================================== === === === ==== ==== ===========
Presumably the former Ctenopharyngodon idella and Spinibarbus denticulatus matches (mostly false positives, but also the two TP changed to FN) are now unknowns?
Edit: Due to these:
>4c53f6ed1ecdad3af2299999ec83d756 Carassius carassius;Spinibarbus denticulatus
ACTATGCTCAGCCGTAAACTTAGACATCCTACTACAATAGATGTCCGCCAGGGTACTACGAGCATTAGCTTAAAACCCAAAGGACCTGACGGTGTCTCAGACCCCC
Would have been corrected to d0069a8ca898fce2fea087ac19c11e1f Carassius carassius
>285edce3d193c92b1959e60bc130b518 Ctenopharyngodon idella;Tinca tinca
ACTATGCTCAGCCATAAACCTAGACATCCACCTACAATTAAACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC
Would have been corrected to 8b43ed15592de8e0fb0cd95d23517d85 Tinca tinca, rather than a6645bbe7ef87bffbae42430f9488f88 Ctenopharyngodon idella (1bp from both, but the abundance difference is massive).
Hmm, soil_nematodes example is not as good with Tripyla glomerans, might need a little closer look.
Builds on restructuring with introduction of
sample-tally
step, up to and including #535. Would close #147.TODO: