peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License
8 stars 2 forks source link

UNOISE within sample-tally step #538

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

Builds on restructuring with introduction of sample-tally step, up to and including #535. Would close #147.

TODO:

peterjc commented 1 year ago

Applying this to the endangered_species/ has a noteworthy regression, it corrects one of the positive control sequences (Brassica napus) into another more common sequence (Brassica oleracea) which is only 1bp away.

From the raw summary/trnL-UAA.all_reads.fasta and summary/trnL-UAA.all_reads.onebp.tsv files:

>60f3f3d23c0340e2a0b7885f2bc7982c_233213 Brassica oleracea
GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGATAACTTTCAAATTCAGAGAAACCCTGGAATTAACAATGGGCAATCCTGAGCCAAATCCTGGGTTACGCGAACAAAACAGAGTTTAGAAAGCGGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCAATCCCTTGTGTTGAATCAAACGATTCACTTCATAGTCTGATAGATCCTTGGTGGAACTTATTAATCGGACGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACTGACAACAATGAAATTTATAGTAAGATGAAAATCCGTTGACTTTTAAAATCGTGAGG
>779d09e7efb1679bc5bbe29c561f8dcf_833 Brassica napus
GACTTAATTGGATTGAGCCTTGGTATGGAAACCTACTAAGTGATAACTTTCAAATTCAGAGAAACCCTGGAATTAACAATGGGCAATCCTGAGCCAAATCCTGGGTTACGCGAACAAACCAGAGTTTAGAAAGCGGGATAGGTGCAGAGACTCAATGGAAGCTGTTCTAACAAATGGAGTTCAATCCCTTGTGTTGAATCAAACGATTCACTTCATAGTCTGATAGATCCTTGGTGGAACTTATTAATCGGACGAGAATAAAGATAGAGTCCCATTCTACATGTCAATACTGACAACAATGAAATTTATAGTAAGATGAAAATCCGTTGACTTTTAAAATCGTGAGG

Plugging variant abundance 833 into the centroid abundance 233213, noting the distance is 1bp and alpha is 2, this is well within the UNOISE beta function threshold:

>>> a = 833
>>> dist = 1
>>> unoise_alpha=2.0
>>> a * 2 ** (unoise_alpha * dist + 1)
6664.0

We would have to increase the UNOISE alpha parameter to over 7.13 to stop this unwanted correction. i.e. Impractical.

The Brassica were already problematic within this example.

peterjc commented 1 year ago

It appears UNOISE2 and UNOISE3 as implemented in USEARCH do not actually use the Levenstein Distance as per the algorithm description in Edgar (2016).

Cross reference https://github.com/torognes/vsearch/issues/503 with a test case - the VSEARCH team also found this discrepancy, and aims to match USEARCH. Their implementation appears to use Needleman-Wunsch alignment with given scoring, and takes the mismatch count as the distance for UNOISE.

Possible options:

Speed wise, I expect finding the Levenstein Distance to be similar to doing NW (aside the overhead of back-tracing the matrix to extract the alignment which we don't really need/want except for the mismatch count).

peterjc commented 1 year ago

Test case, centroid and variant behaving as if edit distance 3 (but Levenshtein distance is 2):

$ cat duo100.fasta 
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=100
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ cat duo101.fasta 
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCTCTCTTTTAAACCCATTCTGTAATACTGAACATACT
>59434599a79bdf1571eb4932d9b0d801;size=101
TTTCCGTAGGTGAACCTGCGGAAGGATCATTACCACACCTAAAAAACTTTCCACGTGAACCGTATCAACCCAATTAGTTGGGGGCCTGCTCTGGGCGGCGGCTGTCGATGTCAAAGTCGACGGCTGCTGCTGCGTGGCGGGCCCTATCACTGGCGAGCGTTTGGGTCCCTCTCGGGGGAACTGAGCTAGTAGCCCTACTTTTAAACCCATTCTGTAATACTGAACATACT
$ diff duo100.fasta duo101.fasta 
3c3
< >59434599a79bdf1571eb4932d9b0d801;size=100
---
> >59434599a79bdf1571eb4932d9b0d801;size=101

UNOISE2 via USEARCH v9 shows critical ratio to be 128:

$ usearch9.1.13_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo100.fasta
00:00 148Mb   100.0% Size=100, 1 amplicons, 0 corrected reads  
00:00 152Mb  Sorting                                         
00:00 149Mb   100.0% 1 good, 0 chimeras
00:00 149Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12900;

$ usearch9.1.13_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.1.13_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:01 208Mb   100.0% Build index
00:01 178Mb   100.0% Reading duo101.fasta
00:01 148Mb   100.0% Size=101, 1 amplicons, 0 corrected reads  
00:01 152Mb  Sorting                                         
00:01 152Mb   100.0% 2 good, 0 chimeras
00:01 152Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;uniqsize=12800;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;uniqsize=101;size=101;

$ usearch9.2.64_i86linux32 -unoise2 duo100.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo100.fasta
00:00 148Mb   100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 153Mb   100.0% 1 good, 0 chimeras              
00:00 153Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;

$ usearch9.2.64_i86linux32 -unoise2 duo101.fasta -minampsize 4 -unoise_alpha 2 -fastaout denoised.fa ; grep '^>' denoised.fa
usearch v9.2.64_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-16 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 208Mb   100.0% Word stats
00:00 208Mb   100.0% Alloc rows
00:00 208Mb   100.0% Build index
00:00 178Mb   100.0% Reading duo101.fasta
00:00 148Mb   100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 156Mb   100.0% 2 good, 0 chimeras              
00:00 156Mb   100.0% Writing amplicons 
>Otu1;uniq=f1d4c0432c89f005ec2b4e467839f5cc;size=12800;
>Otu2;uniq=59434599a79bdf1571eb4932d9b0d801;size=101;

And again with UNOISE3 via USEARCH v10 and v11, critical ratio to be 128::

$ usearch10.0.259_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 40Mb    100.0% Reading duo100.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 15Mb    100.0% 1 good, 0 chimeras              
00:00 15Mb    100.0% Writing zotus     
>Zotu1

$ usearch10.0.259_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v10.0.259_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-17 Robert C. Edgar, all rights reserved.
http://drive5.com/usearch

License: personal use, non-transferrable

00:00 40Mb    100.0% Reading duo101.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 18Mb    100.0% 2 good, 0 chimeras              
00:00 18Mb    100.0% Writing zotus     
>Zotu1
>Zotu2

$ usearch11.0.667_i86linux32 -unoise3 duo100.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading duo100.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 100)  
00:00 16Mb    100.0% 1 good, 0 chimeras              
00:00 16Mb    100.0% Writing zotus     
>Zotu1

$ usearch11.0.667_i86linux32 -unoise3 duo101.fasta -minsize 4 -unoise_alpha 2 -zotus denoised.fa ; grep '^>' denoised.fa
usearch v11.0.667_i86linux32, 4.0Gb RAM (32.9Gb total), 12 cores
(C) Copyright 2013-18 Robert C. Edgar, all rights reserved.
https://drive5.com/usearch

License: personal use only

00:00 41Mb    100.0% Reading duo101.fasta
00:00 11Mb    100.0% 1 amplicons, 0 bad (size >= 101)  
00:00 19Mb    100.0% 2 good, 0 chimeras              
00:00 19Mb    100.0% Writing zotus     
>Zotu1
>Zotu2

Again with VSEARCH,

$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo100.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file duo100.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 1 Size min 12900, max 12900, avg 2.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12900

$ vsearch --unoise_alpha 2 --minsize 4 --cluster_unoise duo101.fasta --centroids vsearch.fasta --sizein --sizeout --sizeorder ; grep "^>" vsearch.fasta
vsearch v2.21.2_linux_x86_64, 31.4GB RAM, 12 cores
https://github.com/torognes/vsearch

Reading file duo101.fasta 100%  
460 nt in 2 seqs, min 230, max 230, avg 230
Masking 100% 
Sorting by abundance 100%
Counting k-mers 100% 
Clustering 100%  
Sorting clusters 100%
Writing clusters 100% 
Clusters: 2 Size min 101, max 12800, avg 1.0
Singletons: 0, 0.0% of seqs, 0.0% of clusters
>f1d4c0432c89f005ec2b4e467839f5cc;size=12800
>59434599a79bdf1571eb4932d9b0d801;size=101

i.e. Ratio of exactly 1:128 merged (one centroid), but if the variant is more abundant and thus over this ratio, not merged (two centroids)

This would be fine if dist=3, since beta function uses 2^(alpha*dist+1) = 2^7 = 128

However, Levenshtein distance is 2, giving 2^(alpha*dist+1) = 2^5 = 32, meaning abundance threshold for not-merging is much higher at abundance 400 (since 12800 / 400 = 32).

peterjc commented 1 year ago

It seems workable to take the reference sequences in our database as trusted, and special case a UNOISE correction when both the query and target are in the database - and instead of applying the correction, accept it as a new centroid. This avoids the Brassica regression noted above.

peterjc commented 1 year ago

Using this on the drained_ponds example needs further examination,

 =================================== === === === ==== ==== ===========
 #Species                            TP  FP  FN  TN   F1   Ad-hoc-loss
 =================================== === === === ==== ==== ===========
-OVERALL                             433 388 331 5877 0.55 0.624
+OVERALL                             431 361 333 5904 0.55 0.617
 (Off-target) Anatidae (waterfowl)   0   70  0   29   0.00 1.000
 (Off-target) Apodemus               0   4   0   95   0.00 1.000
 (Off-target) Ardea cinerea          0   11  0   88   0.00 1.000
-(Off-target) Bos taurus             0   3   0   96   0.00 1.000
+(Off-target) Bos taurus             0   4   0   95   0.00 1.000
 (Off-target) Canis lupus familiaris 0   7   0   92   0.00 1.000
-(Off-target) Capra hircus           0   1   0   98   0.00 1.000
 (Off-target) Columba                0   47  0   52   0.00 1.000
 (Off-target) Gallinula chloropus    0   50  0   49   0.00 1.000
 (Off-target) Gallus gallus          0   13  0   86   0.00 1.000
 (Off-target) Homo sapiens           0   83  0   16   0.00 1.000
-(Off-target) Ovis aries             0   17  0   82   0.00 1.000
-(Off-target) Ovis dalli             0   1   0   98   0.00 1.000
+(Off-target) Ovis aries             0   18  0   81   0.00 1.000
 (Off-target) Phalacrocorax carbo    0   25  0   74   0.00 1.000
 (Off-target) Sturnus                0   3   0   96   0.00 1.000
 (Off-target) Sus scrofa             0   16  0   83   0.00 1.000
 (Off-target) Turdus                 0   7   0   92   0.00 1.000
 Abramis brama                       65  0   16  18   0.89 0.198
 Acipenser spp.                      0   0   9   90   0.00 1.000
-Alburnus mossulensis                0   1   0   98   0.00 1.000
 Astatotilapia calliptera            4   0   0   95   1.00 0.000
 Barbus barbus                       46  0   35  18   0.72 0.432
 Carassius carassius                 64  0   17  18   0.88 0.210
-Ctenopharyngodon idella             3   15  6   75   0.22 0.875
+Ctenopharyngodon idella             1   0   8   90   0.20 0.889
 Cyprinus carpio                     61  0   20  18   0.86 0.247
 Maylandia zebra                     4   0   0   95   1.00 0.000
 Perca fluviatilis                   40  0   41  18   0.66 0.506
@@ -132,11 +129,10 @@ Pseudorasbora parva                 0   2   0   97   0.00 1.000
 Rutilus rutilus                     63  0   18  18   0.88 0.222
 Scardinius erythrophthalmus         6   0   75  18   0.14 0.926
 Silurus glanis                      9   0   0   90   1.00 0.000
-Spinibarbus denticulatus            0   11  0   88   0.00 1.000
 Squalidus gracilis                  0   1   0   98   0.00 1.000
 Squalius cephalus                   6   0   75  18   0.14 0.926
 Tinca tinca                         62  0   19  18   0.87 0.235
-OTHER 37 SPECIES IN DB              0   0   0   3663 0.00 0.000
+OTHER 41 SPECIES IN DB              0   0   0   4059 0.00 0.000
 =================================== === === === ==== ==== ===========

Presumably the former Ctenopharyngodon idella and Spinibarbus denticulatus matches (mostly false positives, but also the two TP changed to FN) are now unknowns?

Edit: Due to these:

>4c53f6ed1ecdad3af2299999ec83d756 Carassius carassius;Spinibarbus denticulatus
ACTATGCTCAGCCGTAAACTTAGACATCCTACTACAATAGATGTCCGCCAGGGTACTACGAGCATTAGCTTAAAACCCAAAGGACCTGACGGTGTCTCAGACCCCC

Would have been corrected to d0069a8ca898fce2fea087ac19c11e1f Carassius carassius

>285edce3d193c92b1959e60bc130b518 Ctenopharyngodon idella;Tinca tinca
ACTATGCTCAGCCATAAACCTAGACATCCACCTACAATTAAACGTCCGCCCGGGTACTACGAGCATTAGCTTGAAACCCAAAGGACCTGACGGTGCCTTAGACCCCC

Would have been corrected to 8b43ed15592de8e0fb0cd95d23517d85 Tinca tinca, rather than a6645bbe7ef87bffbae42430f9488f88 Ctenopharyngodon idella (1bp from both, but the abundance difference is massive).

peterjc commented 1 year ago

Hmm, soil_nematodes example is not as good with Tripyla glomerans, might need a little closer look.