peterjc / thapbi-pict

Tree Health and Plant Biosecurity Initiative - Phytophthora ITS1 Classifier Tool
https://thapbi-pict.readthedocs.io/
MIT License

chimera regression with VSEARCH 2.23.0 #563

Closed peterjc closed 1 year ago

peterjc commented 1 year ago

From the release notes for VSEARCH 2.23.0:

Update documentation. Add citation file. Modernize and improve code. Fix several minor bugs. Fix compilation with GCC 13. Print stats after fastq_mergepairs to log file instead of stderr. Handle sizein option correctly with dbmatched option for usearch_global. Allow maxseqlength option for makeudb_usearch. Fix memory allocation problem with chimera detection. Add lengthout and xlength options. Increase precision for eeout option. Add warning about sintax algorithm, random seed and multiple threads. Refactor chimera detection code. Add undocumented experimental long_chimeras_denovo command. Fix segfault with clustering. Add more references.

This fits with a new test failure on the master branch, first seen after commit https://github.com/peterjc/thapbi-pict/commit/d03730dcbc368cf562290f2d83f011d4ed0a533d (itself only a minor change).

Quoting https://app.circleci.com/pipelines/github/peterjc/thapbi-pict/3641/workflows/0dcd83cb-db0c-4646-a9d6-cdc245386761/jobs/3502:

+ thapbi_pict denoise -i tests/read-correction/chimeras.before.fasta -o /tmp/thapbi_pict/sample-tally/after.fasta --denoise vsearch --minlen 60 -t 0
Loaded 10 unique sequences from 13644103 in total within length range, max abundance 4796067
Spent 0.0s running vsearch for read-corrections
VSEARCH reduced unique ASVs from 10 to 10, max abundance now 4796067
VSEARCH flagged 3 as chimeras
+ echo diff /tmp/thapbi_pict/sample-tally/after.fasta tests/read-correction/chimeras.vsearch.fasta
diff /tmp/thapbi_pict/sample-tally/after.fasta tests/read-correction/chimeras.vsearch.fasta
+ diff /tmp/thapbi_pict/sample-tally/after.fasta tests/read-correction/chimeras.vsearch.fasta
15c15
< >b5a466daa1cb779bb0d0dbfa39d49cb6_1617
---
> >b5a466daa1cb779bb0d0dbfa39d49cb6_1617 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc

Confirmed locally:

$ thapbi_pict denoise -i tests/read-correction/chimeras.before.fasta -o /tmp/thapbi_pict/sample-tally/after.fasta --denoise vsearch --minlen 60 -t 0 --verbose
DEBUG: Parsing tests/read-correction/chimeras.before.fasta
Loaded 10 unique sequences from 13644103 in total within length range, max abundance 4796067
DEBUG: Starting read-correction with vsearch...
DEBUG: version of vsearch: v2.23.0
DEBUG: Shared temp folder /tmp/tmprg8z2yi1
Calling command: vsearch --sizein --sizeout --cluster_unoise /tmp/tmprg8z2yi1/noisy.fasta --uc /tmp/tmprg8z2yi1/clustering.tsv --threads 1
Calling command: vsearch --sizein --sizeout --uchime3_denovo /tmp/tmprg8z2yi1/denoised.fasta --uchimeout /tmp/tmprg8z2yi1/chimeras.tsv --threads 1
Spent 0.0s running vsearch for read-corrections
VSEARCH reduced unique ASVs from 10 to 10, max abundance now 4796067
VSEARCH flagged 3 as chimeras

Note that only 3 chimeras were flagged, while the expected output has 4, so the diff failed:

$ grep chimera /tmp/thapbi_pict/sample-tally/after.fasta tests/read-correction/chimeras.vsearch.fasta
/tmp/thapbi_pict/sample-tally/after.fasta:>d9e6de1308a8ac1448de351747d023c0_72646 chimera 972db44c016a166de86a2bacab3f4226/3d3fa2fd6fe0f183cad80771f5950b27
/tmp/thapbi_pict/sample-tally/after.fasta:>0984333c38352fd1333ab5faf4c760ef_856 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc
/tmp/thapbi_pict/sample-tally/after.fasta:>3e602b47c64db19b5c4d8bdc29a833b1_667 chimera 32159de6cbb6df37d084e31c37c30e7b/dcd6316eb77be50ee344fbeca6e005c7
tests/read-correction/chimeras.vsearch.fasta:>d9e6de1308a8ac1448de351747d023c0_72646 chimera 972db44c016a166de86a2bacab3f4226/3d3fa2fd6fe0f183cad80771f5950b27
tests/read-correction/chimeras.vsearch.fasta:>b5a466daa1cb779bb0d0dbfa39d49cb6_1617 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc
tests/read-correction/chimeras.vsearch.fasta:>0984333c38352fd1333ab5faf4c760ef_856 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc
tests/read-correction/chimeras.vsearch.fasta:>3e602b47c64db19b5c4d8bdc29a833b1_667 chimera 32159de6cbb6df37d084e31c37c30e7b/dcd6316eb77be50ee344fbeca6e005c7
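
The same comparison can be sketched in Python, which makes the missing call explicit rather than leaving it to be read off the grep output. This is only an illustration: `chimera_calls` and `missing_chimeras` are hypothetical helpers (not part of thapbi_pict), and the header layout `>ID chimera PARENT_A/PARENT_B` is assumed from the lines shown above.

```python
def chimera_calls(fasta_lines):
    """Map sequence ID -> (parent A, parent B) for headers flagged as chimeras.

    Assumes headers look like '>ID chimera PARENT_A/PARENT_B', as in the
    grep output above. Other lines and headers are ignored.
    """
    calls = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if len(fields) >= 3 and fields[1] == "chimera":
            calls[fields[0]] = tuple(fields[2].split("/"))
    return calls


def missing_chimeras(observed_lines, expected_lines):
    """Return chimera calls present in the expected file but not the observed one."""
    observed = chimera_calls(observed_lines)
    expected = chimera_calls(expected_lines)
    return {k: v for k, v in expected.items() if k not in observed}


# Headers copied from the grep output above (sequences omitted).
observed = [
    ">d9e6de1308a8ac1448de351747d023c0_72646 chimera 972db44c016a166de86a2bacab3f4226/3d3fa2fd6fe0f183cad80771f5950b27",
    ">0984333c38352fd1333ab5faf4c760ef_856 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc",
    ">3e602b47c64db19b5c4d8bdc29a833b1_667 chimera 32159de6cbb6df37d084e31c37c30e7b/dcd6316eb77be50ee344fbeca6e005c7",
    ">b5a466daa1cb779bb0d0dbfa39d49cb6_1617",  # no longer flagged
]
expected = observed[:3] + [
    ">b5a466daa1cb779bb0d0dbfa39d49cb6_1617 chimera 6e847180a4da6eed316e1fb98b21218f/af3282a9797a70b6922e29e68c0b2bdc",
]
missing = missing_chimeras(observed, expected)
for seq_id, parents in missing.items():
    print(seq_id, "expected parents:", "/".join(parents))
```

With these headers the only missing call is `b5a466daa1cb779bb0d0dbfa39d49cb6_1617`, matching the single-line diff reported by CI.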

After downgrading to VSEARCH 2.22.1 the test passes, and we again have 4 chimeras:

$ thapbi_pict denoise -i tests/read-correction/chimeras.before.fasta -o /tmp/thapbi_pict/sample-tally/after.fasta --denoise vsearch --minlen 60 -t 0 --verbose
DEBUG: Parsing tests/read-correction/chimeras.before.fasta
Loaded 10 unique sequences from 13644103 in total within length range, max abundance 4796067
DEBUG: Starting read-correction with vsearch...
DEBUG: version of vsearch: v2.22.1
DEBUG: Shared temp folder /tmp/tmpc0j0mnws
Calling command: vsearch --sizein --sizeout --cluster_unoise /tmp/tmpc0j0mnws/noisy.fasta --uc /tmp/tmpc0j0mnws/clustering.tsv --threads 1
Calling command: vsearch --sizein --sizeout --uchime3_denovo /tmp/tmpc0j0mnws/denoised.fasta --uchimeout /tmp/tmpc0j0mnws/chimeras.tsv --threads 1
Spent 0.0s running vsearch for read-corrections
VSEARCH reduced unique ASVs from 10 to 10, max abundance now 4796067
VSEARCH flagged 4 as chimeras
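
To keep track of which VSEARCH version produced which result, the version strings in the DEBUG lines above can be turned into comparable tuples. A minimal sketch, assuming the `vMAJOR.MINOR.PATCH` form seen in these logs; `parse_vsearch_version` is a hypothetical helper, not something thapbi_pict is known to provide.

```python
import re


def parse_vsearch_version(text):
    """Extract (major, minor, patch) from text containing e.g. 'v2.23.0'."""
    match = re.search(r"v(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"No VSEARCH version found in {text!r}")
    return tuple(int(part) for part in match.groups())


# DEBUG lines copied from the two runs above.
passing = parse_vsearch_version("DEBUG: version of vsearch: v2.22.1")
failing = parse_vsearch_version("DEBUG: version of vsearch: v2.23.0")
print(passing, failing, passing < failing)
```

Tuples compare element by element, so this avoids the pitfalls of comparing version strings directly.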

peterjc commented 1 year ago

It might be worth digging a little deeper into the new chimera detection mode added in this version of VSEARCH...

https://github.com/torognes/vsearch/commit/c2ffd0eb8dbfdf55b701f2127400b949bf0577ec