vuthuyduong / dnabarcoder

Too many unassigned OTUs #2

Open LukeLikesDirt opened 2 months ago

LukeLikesDirt commented 2 months ago

Hi Vuthuyduong,

I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.
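The ~85% figure can be recomputed directly from the classification output. The following is a minimal sketch, assuming the file is tab-separated with a header row containing a `kingdom` column and that unassigned rows carry the literal value `unidentified` (both as in the tables shown in this thread). The function name `unassigned_fraction` is hypothetical and not part of dnabarcoder.

```python
import csv
import io


def unassigned_fraction(handle, column="kingdom", missing="unidentified"):
    """Return the fraction of data rows whose `column` equals `missing`.

    Assumes a tab-separated table with a header row (an assumption based on
    the tables pasted in this thread, not on dnabarcoder documentation).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    total = 0
    unassigned = 0
    for row in reader:
        total += 1
        if (row.get(column) or "").strip() == missing:
            unassigned += 1
    return unassigned / total if total else 0.0


# Tiny in-memory example using two rows in the layout shown in this thread.
example = io.StringIO(
    "ID\tReferenceID\tkingdom\n"
    "ASV_1;size=2362642\tUDB05261093\tunidentified\n"
    "ASV_13;size=169945\tUDB07371928\tFungi\n"
)
print(unassigned_fraction(example))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the handle to the function.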

Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.

There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.

Have I made an error, or do I not understand how the classification is supposed to work?

Thank you in advance for your help.

Luke

Script

# Constants and subdirectories
readonly THREADS=8
readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta"
readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json"
readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification"
readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta"
readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"

log 'Starting at:'

# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search \
  -i $QUERY_SEQUENCES \
  -r $REFERENCE_SEQUENCES \
  -ml 50

# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify \
  -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch \
  -c $CLASSIFIER \
  -cutoffs $BEST_MATCH

# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt

log 'Finished at:'

SLURM

Starting at: Sat Jul 27 06:40:17 AEST 2024

Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.

makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch

sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.

Finished at: Sat Jul 27 20:39:15 AEST 2024

Bestmatch file

| ID | ReferenceID | BLAST score | BLAST sim | BLAST coverage |
| --- | --- | --- | --- | --- |
| ASV_1;size=2362642 | UDB05261093 | 1 | 1 | 178 |
| ASV_2;size=1037588 | MW856689 | 1 | 1 | 132 |
| ASV_3;size=1412752 | UDB01614261 | 0.5425 | 0.875 | 31 |
| ASV_4;size=2201923 | UDB03085057 | 1 | 1 | 160 |
| ASV_5;size=3340601 | UDB03119248 | 1 | 1 | 154 |
| ASV_6;size=823557 | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 |
| ASV_7;size=830877 | MW214811 | 1 | 1 | 157 |
| ASV_8;size=4359501 | UDB05107909 | 1 | 1 | 151 |
| ASV_9;size=408829 | MZ016271 | 0.7246504 | 0.88372 | 41 |
| ASV_10;size=176701 | UDB05818296 | 1 | 1 | 179 |
| ASV_11;size=162862 | UDB04293913 | 1 | 1 | 141 |
| ASV_12;size=1535429 | MT991106 | 1 | 1 | 146 |
| ASV_13;size=169945 | UDB07371928 | 0.98324 | 0.98324 | 177 |
| ASV_14;size=130846 | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 |
| ASV_15;size=978833 | UDB03975281 | 1 | 1 | 135 |

Classification file

| ID | ReferenceID | kingdom | phylum | class | order | family | genus | species | rank | score | cutoff | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASV_1;size=2362642 | UDB05261093 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_2;size=1037588 | MW856689 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_3;size=1412752 | UDB01614261 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.5425 | N/A | N/A |
| ASV_4;size=2201923 | UDB03085057 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_5;size=3340601 | UDB03119248 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_6;size=823557 | UDB01261625 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.9823000000000001 | N/A | N/A |
| ASV_7;size=830877 | MW214811 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_8;size=4359501 | UDB05107909 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_9;size=408829 | MZ016271 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.7246504 | N/A | N/A |
| ASV_10;size=176701 | UDB05818296 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_11;size=162862 | UDB04293913 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_12;size=1535429 | MT991106 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |
| ASV_13;size=169945 | UDB07371928 | Fungi | Ascomycota | Pezizomycetes | Pezizales | Pezizales fam Incertae sedis | Sphaerosoma | unidentified | genus | 0.98324 | 0.969 | 0.4812 |
| ASV_14;size=130846 | UDB02651623 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.9607800000000001 | N/A | N/A |
| ASV_15;size=978833 | UDB03975281 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1 | N/A | N/A |

Classified file

| ID | Given label | Prediction | Full classification | Rank | Cut-off | Confidence | ReferenceID | BLAST score | BLAST sim | BLAST coverage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASV_1;size=2362642 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB05261093 | 1 | 1 | 178 |
| ASV_2;size=1037588 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | MW856689 | 1 | 1 | 132 |
| ASV_3;size=1412752 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB01614261 | 0.5425 | 0.875 | 31 |
| ASV_4;size=2201923 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB03085057 | 1 | 1 | 160 |
| ASV_5;size=3340601 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB03119248 | 1 | 1 | 154 |
| ASV_6;size=823557 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 |
| ASV_7;size=830877 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | MW214811 | 1 | 1 | 157 |
| ASV_8;size=4359501 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB05107909 | 1 | 1 | 151 |
| ASV_9;size=408829 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | MZ016271 | 0.7246504 | 0.88372 | 41 |
| ASV_10;size=176701 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB05818296 | 1 | 1 | 179 |
| ASV_11;size=162862 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB04293913 | 1 | 1 | 141 |
| ASV_12;size=1535429 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | MT991106 | 1 | 1 | 146 |
| ASV_13;size=169945 |  | Sphaerosoma | k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizales_fam_Incertae_sedis;g__Sphaerosoma;s__unidentified | genus | 0.969 | 0.4812 | UDB07371928 | 0.98324 | 0.98324 | 177 |
| ASV_14;size=130846 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 |
| ASV_15;size=978833 |  |  | k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified |  | N/A | N/A | UDB03975281 | 1 | 1 | 135 |

vuthuyduong commented 2 months ago

Dear Luke,

This file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch looks fine to me. I tried to create a bestmatch file containing the following lines:

| ID | ReferenceID | BLAST score | BLAST sim | BLAST coverage |
| --- | --- | --- | --- | --- |
| ASV_1;size=2362642 | UDB05261093 | 1 | 1 | 178 |
| ASV_2;size=1037588 | MW856689 | 1 | 1 | 132 |
| ASV_3;size=1412752 | UDB01614261 | 0.5425 | 0.875 | 31 |
| ASV_4;size=2201923 | UDB03085057 | 1 | 1 | 160 |
| ASV_5;size=3340601 | UDB03119248 | 1 | 1 | 154 |
| ASV_6;size=823557 | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 |
| ASV_7;size=830877 | MW214811 | 1 | 1 | 157 |
| ASV_8;size=4359501 | UDB05107909 | 1 | 1 | 151 |
| ASV_9;size=408829 | MZ016271 | 0.7246504 | 0.88372 | 41 |
| ASV_10;size=176701 | UDB05818296 | 1 | 1 | 179 |
| ASV_11;size=162862 | UDB04293913 | 1 | 1 | 141 |
| ASV_12;size=1535429 | MT991106 | 1 | 1 | 146 |
| ASV_13;size=169945 | UDB07371928 | 0.98324 | 0.98324 | 177 |
| ASV_14;size=130846 | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 |
| ASV_15;size=978833 | UDB03975281 | 1 | 1 | 135 |

and ran the command:

python dnabarcoder/dnabarcoder.py classify -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch -c ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification -cutoffs ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json

Here is what I obtained:

| ID | ReferenceID | kingdom | phylum | class | order | family | genus | species | rank | score | cutoff | confidence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ASV_1;size=2362642 | UDB05261093 | Fungi | Ascomycota | Dothideomycetes | Dothideales | unidentified | unidentified | unidentified | order | 1.0 | 0.925 | 0.5634 |
| ASV_2;size=1037588 | MW856689 | Fungi | Basidiomycota | Tremellomycetes | Tremellales | Bulleribasidiaceae | Vishniacozyma | unidentified | genus | 1.0 | 0.969 | 0.4812 |
| ASV_3;size=1412752 | UDB01614261 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.5425 | N/A | N/A |
| ASV_4;size=2201923 | UDB03085057 | Fungi | Basidiomycota | Microbotryomycetes | Kriegeriales | Camptobasidiaceae | Glaciozyma | unidentified | genus | 1.0 | 0.969 | 0.4812 |
| ASV_5;size=3340601 | UDB03119248 | Fungi | Ascomycota | Dothideomycetes | Cladosporiales | Cladosporiaceae | Cladosporium | unidentified | genus | 1.0 | 0.969 | 0.4812 |
| ASV_6;size=823557 | UDB01261625 | Fungi | Ascomycota | Archaeorhizomycetes | Archaeorhizomycetales | Archaeorhizomycetaceae | Archaeorhizomyces | unidentified | genus | 0.9823000000000001 | 0.969 | 0.4812 |
| ASV_7;size=830877 | MW214811 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 1.0 | N/A | N/A |
| ASV_8;size=4359501 | UDB05107909 | Fungi | Ascomycota | Dothideomycetes | Mycosphaerellales | Teratosphaeriaceae | Devriesia | unidentified | genus | 1.0 | 0.969 | 0.4812 |
| ASV_9;size=408829 | MZ016271 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified |  | 0.7246504 | N/A | N/A |
| ASV_10;size=176701 | UDB05818296 | Fungi | Ascomycota | Pezizomycetes | Pezizales | Pyronemataceae | Pyronema | unidentified | genus | 1.0 | 0.951 | 0.6336 |
| ASV_11;size=162862 | UDB04293913 | Fungi | Ascomycota | Sordariomycetes | Amphisphaeriales | Sporocadaceae | Pestalotiopsis | unidentified | genus | 1.0 | 0.969 | 0.4812 |
| ASV_12;size=1535429 | MT991106 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Nectriaceae | Fusarium | Fusarium solani | species | 1.0 | 0.988 | 0.8358 |
| ASV_13;size=169945 | UDB07371928 | Fungi | Ascomycota | Pezizomycetes | Pezizales | Pezizales fam Incertae sedis | Sphaerosoma | unidentified | genus | 0.98324 | 0.969 | 0.4812 |
| ASV_14;size=130846 | UDB02651623 | Fungi | Basidiomycota | Agaricomycetes | Geastrales | Geastraceae | unidentified | unidentified | family | 0.9607800000000001 | 0.933 | 0.46 |
| ASV_15;size=978833 | UDB03975281 | Fungi | Ascomycota | Sordariomycetes | Hypocreales | Clavicipitaceae | Metarhizium | unidentified | genus | 1.0 | 0.984 | 0.6507 |

Could this be related to a memory problem? Can you also try running the classify command with a small file to see if it works?

Otherwise, we can always set up a meeting if you prefer.

Best regards, Duong
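The small-file test suggested above could be prepared along these lines. This is a minimal sketch, assuming the bestmatch file is plain text with one header line followed by one row per query sequence. The helper name `write_head` and the file names in the demo are hypothetical.

```python
import os
import tempfile
from itertools import islice


def write_head(src_path, dst_path, n_rows=15):
    """Copy the header line plus the first n_rows data lines.

    Returns the number of data lines written (the header is not counted).
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(src.readline())  # header line
        written = 0
        for line in islice(src, n_rows):
            dst.write(line)
            written += 1
    return written


# Demo on a throwaway file with made-up IDs (not real data).
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "full.bestmatch")
    dst = os.path.join(tmp, "small.bestmatch")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("ID\tReferenceID\n")
        fh.writelines(f"ASV_{i}\tREF_{i}\n" for i in range(1, 41))
    demo_written = write_head(src, dst, n_rows=15)
    with open(dst, encoding="utf-8") as fh:
        demo_lines = fh.read().splitlines()
```

The resulting small file can then be passed to `classify` through `-i`, with the same `-c` and `-cutoffs` arguments as in the command above.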



vuthuyduong commented 2 months ago

For the Krona problem, I have noticed that it is related to the setup of Krona. I will change the code to make it work in any environment.

Best regards Duong
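The message `sh: ImportText.pl: command not found` means the shell could not find an executable with that name on `PATH`. A minimal sketch of a pre-flight check is given below. The helper `missing_tools` is hypothetical, and `ktImportText` is listed only on the assumption that the local KronaTools installation exposes the script under that name.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# The first name is taken from the SLURM log; the second is an assumed alias.
print(missing_tools(["ImportText.pl", "ktImportText"]))
```

An empty list means both names resolve. Any name that is printed would need to be added to `PATH` in the job script before the Krona step can run.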

On Wed, 7 Aug 2024 at 12:20, Luke Florence @.***> wrote:

Hi Vuthuyduong,

I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.

Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.

There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.

Have I made an error, or do I not understand how the classification is supposed to work?

Thank you in advance for your help.

Luke Script

Constants and subdirectories

readonly THREADS=8 readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta" readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json" readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification" readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta" readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"

log 'Starting at:'

Search for the best matches of the sequences

python dnabarcoder/dnabarcoder.py search -i $QUERY_SEQUENCES -r $REFERENCE_SEQUENCES -ml 50

Assign the sequences to different taxonomic groups

python dnabarcoder/dnabarcoder.py classify -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch -c $CLASSIFIER -cutoffs $BEST_MATCH

Move the classification files to the taxonomy subdirectory

mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt

log 'Finished at:' SLURM

Starting at: Sat Jul 27 06:40:17 AEST 2024

Building a new DB, current time: 07/27/2024 06:44:02 New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta Sequence type: Nucleotide Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb Keep MBits: T Maximum file size: 3000000000B FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57 FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7 Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.

makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96 The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch

sh: ImportText.pl: command not found Number of classified sequences: 22362 The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification. The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.

Finished at: Sat Jul 27 20:39:15 AEST 2024

Bestmatch file

| ID | ReferenceID | BLAST score | BLAST sim | BLAST coverage |
|---|---|---|---|---|
| ASV_1;size=2362642 | UDB05261093 | 1 | 1 | 178 |
| ASV_2;size=1037588 | MW856689 | 1 | 1 | 132 |
| ASV_3;size=1412752 | UDB01614261 | 0.5425 | 0.875 | 31 |
| ASV_4;size=2201923 | UDB03085057 | 1 | 1 | 160 |
| ASV_5;size=3340601 | UDB03119248 | 1 | 1 | 154 |
| ASV_6;size=823557 | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 |
| ASV_7;size=830877 | MW214811 | 1 | 1 | 157 |
| ASV_8;size=4359501 | UDB05107909 | 1 | 1 | 151 |
| ASV_9;size=408829 | MZ016271 | 0.7246504 | 0.88372 | 41 |
| ASV_10;size=176701 | UDB05818296 | 1 | 1 | 179 |
| ASV_11;size=162862 | UDB04293913 | 1 | 1 | 141 |
| ASV_12;size=1535429 | MT991106 | 1 | 1 | 146 |
| ASV_13;size=169945 | UDB07371928 | 0.98324 | 0.98324 | 177 |
| ASV_14;size=130846 | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 |
| ASV_15;size=978833 | UDB03975281 | 1 | 1 | 135 |

Classification file

| ID | ReferenceID | kingdom | phylum | class | order | family | genus | species | rank | score | cutoff | confidence |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASV_1;size=2362642 | UDB05261093 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_2;size=1037588 | MW856689 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_3;size=1412752 | UDB01614261 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 0.5425 | N/A | N/A |
| ASV_4;size=2201923 | UDB03085057 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_5;size=3340601 | UDB03119248 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_6;size=823557 | UDB01261625 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 0.9823000000000001 | N/A | N/A |
| ASV_7;size=830877 | MW214811 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_8;size=4359501 | UDB05107909 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_9;size=408829 | MZ016271 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 0.7246504 | N/A | N/A |
| ASV_10;size=176701 | UDB05818296 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_11;size=162862 | UDB04293913 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_12;size=1535429 | MT991106 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |
| ASV_13;size=169945 | UDB07371928 | Fungi | Ascomycota | Pezizomycetes | Pezizales | Pezizales fam Incertae sedis | Sphaerosoma | unidentified | genus | 0.98324 | 0.969 | 0.4812 |
| ASV_14;size=130846 | UDB02651623 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 0.9607800000000001 | N/A | N/A |
| ASV_15;size=978833 | UDB03975281 | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | unidentified | | 1 | N/A | N/A |

Classified file

| ID | Given label | Prediction | Full classification | Rank | Cut-off | Confidence | ReferenceID | BLAST score | BLAST sim | BLAST coverage |
|---|---|---|---|---|---|---|---|---|---|---|
| ASV_1;size=2362642 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB05261093 | 1 | 1 | 178 |
| ASV_2;size=1037588 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | MW856689 | 1 | 1 | 132 |
| ASV_3;size=1412752 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB01614261 | 0.5425 | 0.875 | 31 |
| ASV_4;size=2201923 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB03085057 | 1 | 1 | 160 |
| ASV_5;size=3340601 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB03119248 | 1 | 1 | 154 |
| ASV_6;size=823557 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 |
| ASV_7;size=830877 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | MW214811 | 1 | 1 | 157 |
| ASV_8;size=4359501 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB05107909 | 1 | 1 | 151 |
| ASV_9;size=408829 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | MZ016271 | 0.7246504 | 0.88372 | 41 |
| ASV_10;size=176701 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB05818296 | 1 | 1 | 179 |
| ASV_11;size=162862 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB04293913 | 1 | 1 | 141 |
| ASV_12;size=1535429 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | MT991106 | 1 | 1 | 146 |
| ASV_13;size=169945 | | Sphaerosoma | `k__Fungi;p__Ascomycota;c__Pezizomycetes;o__Pezizales;f__Pezizales_fam_Incertae_sedis;g__Sphaerosoma;s__unidentified` | genus | 0.969 | 0.4812 | UDB07371928 | 0.98324 | 0.98324 | 177 |
| ASV_14;size=130846 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 |
| ASV_15;size=978833 | | | `k__unidentified;p__unidentified;c__unidentified;o__unidentified;f__unidentified;g__unidentified;s__unidentified` | | N/A | N/A | UDB03975281 | 1 | 1 | 135 |
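To put a number on how many ASVs remain unassigned, the kingdom column of the classification output can be tallied. The following is a minimal sketch, not part of dnabarcoder: it assumes the file is tab-separated with a header row containing a `kingdom` column (as in the head of the file shown here), and the function name `tally_kingdoms` and the inline sample are made up for illustration.

```python
import csv
import io
from collections import Counter


def tally_kingdoms(handle):
    """Count rows per kingdom label in a tab-separated classification table.

    Assumes a header row with a column named 'kingdom' (an assumption based
    on the file head shown in this thread).
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["kingdom"] for row in reader)


# Tiny inline sample with a reduced set of columns, for illustration only.
sample = (
    "ID\tReferenceID\tkingdom\n"
    "ASV_1\tUDB05261093\tunidentified\n"
    "ASV_13\tUDB07371928\tFungi\n"
    "ASV_14\tUDB02651623\tunidentified\n"
)

counts = tally_kingdoms(io.StringIO(sample))
total = sum(counts.values())
unassigned_fraction = counts["unidentified"] / total
print(f"{counts['unidentified']} of {total} rows unidentified at kingdom level")
```

On a real run, the `io.StringIO(sample)` argument would be replaced by an open file handle, for example `open(path, newline="")`.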


LukeLikesDirt commented 2 months ago

Dear Duong

When I run the classification step on the first 15 ASVs (as you have done above), we get different results. My results are the same as when I ran the entire best match file, so the issue is not a memory problem. I expected as much, because I requested a massive amount of memory from the cluster.

However, I suspect the issue could be with the unite2024ITS1.classification file (https://zenodo.org/records/12580255/files/unite2024ITS1.classification?download=1) that I downloaded from Zenodo along with the unite2024ITS1.fasta file. I noticed that the unite2024ITS1.classification file is 82.5 MB, whereas the ITS and ITS2 versions are 262.6 MB and 229.4 MB, respectively. How large is the unite2024ITS1.classification file that you are using?

I think the unite2024ITS1.classification file did not upload correctly to Zenodo, resulting in many reference taxa being missing. For example, of the classifications that you get, only ASV_13;size=169945 (UDB07371928, Sphaerosoma) is found in my unite2024ITS1.classification file. This makes sense because that is the only classification we have in common (the rest of mine remain unidentified at the kingdom level).
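The lookup described above can be scripted. This is a rough sketch rather than anything dnabarcoder provides: it assumes both files are tab-separated with a header row, that the bestmatch file has ReferenceID in its second column, and that the classification file has the sequence ID in its first column. The helper names `read_column` and `missing_references` and the inline sample rows are hypothetical.

```python
import io


def read_column(handle, index):
    """Return the values of one tab-separated column, skipping the header row."""
    next(handle, None)  # skip the header line
    values = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) > index and fields[index]:
            values.append(fields[index])
    return values


def missing_references(bestmatch_handle, classification_handle):
    """List matched ReferenceIDs that have no row in the classification file."""
    # Assumption: ReferenceID is the second column of the bestmatch file.
    matched = read_column(bestmatch_handle, 1)
    # Assumption: the sequence ID is the first column of the classification file.
    known = set(read_column(classification_handle, 0))
    return [ref for ref in matched if ref not in known]


# Inline samples standing in for the two files (illustrative rows only).
bestmatch = io.StringIO(
    "ID\tReferenceID\tBLAST score\n"
    "ASV_1\tUDB05261093\t1\n"
    "ASV_13\tUDB07371928\t0.98324\n"
)
classification = io.StringIO(
    "id\tkingdom\tgenus\n"
    "UDB07371928\tFungi\tSphaerosoma\n"
)

missing = missing_references(bestmatch, classification)
print(missing)
```

If most matched reference IDs come back as missing, the classification file is probably incomplete, which is a more direct signal than comparing file sizes.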

Could you please either reupload the unite2024ITS1.classification file to Zenodo or share a folder with me to access the unite2024ITS1.classification file that you use?

Thank you kindly in advance for your time.

Warm regards Luke

vuthuyduong commented 2 months ago

Dear Luke,

Thank you very much. Yes, you were right. Somehow, the upload of the unite2024ITS1.classification file went wrong. I've created a new Zenodo record available at https://zenodo.org/records/13336328, containing all UNITE ITS1, ITS2, and ITS sequences, along with their classifications, ready for use with dnabarcoder. Please let me know if you still encounter any issues.

Best, Duong


LukeLikesDirt commented 2 months ago

Hi Duong

Thank you for updating the classification file. My classification output now makes sense.

Kind regards Luke