I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.
Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.
There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.
Have I made an error, or do I not understand how the classification is supposed to work?
# Search for the best matches of the sequences
python dnabarcoder/dnabarcoder.py search \
-i $QUERY_SEQUENCES \
-r $REFERENCE_SEQUENCES \
-ml 50
# Assign the sequences to different taxonomic groups
python dnabarcoder/dnabarcoder.py classify \
-i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch \
-c $CLASSIFIER \
-cutoffs $BEST_MATCH
# Move the classification files to the taxonomy subdirectory
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt
mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt
log 'Finished at:'
SLURM
Starting at: Sat Jul 27 06:40:17 AEST 2024
Building a new DB, current time: 07/27/2024 06:44:02
New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta
Sequence type: Nucleotide
Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
Keep MBits: T
Maximum file size: 3000000000B
FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57
FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7
Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.
makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb
blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
sh: ImportText.pl: command not found
Number of classified sequences: 22362
The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification.
The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.
Hi Vuthuyduong,
I have followed your pipeline for the classification of some ASVs. My reads are ITS1 extracted, and I’ve used the ITS1 extracted UNITE v10 database that you prepared (thank you!). However, most of my ASVs (~85%) are unassigned at the fungi level after classification. This doesn’t make sense to me, as the majority of the unassigned ASVs had coverage > 90% and similarity > 95% when I previously BLASTed them. And a good portion had coverage = 100% and similarity > 98%.
Below is the head of the “bestmatch” file and the “classified” and “classification” files. I have also included your script, which I slightly modified to run on the cluster and fit my project (perhaps I made an error here?), as well as the SLURM file.
There is one error in the SLURM file: “sh: ImportText.pl: command not found”. I think this is related to the krona.html file, which was not written.
Have I made an error, or do I not understand how the classification is supposed to work?
Thank you in advance for your help.
Luke
Script
# Constants and subdirectories readonly THREADS=8 readonly REFERENCE_SEQUENCES="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta" readonly BEST_MATCH="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.unique.cutoffs.best.json" readonly CLASSIFIER="../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.classification" readonly QUERY_SEQUENCES="../../data/bioinformatics/08.ASVs/ASVs.fasta" readonly OUTPUT="../../data/bioinformatics/09.Taxonomy"
log 'Starting at:'
# Search for the best matches of the sequences python dnabarcoder/dnabarcoder.py search \ -i $QUERY_SEQUENCES \ -r $REFERENCE_SEQUENCES \ -ml 50
# Assign the sequences to different taxonomic groups python dnabarcoder/dnabarcoder.py classify \ -i dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch \ -c $CLASSIFIER \ -cutoffs $BEST_MATCH
# Move the classification files to the taxonomy subdirectory mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classified $OUTPUT/ASVs.unite2024ITS1_BLAST.classified.txt mv dnabarcoder/ASVs.unite2024ITS1_BLAST.classification $OUTPUT/ASVs.unite2024ITS1_BLAST.classification.txt
log 'Finished at:'
SLURM
Starting at: Sat Jul 27 06:40:17 AEST 2024
Building a new DB, current time: 07/27/2024 06:44:02 New DB name: /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb New DB title: ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta Sequence type: Nucleotide Deleted existing Nucleotide BLAST database named /data/group/frankslab/project/LFlorence/AusMycobiome/data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb Keep MBits: T Maximum file size: 3000000000B FASTA-Reader: Ignoring invalid residues at position(s): On line 629439: 57 FASTA-Reader: Ignoring invalid residues at position(s): On line 629440: 1-7 Adding sequences from FASTA; added 1899789 sequences in 21.573 seconds.
makeblastdb -in ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.fasta -dbtype 'nucl' -out ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb blastn -query ../../data/bioinformatics/08.ASVs/ASVs.indexed.fasta -db ../../data/bioinformatics/06.Reference_dataset/dnabarcoder/unite2024ITS1.blastdb -task blastn-short -outfmt 6 -out ../../data/bioinformatics/08.ASVs/ASVs.unite2024ITS1.blastoutput -num_threads 96 The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.bestmatch
sh: ImportText.pl: command not found Number of classified sequences: 22362 The results are saved in file dnabarcoder/ASVs.unite2024ITS1_BLAST.classified and dnabarcoder/ASVs.unite2024ITS1_BLAST.classification. The krona report and html are saved in files dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.report and dnabarcoder/ASVs.unite2024ITS1_BLAST.krona.html.
Finished at: Sat Jul 27 20:39:15 AEST 2024
Bestmatch file
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40">
ID | ReferenceID | BLAST score | BLAST sim | BLAST coverage -- | -- | -- | -- | -- ASV_1;size=2362642 | UDB05261093 | 1 | 1 | 178 ASV_2;size=1037588 | MW856689 | 1 | 1 | 132 ASV_3;size=1412752 | UDB01614261 | 0.5425 | 0.875 | 31 ASV_4;size=2201923 | UDB03085057 | 1 | 1 | 160 ASV_5;size=3340601 | UDB03119248 | 1 | 1 | 154 ASV_6;size=823557 | UDB01261625 | 0.9823000000000001 | 0.9823000000000001 | 112 ASV_7;size=830877 | MW214811 | 1 | 1 | 157 ASV_8;size=4359501 | UDB05107909 | 1 | 1 | 151 ASV_9;size=408829 | MZ016271 | 0.7246504 | 0.88372 | 41 ASV_10;size=176701 | UDB05818296 | 1 | 1 | 179 ASV_11;size=162862 | UDB04293913 | 1 | 1 | 141 ASV_12;size=1535429 | MT991106 | 1 | 1 | 146 ASV_13;size=169945 | UDB07371928 | 0.98324 | 0.98324 | 177 ASV_14;size=130846 | UDB02651623 | 0.9607800000000001 | 0.9607800000000001 | 50 ASV_15;size=978833 | UDB03975281 | 1 | 1 | 135