Closed mscarbor closed 7 months ago
Never mind, I think I found my answer on Page 74 of the v1.6.3 manual related to the LCA approach. The only clarifying question is this: let's say I have a read that's 100,000 nt long with something like 50 identified ORFs. How many of these would have to agree on taxonomy to assign the read to a certain taxon? Is there more "forgiveness" with longer reads? In other words, can I expect more unclassified reads when using long reads than short reads?
Short answer: it depends
Basically, long reads are treated like contigs: the taxonomy is calculated as the consensus of the ORFs using the same LCA+consensus algorithm described in the manual (@jtamames, correct me if I'm wrong). So longer reads may indeed fail to get classified at lower taxonomic ranks due to a failure to reach consensus.
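To make the idea concrete, here is a minimal sketch of a consensus-over-ORFs scheme. This is not the actual SqueezeMeta implementation; the rank list and the 50% agreement threshold are illustrative assumptions (see the manual for the real parameters).

```python
# Illustrative consensus taxonomy over the ORFs of one long read.
# NOT the actual SqueezeMeta algorithm; min_fraction=0.5 is an assumed threshold.
from collections import Counter

RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]

def consensus_taxonomy(orf_taxa, min_fraction=0.5):
    """orf_taxa: list of dicts mapping rank -> taxon, one per classified ORF.
    Returns the deepest taxonomy supported by >= min_fraction of the ORFs,
    stopping at the first rank where consensus breaks down."""
    consensus = {}
    n = len(orf_taxa)
    for rank in RANKS:
        counts = Counter(t[rank] for t in orf_taxa if rank in t)
        if not counts:
            break  # no ORF classified at this rank or below
        taxon, votes = counts.most_common(1)[0]
        if votes / n >= min_fraction:
            consensus[rank] = taxon
        else:
            break  # disagreement: keep only the ranks agreed on so far
    return consensus
```

Under a scheme like this, a long read with many ORFs can still get a high-rank assignment even when the ORFs disagree at genus or species level, which matches the behavior described above.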
On the other hand, when using the sqm_reads.pl script for classifying short reads, each read is classified based on its best blastx hits against nr (without ORF prediction), so there is no consensus between different taxonomy sources, only the LCA of the best hits. Some of these short reads may end up unclassified, even though the same region, had it been part of a long read, might have been classified (since other ORFs in the long read may have better classifications).
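For contrast with the consensus case, a plain LCA over the best hits can be sketched as taking the longest shared prefix of the hits' taxonomic paths. Again, this is a hypothetical illustration, not the sqm_reads.pl code:

```python
# Illustrative LCA of the best hits for one short read.
# Each path is a root-to-leaf list of taxa; the LCA is their shared prefix.
def lca(hit_paths):
    """hit_paths: list of taxonomic paths (lists of taxa, root first),
    one per best blastx hit. Returns the lowest-common-ancestor path."""
    if not hit_paths:
        return []
    common = []
    for taxa_at_rank in zip(*hit_paths):
        if all(t == taxa_at_rank[0] for t in taxa_at_rank):
            common.append(taxa_at_rank[0])
        else:
            break  # hits diverge here; everything above is the LCA
    return common
```

Note how a single divergent hit truncates the assignment at a high rank, with no other ORFs on the read available to rescue it, which is why an isolated short read can stay unclassified where the corresponding region of a long read would not.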
Closing due to lack of activity, feel free to reopen.
Hello! This is a question more than an issue. We used SQM for functional and taxonomic classification of Nanopore reads. For taxonomic classification when using sqm_longreads.pl, are the taxonomy results based on a consensus taxonomy of the reads or is taxonomy assigned individually to each identified ORF? Asking another way, when I am looking at the taxonomy plots resulting from SQM Tools, am I looking at the taxonomy on an ORF-level or on a read-level? Sorry if I missed this in the literature, and thanks again for creating and maintaining a very beneficial tool.