Open NailouZhang opened 1 year ago
Hi Nailou,
Thanks for your comment. It's a bit complicated to assess this without more information about how Cenote-Taker 2 was run and what settings you used with blast and diamond.
Using blastn against nt could be a great way to look for viruses present in this database and their close relatives. However, the vast majority of the viruses that exist on Earth are not catalogued in nt. Recent estimates suggest that there are around 1 billion virus species on Earth. The number of virus species in nt is in the tens of thousands.
Of course, as a general statement, Cenote-Taker 2 will return false positives at some unknown rate. If you are querying contigs assembled from WGS reads and you use -db virion --lin_minimum_hallmark_genes 2 --circ_minimum_hallmark_genes 2, I would estimate the false positive rate is only about 1%, maybe less. It's hard to measure this meaningfully, in my opinion.
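To see where the two result sets actually differ, one rough approach is to filter the blastn tabular output and compare contig IDs against the Cenote-Taker 2 calls. The following is only a minimal sketch, assuming blastn was run with -outfmt 6 (the standard 12 tab-separated columns). The function names and the cutoff values are hypothetical placeholders for illustration, not recommended thresholds, and a contig with no passing blast hit is not necessarily non-viral, for the reasons given above.

```python
import csv
import io


def blast_hit_contigs(handle, min_pident=90.0, min_length=500, max_evalue=1e-10):
    """Return query IDs with at least one hit passing all cutoffs.

    Expects blastn -outfmt 6 columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    The default cutoffs are illustrative assumptions only.
    """
    passing = set()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank, truncated, or comment lines.
        if len(row) < 12 or row[0].startswith("#"):
            continue
        pident, length, evalue = float(row[2]), int(row[3]), float(row[10])
        if pident >= min_pident and length >= min_length and evalue <= max_evalue:
            passing.add(row[0])
    return passing


def compare_calls(blast_ids, cenote_ids):
    """Split contig IDs into those found by both tools or by only one."""
    return {
        "both": blast_ids & cenote_ids,
        "blast_only": blast_ids - cenote_ids,
        "cenote_only": cenote_ids - blast_ids,
    }


if __name__ == "__main__":
    # Tiny made-up example in place of a real blastn output file.
    demo = "contigA\tsubject1\t97.5\t1200\t30\t0\t1\t1200\t1\t1200\t1e-80\t2000\n"
    blast_ids = blast_hit_contigs(io.StringIO(demo))
    # The Cenote-Taker 2 contig IDs here are invented for the demo.
    result = compare_calls(blast_ids, {"contigA", "contigB"})
    for label, ids in result.items():
        print(label, sorted(ids))
```

Looking at the "cenote_only" contigs by hand (hallmark genes, contig length, circularity) is probably more informative than tuning any single blast cutoff.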
Hi Mike, I recently ran cenote-taker2, blastn against the nt database, and diamond against the nr database on contigs assembled by Megahit. I found that about 10,000 sequences were classified as viruses by cenote-taker2, while only about 1,000 were identified by blast. I am confused about why blast identifies ten times fewer sequences than cenote-taker2.
As you pointed out, "Many virus genomes are integrated into host chromosomes" and "viral genes and genomes are often misidentified as host sequences" (Tisza M J, Belford A K, Dominguez-Huerta G, et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation[J]. Virus Evolution, 2021, 7(1): veaa100). Thus, blast may produce some false negatives. So, is there a threshold for classifying sequences as viral or non-viral using both tools (e.g. blast p-value, percent identity, or mapping length)?
Wish you a merry Christmas in advance!
Nailou Zhang