mtisza1 / Cenote-Taker2

Cenote-Taker2: Discover and Annotate Divergent Viral Contigs (Please use Cenote-Taker 3 instead)
MIT License

cenote-taker2 vs (blastn nt & diamond nr) #42

Open NailouZhang opened 1 year ago

NailouZhang commented 1 year ago

Hi Mike, I recently ran Cenote-Taker 2 on contigs assembled with MEGAHIT, and also searched the same contigs with blastn against the nt database and DIAMOND against the nr database. Cenote-Taker 2 classified about 10,000 sequences as viral, while BLAST identified only about 1,000. I am confused about why BLAST finds roughly ten times fewer viral sequences than Cenote-Taker 2.

As you pointed out, "Many virus genomes are integrated into host chromosomes" and "viral genes and genomes are often misidentified as host sequences" (Tisza M J, Belford A K, Dominguez-Huerta G, et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evolution, 2021, 7(1): veaa100). BLAST may therefore produce some false-negative results. Is there a threshold for classifying sequences as viral or non-viral with both tools (e.g. BLAST E-value, percent identity, or alignment length)?

Wishing you a merry Christmas in advance!

Nailou Zhang

mtisza1 commented 1 year ago

Hi Nailou,

Thanks for your comment. It's a bit complicated to assess this without more information about how Cenote-Taker 2 was run and what settings you used with blast and diamond.

Using blastn against nt can be a great way to find viruses that are present in this database, along with their close relatives. However, the vast majority of the viruses that exist on Earth are not catalogued in nt. Recent estimates suggest that there are around 1 billion virus species on Earth, whereas the number of virus species in nt is in the tens of thousands.
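As a rough back-of-the-envelope illustration of those two figures (taking "tens of thousands" to mean somewhere between 10^4 and 10^5, which is an assumption rather than a measured count):

```latex
\[
\frac{10^{4}\text{ to }10^{5}\ \text{virus species in nt}}{10^{9}\ \text{virus species on Earth}}
\;\approx\; 10^{-5}\text{ to }10^{-4},
\quad\text{i.e. roughly } 0.001\%\text{ to }0.01\%.
\]
```

Under that assumption, a nucleotide search against nt can only directly recognize a very small fraction of viral diversity, which is consistent with BLAST reporting far fewer viral contigs.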

Of course, as a general statement, Cenote-Taker 2 will return false positives at some unknown rate. If you are querying contigs assembled from WGS reads and you use -db virion --lin_minimum_hallmark_genes 2 --circ_minimum_hallmark_genes 2, I would estimate that the false positive rate is about 1%, maybe less. In my opinion, it's hard to measure this meaningfully.
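One way to explore the discrepancy is to compare the two call sets contig by contig. The following is only a minimal sketch, not part of Cenote-Taker 2: it assumes the BLAST hits were written in the default 12-column tabular format (`-outfmt 6`), that the hit table has already been restricted to viral subjects (for example by taxonomy), and that the Cenote-Taker 2 calls have been collected as a set of original contig names. The function names and the default cutoffs (`max_evalue`, `min_pident`, `min_length`) are hypothetical placeholders, not recommended thresholds.

```python
import csv
import io

# Column order of BLAST tabular output (-outfmt 6) with the default fields.
BLAST6_FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def confident_hits(handle, max_evalue=1e-5, min_pident=70.0, min_length=200):
    """Return query IDs with at least one hit passing all three cutoffs.

    The default cutoffs are illustrative placeholders only.
    """
    passing = set()
    reader = csv.DictReader(handle, fieldnames=BLAST6_FIELDS, delimiter="\t")
    for row in reader:
        if (float(row["evalue"]) <= max_evalue
                and float(row["pident"]) >= min_pident
                and int(row["length"]) >= min_length):
            passing.add(row["qseqid"])
    return passing


def compare_calls(cenote_ids, blast_ids):
    """Split contigs into those called by both tools or by only one."""
    cenote_ids, blast_ids = set(cenote_ids), set(blast_ids)
    return {
        "both": cenote_ids & blast_ids,
        "cenote_only": cenote_ids - blast_ids,
        "blast_only": blast_ids - cenote_ids,
    }


# Tiny made-up example: two BLAST hit rows and two Cenote-Taker 2 calls.
example = io.StringIO(
    "contig_1\tsubjA\t95.0\t1500\t75\t0\t1\t1500\t1\t1500\t1e-50\t900\n"
    "contig_2\tsubjB\t65.0\t120\t42\t0\t1\t120\t1\t120\t1e-3\t50\n"
)
blast_ids = confident_hits(example)
result = compare_calls({"contig_1", "contig_3"}, blast_ids)
for group, ids in result.items():
    print(group, sorted(ids))
```

Contigs that land in the `cenote_only` group are the ones worth inspecting by hand, since they may be either divergent viruses missing from nt or false positives.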