jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

Total number of viral scaffolds in combined dsDNAphage and ssDNA run < the sum of the two runs done separately #86

Status: Closed (Binvir closed this issue 3 years ago)

Binvir commented 3 years ago

Hi there,

Firstly, thank you for a very thorough virus-detection tool. I have run VS2 (ver. 2.2.3) on a single dataset three separate times, the only difference being the include-groups flag: 1) --include-groups dsDNAphage,ssDNA, 2) --include-groups dsDNAphage, 3) --include-groups ssDNA. The respective sequence counts in "final-viral-combined.fa" were 3307, 3301, and 897. I was wondering why the sum of the two separate runs (3301 + 897 = 4198) exceeds the count from the combined run (3307). Please let me know which approach you'd recommend if I am interested in detecting both dsDNAphage and ssDNA viruses (namely phage). Thank you for your time!

Nikhil

jiarong commented 3 years ago

Hi, thanks for the feedback. I recommend 1). VirSorter2's function is limited to separating viral from non-viral sequences, so its models are trained only to distinguish viral from non-viral, not to distinguish among viral groups. Your results indicate that most of the ssDNA hits overlap with the hits from the dsDNAphage model. If you are thinking about separating the ssDNA hits from the dsDNAphage hits, VirSorter2 is not designed to do that.
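
For readers who want to check the overlap themselves, here is a minimal sketch that compares the sequence IDs from the two single-group runs. It is not part of VirSorter2. The helper names `read_ids` and `overlap_report` are hypothetical, and the sketch assumes that each FASTA header ends in a `||`-separated suffix (such as `||full`) that can be stripped to recover the original contig name. Check your own headers before relying on that assumption.

```python
def read_ids(fasta_text):
    """Collect base sequence IDs from FASTA text.

    Assumption: anything after '||' in the first header token is a
    suffix added by the tool and is not part of the contig name.
    """
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip empty headers
                ids.add(fields[0].split("||")[0])
    return ids


def overlap_report(ds_ids, ss_ids):
    """Summarize how two sets of sequence IDs overlap."""
    return {
        "dsDNAphage_only": len(ds_ids - ss_ids),
        "ssDNA_only": len(ss_ids - ds_ids),
        "shared": len(ds_ids & ss_ids),
        "union": len(ds_ids | ss_ids),
    }


# Toy data standing in for the two final-viral-combined.fa files.
ds_fasta = ">c1||full\nACGT\n>c2||0_partial\nACGT\n>c3||full\nAC\n"
ss_fasta = ">c2||full\nACGT\n>c4||full\nAC\n"

report = overlap_report(read_ids(ds_fasta), read_ids(ss_fasta))
print(report)

# Inclusion-exclusion with the counts from this thread, assuming the
# combined run reports exactly the union of the two single-group runs:
# |A and B| = |A| + |B| - |A or B|
implied_shared = 3301 + 897 - 3307
print(implied_shared)  # 891
```

Under that assumption, roughly 891 of the 897 ssDNA hits would also appear in the dsDNAphage run, which is consistent with the overlap described above. To use real files, read each one with `open(path).read()` and pass the text to `read_ids`. Note that stripping the suffix collapses several records from the same contig into one ID, so set sizes can be smaller than raw record counts.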

Binvir commented 3 years ago

Hi, thank you for the prompt and clear response - I will go with option 1).

Best, Nikhil