jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

final-viral-boundary.tsv and final-viral-score.tsv have different number of contigs/seqnames #68

Open flannsmith opened 3 years ago

flannsmith commented 3 years ago

Hi, I'm just wondering why fewer contigs appear in the final-viral-score file (9808) than in the final-viral-boundary file (12113). Can I be confident that all seqnames identified in the final-viral-boundary file are viral?

Also, a number of lines in the viral-score file are empty or don't include the confidence score for each viral group, but are still classified as dsDNAphage. Is that normal, or should I filter these out?


dsDNAphage | ssDNA | RNA | NCLDV | lavidaviridae | max_score | max_score_group | length | hallmark | viral | cellular
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
NaN | NaN | NaN | NaN | NaN | NaN | dsDNAphage | 1526 | 1 | 100.0 | 0.0
NaN | NaN | NaN | NaN | NaN | NaN | dsDNAphage | 1517 | 1 | 100.0 | 0.0
NaN | NaN | NaN | NaN | NaN | NaN | dsDNAphage | 1508 | 1 | 50.0 | 0.0
NaN | NaN | NaN | NaN | NaN | NaN | dsDNAphage | 1504 | 1 | 100.0 | 0.0

Any insight much appreciated! Thanks.

jiarong commented 3 years ago

Hi, some contigs in the boundary file are removed (and so do not appear in the score file) if their viral gene % is less than their cellular gene %. I do not recommend including them, since they are more likely to be false positives, unless you can verify them in other ways.
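To see which contigs were dropped between the two files, one option is to compare their seqname columns directly. The following is a minimal sketch, not part of VirSorter2: it assumes both files are tab-separated with a `seqname` column, the helper names `read_seqnames` and `boundary_only` are hypothetical, and the sample rows are made up. If the names in the two files are not formatted identically, the `normalize` hook would need to be adapted.

```python
import csv
import io


def read_seqnames(handle, column="seqname", normalize=lambda name: name):
    """Return the set of values found in `column` of a tab-separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {normalize(row[column]) for row in reader}


def boundary_only(boundary_handle, score_handle, **kwargs):
    """Seqnames present in the boundary table but absent from the score table."""
    in_boundary = read_seqnames(boundary_handle, **kwargs)
    in_score = read_seqnames(score_handle, **kwargs)
    return in_boundary - in_score


# Made-up stand-ins for final-viral-boundary.tsv and final-viral-score.tsv.
boundary_tsv = "seqname\ncontig_1\ncontig_2\n"
score_tsv = "seqname\tmax_score\ncontig_1\t0.98\n"

dropped = boundary_only(io.StringIO(boundary_tsv), io.StringIO(score_tsv))
print(sorted(dropped))  # prints ['contig_2']
```

With real output, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`.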

Those without confidence scores are short contigs that have fewer than 2 complete genes but carry hallmark genes. In your case, those hallmark genes are from the dsDNAphage group.
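The rows without scores can be separated from the scored ones so they can be reviewed on their own. This is only an illustrative sketch: it assumes the score file is tab-separated and uses the `seqname` and `max_score` column names from this thread, the helpers `is_missing` and `split_by_score` are hypothetical, and the sample rows (including `contig_b` and its values) are invented. Whether to keep or discard the unscored rows is left to the reader.

```python
import csv
import io
import math


def is_missing(value):
    """True for an empty cell or any spelling of NaN."""
    text = (value or "").strip()
    if not text:
        return True
    try:
        return math.isnan(float(text))
    except ValueError:
        # Non-numeric text is treated as a present (if odd) value.
        return False


def split_by_score(handle, score_column="max_score"):
    """Split rows into (scored, unscored) based on `score_column`."""
    scored, unscored = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        (unscored if is_missing(row[score_column]) else scored).append(row)
    return scored, unscored


# Made-up rows shaped like the table in the question.
score_tsv = (
    "seqname\tmax_score\tmax_score_group\tlength\thallmark\n"
    "contig_a\tNaN\tdsDNAphage\t1526\t1\n"
    "contig_b\t0.95\tdsDNAphage\t8000\t3\n"
)

scored, unscored = split_by_score(io.StringIO(score_tsv))
print([row["seqname"] for row in unscored])  # prints ['contig_a']
```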

flannsmith commented 3 years ago

@jiarong Just seeing your comment now for some reason. Thanks for your response!