jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

mitochondria classified as NCLDV #44

Open FWittmers opened 3 years ago

FWittmers commented 3 years ago

Hey!

First of all, really appreciate virsorter (the first and also this followup tool).
With the new classification, I noticed that (circular) mitochondrial sequences are consistently classified as NCLDV. I suppose it would be quite straightforward to recognise these by using the large amount of ribosomal RNA as a negative filter, since this clearly sets the sequences apart from viral / NCLDV sequences and would improve the classification imo. Interested to hear what you think about this.

Best, Fabian Wittmers

jiarong commented 3 years ago

Good suggestion! We have seen that the NCLDV model has a high false-positive rate with eukaryotic sequences, but were not aware of mitochondria specifically. Some manual checking should be done on those NCLDV hits that score very low with other viral groups. What score range do you see for those mitochondrial sequences?

FWittmers commented 3 years ago

I included the subset of the viral-boundary table that you are referring to. All 3 are classified as mitochondria. I am wondering why they are classified as viral at all, given that the viral score is 0?

[Screenshot: Screen Shot 2021-02-25 at 21 05 01]

I see the problem with classifying NCLDV compared to dsDNAphages. It seems much more complex to develop a scoring system that does not miss NCLDV seqs that carry eukaryotic / prokaryotic AMG proteins and not only NCVOGs. I am sure this is the broader scope you are thinking about. For my part, I am trying to decide whether sequences in GV bins are actually all viral, etc., so this is constantly somewhere in my mind.

For the mitochondria, I feel the easiest approach would be to screen for the ribosomal components (30S/50S) in there. These should certainly not show up in NCLDV (especially not all next to each other and as conserved as in mitochondria). So you could exclude mitochondria this way, independent of the greater NCLDV question.
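
To make the idea concrete, here is a minimal Python sketch of such a negative filter. It is not part of VirSorter2. It assumes you already have gene product annotations per contig from some annotation tool, and the `flag_ribosomal_rich` helper, the regular expression, and the `min_hits=3` threshold are all hypothetical choices for illustration.

```python
import re

# Hypothetical pattern for ribosomal annotations; adjust it to whatever
# wording your annotation tool actually produces.
RIBOSOMAL = re.compile(r"\b(ribosomal protein|ribosomal RNA|rRNA)\b", re.IGNORECASE)


def flag_ribosomal_rich(annotations, min_hits=3):
    """Return {contig: hit_count} for contigs with at least min_hits ribosomal genes.

    annotations: dict mapping contig name -> list of gene product descriptions.
    min_hits: arbitrary illustrative threshold, not a validated cutoff.
    """
    flagged = {}
    for contig, products in annotations.items():
        hits = sum(1 for product in products if RIBOSOMAL.search(product))
        if hits >= min_hits:
            flagged[contig] = hits
    return flagged


# Made-up annotations, not taken from the sequences discussed in this issue.
example = {
    "contig_mito": [
        "30S ribosomal protein S12",
        "50S ribosomal protein L2",
        "16S rRNA",
        "cytochrome c oxidase subunit 1",
    ],
    "contig_ncldv": [
        "major capsid protein",
        "DNA polymerase family B",
        "50S ribosomal protein L2",
    ],
}
flagged = flag_ribosomal_rich(example)
print(flagged)
```

Contigs flagged this way could then be dropped from the NCLDV hits or set aside for manual review. The threshold would need tuning on real data.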

jiarong commented 3 years ago

Right, NCLDV can have quite a few genes similar to those of their hosts, which makes the model more prone to false positives. The table you attached shows gene % from each group, not the scores. The score I mentioned is the "max_score" column in final-viral-score.tsv. That's where you can further screen based on viral gene %, hallmark genes, etc. In this case, you can remove the sequences with no viral genes to get high specificity. In the newest version, 2.1, there is also a --viral-gene-required option to do that on the command line.
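
For anyone who wants to do this screening after the run, a minimal Python sketch of the post-filter could look like the following. It assumes final-viral-score.tsv is tab-separated with a header row containing `seqname`, `max_score`, and a `viral` column holding the viral gene %. Check the column names against your own output. The `filter_by_viral_genes` helper and the `min_score=0.5` cutoff are hypothetical, chosen only for illustration.

```python
import csv
import io


def filter_by_viral_genes(tsv_text, min_viral=0.0, min_score=0.5):
    """Keep rows with viral gene % above min_viral and max_score >= min_score.

    Column names ('viral', 'max_score') are assumptions about the layout of
    final-viral-score.tsv; verify them against the header of your file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        if float(row["viral"]) > min_viral and float(row["max_score"]) >= min_score:
            kept.append(row)
    return kept


# Made-up rows for illustration, not the values from the attached table.
example = (
    "seqname\tmax_score\tmax_score_group\tviral\n"
    "mito_1||full\t0.97\tNCLDV\t0.0\n"
    "contig_7||full\t0.99\tNCLDV\t12.5\n"
)
kept = filter_by_viral_genes(example)
print([row["seqname"] for row in kept])
```

To run it on a real file, read the file into a string first, e.g. `filter_by_viral_genes(open("final-viral-score.tsv").read())`. A sequence with a high NCLDV score but 0% viral genes, like the mitochondria above, is dropped by the `viral` check.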