jiarong / VirSorter2

customizable pipeline to identify viral sequences from (meta)genomic data
GNU General Public License v2.0

--prep-for-dramv --viral-gene-enrich-off filtering sequences out #89

Closed: mlhoggard closed 3 years ago

mlhoggard commented 3 years ago

Hi there,

I'm looking to trial annotating putative viral contigs via DRAM-v and was wondering if there is a --prep-for-dramv setting whereby all filtering in VirSorter2 can be fully silenced (i.e. so that all input sequences are present in the output prep-for-dramv .fa and affi-contigs.tab files)?

I'm working with a few sets of putative viral contigs identified using multiple tools (including VirSorter2), and am looking to feed the full set back through VirSorter2 again, but this time only to generate the required files for annotation via DRAM-v. However, I've noticed that this second VirSorter2 step is still filtering some contigs out. With a set of DNA viruses, this is a small subset, but with a set of putative RNA viruses approximately half were filtered out.

I've attempted multiple variants of settings for the other parameters but with no luck, including:

virsorter run --seqname-suffix-off --viral-gene-enrich-off --provirus-off --prep-for-dramv --keep-original-seq --min-score 0 --min-length 0 --include-groups dsDNAphage,NCLDV,RNA,ssDNA,lavidaviridae ...

I'm unsure whether it's the --include-groups step that's causing the remaining filtering. The default is currently only two groups, but I noticed in the work-in-progress updates that you are aiming to implement an all option. Does simply listing all the groups still omit sequences that aren't assigned to any one of the groups? (I'm assuming the all option is intended to keep all sequences regardless of whether they are assigned to a group.)
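To see exactly which contigs were dropped, the input and output FASTA headers can be compared. The following is a minimal sketch, not part of VirSorter2: `fasta_ids`, `dropped_ids`, and the `normalize` hook are hypothetical helper names, and it assumes the output IDs either match the input IDs or can be mapped back to them by a user-supplied function (for example, if the tool appends a suffix to sequence names).

```python
def fasta_ids(lines):
    """Return the set of sequence IDs (first token after '>') in FASTA lines."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:  # skip malformed headers with no ID
                ids.add(fields[0])
    return ids


def dropped_ids(input_lines, output_lines, normalize=lambda s: s):
    """List input IDs that do not appear in the output.

    `normalize` is a hypothetical hook that maps an output ID back to the
    original input ID; by default it leaves names unchanged.
    """
    kept = {normalize(seq_id) for seq_id in fasta_ids(output_lines)}
    return sorted(seq_id for seq_id in fasta_ids(input_lines) if seq_id not in kept)
```

Both functions accept any iterable of lines, so open file handles for the input FASTA and the prep-for-dramv .fa can be passed directly, e.g. `dropped_ids(open("input.fa"), open("output.fa"))` with your own file paths.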

Kind regards, Mike.

jiarong commented 3 years ago

Mike, I think the reason is that VirSorter2 requires contigs to have at least two genes unless hallmark genes are detected. RNA viruses are typically shorter and encode polyproteins, and are thus more likely to fail the two-gene minimum requirement. The "all" option is just a shortcut for all groups, not an option to include all input sequences in the output.
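A rough way to check which contigs fall under the two-gene minimum is to count predicted genes per contig from a Prodigal protein FASTA. The sketch below is an illustration only, not VirSorter2's internal logic: `genes_per_contig` and `below_min_genes` are hypothetical names, it assumes Prodigal-style gene IDs of the form `<contig>_<index>`, and it does not model the hallmark-gene exception.

```python
from collections import Counter


def genes_per_contig(protein_fasta_lines):
    """Count predicted genes per contig from Prodigal-style protein FASTA lines.

    Assumes each gene ID is the contig name plus '_<gene index>'.
    """
    counts = Counter()
    for line in protein_fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        gene_id = fields[0]
        # Split on the last underscore so contig names containing '_' survive.
        contig, sep, _index = gene_id.rpartition("_")
        counts[contig if sep else gene_id] += 1
    return counts


def below_min_genes(contig_ids, counts, min_genes=2):
    """Return contigs with fewer than `min_genes` predicted genes.

    min_genes=2 mirrors the minimum described above; contigs with no
    predicted genes at all are counted as zero.
    """
    return sorted(c for c in contig_ids if counts.get(c, 0) < min_genes)
```

Contigs reported here would be candidates for the filtering described above, though a contig with a detected hallmark gene could still pass with a single gene.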

mlhoggard commented 3 years ago

Hi @jiarong,

Thanks for the quick reply. Ah ok, that makes sense then. In the case of polyproteins, would Prodigal call this as a single gene or overlook it entirely? If the former, would there be any value in allowing an option to reduce the minimum gene count to one, or does VirSorter2 functionally require at least two genes, rather than this simply being a preference for virus detection?

Thanks again, Mike.

jiarong commented 3 years ago

Prodigal can call polyproteins fairly well in my experience, although I have also seen cases where the predicted genes look off, usually with too many short genes, likely due to non-canonical translation mechanisms. VirSorter2 relies on a few key genomic features that require at least two genes. For short contigs, the extra AMG-related info from DRAM-v.py is not meaningful or reliable. I would just run DRAM.py or other annotation tools.

mlhoggard commented 3 years ago

Thanks @jiarong. Much appreciated for all the info (and all the work with VirSorter2 in general).