Virome decontamination setting on non-viral metagenomes

Binvir commented 5 years ago

Hi Simon,

Firstly, thank you for this tool - it has been of great use in a viral identification pipeline. I had a question about the virome decontamination setting for the CyVerse implementation of VirSorter. I ran a metagenome generated from a sample, collected on a 0.1μm filter, that was not treated or processed in any way to specifically enrich for viruses. Out of curiosity, I ran VirSorter on this metagenome with and without the virome decontamination function. In the case of the former, cat-1 and cat-2 predictions were more than double those obtained if the virome decontamination setting was not applied. Could you please comment as to what may be going on here? I am not sure what approach is best for my data.

Thank you for your time!

simroux commented 5 years ago

Hi! This is something we've seen in the past: if your dataset includes a large share of viruses (~30% or more), then "virome decontamination" works better than "regular" mode. This is the case for viromes, but also typically for small size fractions (see e.g. https://www.nature.com/articles/s41564-018-0225-4). So the short answer is: the "virome decontamination" mode is probably best for your data.

Best, Simon

Binvir commented 5 years ago

Hi Simon,

Thank you for the informative and punctual reply - I will proceed with using the virome decontamination setting for my future VirSorter runs.

jarrodscott commented 4 years ago

Hi @simroux and @Binvir

I realize this issue is closed, but I had a related question that seems to fit best here. I have two non-viral metagenome assemblies that are part of the same study. After a normal VirSorter run, I get this message for only one of the assemblies:

More than 25% of the bp in contigs >= 10kb were predicted as viral (estimated ratio: 28.77%)...
You may want to use the virome decontamination mode on this dataset, as it seems to have lot of viruses

My question is whether it is better to use virome decontamination (--virome) on both datasets for consistency, or to use it only on the one high in putative virus contigs?
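For readers who want to check the ratio in that warning on their own assembly, here is a minimal sketch. It assumes the ratio is simply the bp in contigs >= 10 kb that were predicted viral, divided by all bp in contigs >= 10 kb; the thread does not spell out how VirSorter computes it (e.g. whether only the predicted viral region of a contig counts), so treat this as an approximation. The function name and the toy contig IDs are hypothetical.

```python
def viral_bp_ratio(contig_lengths, viral_ids, min_len=10_000):
    """Percent of bp in contigs >= min_len that belong to contigs predicted viral.

    contig_lengths: dict mapping contig ID -> length in bp
    viral_ids: set of contig IDs predicted as viral
    (Hypothetical helper, not part of VirSorter.)
    """
    total = sum(n for n in contig_lengths.values() if n >= min_len)
    viral = sum(n for cid, n in contig_lengths.items()
                if n >= min_len and cid in viral_ids)
    return 100.0 * viral / total if total else 0.0


# Toy data: c4 is shorter than 10 kb, so it is ignored entirely.
lengths = {"c1": 40_000, "c2": 30_000, "c3": 30_000, "c4": 5_000}
predicted_viral = {"c2", "c4"}

ratio = viral_bp_ratio(lengths, predicted_viral)
print(f"estimated ratio: {ratio:.2f}%")
if ratio > 25:
    print("above the 25% threshold quoted in the warning")
```

With the toy numbers, 30,000 of 100,000 bp are in predicted viral contigs, which is above the 25% threshold quoted in the message.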

simroux commented 4 years ago

Hi Jarrod,

That is a good question. The "warning" message is a recent addition, so I'm reopening the issue as it may be useful to a number of folks.

My recommendation here would be to re-run VirSorter in virome decontamination mode for this one assembly, and compare the results to the "regular" mode. By comparing the results, I mean counting how many additional viral contigs are identified, and manually inspecting a few of them. If these look like genuine viruses, then I would recommend running everything as "virome decontamination", for consistency.

A quick background on this: regular VirSorter estimates a number of parameters (ratio of genes with PFAM hits, ratio of genes with viral hits, average gene size, etc.) from the data themselves. The original idea was that it would be more accurate to estimate these from the whole (mostly microbial) dataset, and then look for sequences that look "more viral than the average". In some cases, e.g. if a dataset has a substantial portion of viruses, this approach fails as the "average" is already viral. So the "virome decontamination" mode uses pre-computed parameters from bacterial/archaeal genomes in RefSeq instead of estimates from the data. In the end, this virome decontamination mode tends to work with all kinds of datasets (microbial or viral), but it is sensitive to "unusual" microbial genomes (i.e. those not well captured by RefSeq genomes).
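To make the effect of the background concrete, here is a small sketch using a one-sided binomial test as a stand-in. The thread does not describe the exact statistic VirSorter uses, so the test, the gene counts, and both background ratios below are assumptions chosen only to illustrate why a viral-rich "average" hides viral contigs.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits by luck alone."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Hypothetical contig: 20 genes, 10 of them with a viral hit.
n_genes, viral_hits = 20, 10

# Hypothetical background ratios of genes with viral hits.
background_from_data = 0.40  # "regular" mode on a dataset already rich in viruses
background_reference = 0.05  # precomputed from microbial reference genomes

p_regular = binom_tail(viral_hits, n_genes, background_from_data)
p_virome = binom_tail(viral_hits, n_genes, background_reference)

print(f"regular-style background: p = {p_regular:.3g}")
print(f"reference background:     p = {p_virome:.3g}")
```

The same contig is unremarkable against the viral-rich background but strongly enriched against the microbial reference, which is the behaviour described above.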

jarrodscott commented 4 years ago

Hi Simon, Excellent, good idea. As a preemptive measure, I ran both assemblies using regular and --virome :-) and the analyses just finished earlier today. For both assemblies the --virome results have ~5x the number of entries in the VIRSorter_global-phage-signal file as the regular settings. I am looking through the data now. I will provide a summary here when I am finished...

for the record, here are the commands I ran for each assembly...

wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER/WA --data-dir virsorter-data --diamond
wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER_VIROME/WA --data-dir virsorter-data --diamond --virome

jarrodscott commented 4 years ago

Hi Simon

Here are some details. Maybe too many, maybe too few :)

The two datasets:

EP: 753612 contigs
WA: 574305 contigs. This was the assembly VirSorter flagged for decontamination.

For the EP dataset, only 12 hits were detected in the regular and not in the virome. Similarly in the WA dataset, only 9 hits were detected in the regular and not in the virome. So, --virome seems to pick up everything regular does. When the two methods overlap, they appear to assign the same category. I eyeballed this, so not quantitative.
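For anyone wanting to make this comparison quantitative rather than eyeballed, a sketch along these lines may help. It assumes you have already parsed each run into a mapping of contig ID to category; how you build those mappings from the VirSorter output is left out, and the function name and toy IDs are hypothetical.

```python
def compare_modes(regular, virome):
    """Compare two runs, each given as a dict of contig ID -> category."""
    reg_ids, vir_ids = set(regular), set(virome)
    shared = reg_ids & vir_ids
    return {
        "regular_only": sorted(reg_ids - vir_ids),
        "virome_only": sorted(vir_ids - reg_ids),
        "same_category": sum(regular[c] == virome[c] for c in shared),
        "changed_category": sorted(c for c in shared if regular[c] != virome[c]),
    }


# Toy data only; real inputs would come from the two VirSorter runs.
regular_run = {"c1": 1, "c2": 2, "c3": 5}
virome_run = {"c1": 1, "c2": 1, "c4": 2, "c5": 3}

summary = compare_modes(regular_run, virome_run)
for key, value in summary.items():
    print(key, value)
```

The "changed_category" list is the part that replaces the visual check: it names every contig called in both modes but assigned a different category.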

The one thing I am not sure about is the "best way" to test the hits. Suggestions? Straight up blastp against nr or RefSeq? I tested a handful of genes from a handful of contigs and basically most hits are low percent identity and hypothetical proteins from an assortment of taxa. Some high percent phage hits. Definitely not seeing anything that screams microbe and not virus.

This table breaks down the number of hits by treatment for each category from each dataset.

category	EP-REGULAR	EP-VIROME	EP_per_change	WA-REGULAR	WA-VIROME	WA_per_change
1_Complete_phage_contigs_cat_1_sure	1735	5779	233.08	2163	6128	183.31
2_Complete_phage_contigs_cat_2_some_what_sure	9869	41561	321.13	8956	37571	319.51
3_Complete_phage_contigs_cat_3_not_so_sure	285	3596	1161.75	145	4238	2822.76
4_Prophages_cat_1_sure	8	2	-75.00	70	0	-100.00
5_Prophages_cat_2_some_what_sure	54	25	-53.70	244	23	-90.57
6_Prophages_cat_3_not_so_sure	4	19	375.00	4	32	700.00
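The per_change values in the table are consistent with the percent change from regular to virome mode, i.e. (virome - regular) / regular × 100, rounded to two decimals. A short sketch that reproduces a few of the rows (the function name is hypothetical; the counts are copied from the table):

```python
def per_change(regular, virome):
    """Percent change in hit count going from regular to virome mode."""
    return round(100.0 * (virome - regular) / regular, 2)


# (category, dataset, regular count, virome count), taken from the table above.
rows = [
    ("cat_1 complete", "EP", 1735, 5779),
    ("cat_1 complete", "WA", 2163, 6128),
    ("cat_1 prophage", "EP", 8, 2),
    ("cat_1 prophage", "WA", 70, 0),
]

for category, dataset, regular, virome in rows:
    print(category, dataset, per_change(regular, virome))
```

Note that the formula is undefined when the regular count is 0, which does not occur in this table.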

simroux commented 4 years ago

Hi Jarrod, Thanks for the additional information :-) Here is how I typically look at these cases:

The file "VIRSorter_affi-contigs.tab" in the folder "Metric_files" has the gene-by-gene annotation that VirSorter uses to make its calls. Columns are: Gene ID, Start, Stop, Length, Strand, Hit in phage cluster database, score, e-value, Phage cluster category, Hit in pfam, score, e-value. You should be able to "grep" individual contigs from this file (adding "-gene" at the end of the contig name if needed). I like to look at this file because these are the exact data VirSorter looked at to calculate enrichment / depletion statistics, and so if a contig is almost entirely unknown, with e.g. only ~10% PFAM, but was not considered as "depleted in PFAM hits" in the regular mode, I know that I should use the virome decontamination mode.

In your case, based on the results you've seen here for category 1 sequences, it seems like you should use the virome decontamination mode. Basically, VirSorter uses 2 types of metrics: viral hallmark genes (which work the same in regular vs virome mode since it's simple presence/absence) and enrichment/depletion stats (which will be different between regular and virome modes). For a contig to be category 1, it needs to be significant in some enrichment stat + have ≥1 hallmark gene. The fact that you find "new" category 1 contigs means that there were sequences with a hallmark gene (so most are likely viral) that were not considered enriched in viral genes or depleted in PFAM in regular mode, but are considered enriched/depleted in virome mode. That suggests to me that the background stats computed from the whole dataset were too "stringent", i.e. the overall percentage of phage cluster & PFAM affiliation was too similar to a "normal" virus genome, and as a result there was very little significant enrichment/depletion.
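As a rough alternative to grepping by hand, the sketch below tallies the fraction of genes with a phage cluster hit and with a PFAM hit for one contig, using the column order listed above. Several details are assumptions that should be checked against your own file: that fields are separated by "|", that each contig starts with a header line beginning with ">", and that "-" marks a missing hit. The function name and the toy records are hypothetical.

```python
def annotation_fractions(lines, contig, sep="|", no_hit="-"):
    """Fraction of genes with phage cluster / PFAM hits for one contig.

    Assumed layout (verify against your file): a '>' header line per contig,
    then one line per gene with the 12 columns listed in the thread.
    """
    genes = phage = pfam = 0
    in_contig = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Header line: the contig name is taken as the first field.
            in_contig = line[1:].split(sep)[0] == contig
            continue
        fields = line.split(sep)
        if not in_contig or len(fields) < 10:
            continue
        genes += 1
        phage += fields[5] != no_hit  # column 6: hit in phage cluster database
        pfam += fields[9] != no_hit   # column 10: hit in PFAM
    if genes == 0:
        return None
    return {"genes": genes, "phage_frac": phage / genes, "pfam_frac": pfam / genes}


# Toy records in the assumed layout; values are made up.
toy = [
    ">contig_1|3|l",
    "contig_1-gene_1|1|900|900|+|Phage_cluster_12|85.0|1e-20|1|-|-|-",
    "contig_1-gene_2|950|1500|551|-|-|-|-|-|PF00001|40.2|1e-08",
    "contig_1-gene_3|1600|2400|801|+|-|-|-|-|-|-|-",
    ">contig_2|1|l",
    "contig_2-gene_1|1|300|300|+|-|-|-|-|PF00002|33.0|1e-05",
]

stats = annotation_fractions(toy, "contig_1")
print(stats)
```

On a real run, the lines would come from opening VIRSorter_affi-contigs.tab; a contig with a very low pfam_frac that was not flagged as depleted in regular mode is the kind of case described above.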

Hopefully this is somewhat clear, please let me know if you have any questions or if the data don't seem to match my assumptions :-) Btw, are these metagenomes anything special, e.g. samples filtered in a specific way?

jarrodscott commented 4 years ago

Hi Simon

Great explanation! I will look at those data today. And the EP assembly has a lot of viral hits, but it looks like it was a little below VirSorter's warning threshold. So I agree that decontamination for both sets is looking like the best way forward.

These are near-shore marine samples, collected at 3-5 m depth and filtered through 0.22 micron filters. Even Kaiju classification on the contigs and Kraken on the short reads (for each sample) show a relatively high percentage of viral hits, like 15% of total datasets. I honestly don’t have enough experience with these kinds of samples to know if this is “normal” or not.

simroux commented 4 years ago

Hi Jarrod, "filtered through 0.22 micron filters" is the key thing here, and the results make a lot of sense. Metagenomes from cells collected on 0.22 micron filters look like "regular" metagenomes, but metagenomes from the filtrate below a 0.22 micron filter typically have tons of viruses and look more like a viral metagenome, even if no other steps are taken to enrich for viral particles. We've seen this in the past in e.g. the rumen microbiome (https://www.nature.com/articles/s41564-018-0225-4), and we also had to use the virome decontamination mode on these size fractions :-)

Best, Simon

simroux / VirSorter

Virome decontamination setting on non-viral metagenomes #40