Open Binvir opened 5 years ago
Hi ! So this is something we've seen in the past: if your dataset is composed mostly of viruses (~ 30% or more), then "virome decontamination" works better than "regular" mode. This is the case for viromes, but also typically for small size fraction (see e.g. https://www.nature.com/articles/s41564-018-0225-4). So the short answer is: the "virome decontamination" is probably best for your data.
Best, Simon
Hi Simon,
Thank you for the informative and punctual reply - I will proceed with using the virome decontamination setting for my future VirSorter runs.
Hi @simroux and @Binvir
I realize this issue is closed, but I have a related question that seems to fit best here. I have two non-viral metagenome assemblies that are part of the same study. After a normal VirSorter run I get this message for only one of the assemblies:
More than 25% of the bp in contigs >= 10kb were predicted as viral (estimated ratio: 28.77%)...
You may want to use the virome decontamination mode on this dataset, as it seems to have lot of viruses
My question is whether it is better to use `--virome` decontamination on both datasets for consistency or use it only on the one high in putative virus contigs?
Hi Jarrod,
That is a good question. The "warning" message is a recent addition, so I'm reopening the issue as it may be useful to a number of folks.
My recommendation here would be to re-run VirSorter in virome decontamination mode for this one assembly, and compare the results to the "regular" mode. By comparing the results, I mean counting how many additional viral contigs are identified, and manually inspecting a few of them. If these look like genuine viruses, then I would recommend running everything as "virome decontamination", for consistency.
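The counting part of that comparison can be sketched in a few lines of Python. This is only an illustration, and it assumes that header and category lines in the `VIRSorter_global-phage-signal` file start with `#` and that data lines are comma-separated with the contig ID in the first field; the contig names below are made up.

```python
def read_contig_ids(lines):
    """Collect contig IDs from the lines of a global-phage-signal file.

    Assumption: lines starting with '#' are headers, and data lines are
    comma-separated with the contig ID in the first field.
    """
    ids = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ids.add(line.split(",")[0])
    return ids


def compare_runs(regular_lines, virome_lines):
    """Return the contigs shared by both runs and those unique to each."""
    regular = read_contig_ids(regular_lines)
    virome = read_contig_ids(virome_lines)
    return {
        "shared": regular & virome,
        "regular_only": regular - virome,
        "virome_only": virome - regular,
    }


# Toy input standing in for the two output files (hypothetical contents).
regular_demo = [
    "## 1 - Complete phage contigs - category 1 (sure)",
    "contig_1,20,contig_1",
    "contig_2,15,contig_2",
]
virome_demo = [
    "## 1 - Complete phage contigs - category 1 (sure)",
    "contig_1,20,contig_1",
    "contig_3,12,contig_3",
    "contig_4,31,contig_4",
]

result = compare_runs(regular_demo, virome_demo)
for name, ids in result.items():
    print(name, len(ids), sorted(ids))
```

With real output, the two lists would be replaced by open file handles for the regular and virome runs, and the `virome_only` set gives the contigs worth inspecting manually.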
A quick background on this: regular VirSorter estimates a number of parameters (ratio of genes with PFAM hits, ratio of genes with viral hits, average gene size, etc.) from the data themselves. The original idea was that it would be more accurate to estimate these from the whole (mostly microbial) dataset, and then look for sequences that look "more viral than the average". In some cases, e.g. if a dataset has a substantial portion of viruses, this approach fails as the "average" is already viral. So the "virome decontamination" mode uses pre-computed parameters from bacterial/archaeal genomes in RefSeq instead of estimates from the data. Overall, this virome decontamination mode tends to work with all kinds of datasets (microbial or viral), but is sensitive to "unusual" microbial genomes (i.e. not well captured by RefSeq genomes).
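To make the "average is already viral" problem concrete, here is a toy calculation. It is not VirSorter's actual statistic: it assumes a plain one-sided binomial test, and the contig size and both background ratios are hypothetical numbers chosen for illustration.

```python
from math import comb


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more hits at rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical contig: 40 genes, 16 of them with a viral hit (40%).
n_genes = 40
viral_hits = 16

# Hypothetical background ratios of genes with viral hits.
microbial_background = 0.05   # e.g. pre-computed from microbial genomes
viral_rich_background = 0.30  # e.g. estimated from a virus-rich dataset

p_micro = binom_sf(viral_hits, n_genes, microbial_background)
p_viral = binom_sf(viral_hits, n_genes, viral_rich_background)

print(f"p-value vs microbial background:  {p_micro:.3g}")
print(f"p-value vs viral-rich background: {p_viral:.3g}")
```

Against the low microbial background the contig is strongly enriched in viral hits, while against the viral-rich background the same contig no longer stands out, which is the failure mode described above.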
Hi Simon,
Excellent, good idea.
As a preemptive measure, I ran both assemblies using regular and `--virome` :-) and the analyses just finished earlier today. For both assemblies the `--virome` results have ~5x the number of entries in the `VIRSorter_global-phage-signal` file as the regular settings. I am looking through the data now. I will provide a summary here when I am finished...
for the record, here are the commands I ran for each assembly...
wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER/WA --data-dir virsorter-data --diamond
wrapper_phage_contigs_sorter_iPlant.pl -f WA-contigs.fa --ncpu 25 --db 2 --wdir VIRSORTER_VIROME/WA --data-dir virsorter-data --diamond --virome
Hi Simon
Here are some details. Maybe too many, maybe too few :)
The two datasets:
EP: 753612 contigs
WA: 574305 contigs. This was the assembly VirSorter flagged for decontamination.
For the EP dataset, only 12 hits were detected in the `regular` and not in the `virome`. Similarly in the WA dataset, only 9 hits were detected in the `regular` and not in the `virome`. So, `--virome` seems to pick up everything `regular` does. When the two methods overlap, they appear to assign the same category. I eyeballed this, so not quantitative.
The one thing I am not sure about is the "best way" to test the hits. Suggestions? Straight up blastp against `nr` or `RefSeq`? I tested a handful of genes from a handful of contigs and basically most hits are low percent identity and hypothetical proteins from an assortment of taxa. Some high percent phage hits. Definitely not seeing anything that screams microbe and not virus.
This table breaks down the number of hits by treatment for each category in each dataset.
category | EP-REGULAR | EP-VIROME | per_change | WA-REGULAR | WA-VIROME | per_change |
---|---|---|---|---|---|---|
1_Complete_phage_contigs_cat_1_sure | 1735 | 5779 | 233.08 | 2163 | 6128 | 183.31 |
2_Complete_phage_contigs_cat_2_some_what_sure | 9869 | 41561 | 321.13 | 8956 | 37571 | 319.51 |
3_Complete_phage_contigs_cat_3_not_so_sure | 285 | 3596 | 1161.75 | 145 | 4238 | 2822.76 |
4_Prophages_cat_1_sure | 8 | 2 | -75.00 | 70 | 0 | -100.00 |
5_Prophages_cat_2_some_what_sure | 54 | 25 | -53.70 | 244 | 23 | -90.57 |
6_Prophages_cat_3_not_so_sure | 4 | 19 | 375.00 | 4 | 32 | 700.00 |
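For reference, the `per_change` column is consistent with a percent change relative to the regular run, (virome - regular) / regular × 100. A minimal sketch that recomputes a few rows from the counts in the table (the function name is just for illustration):

```python
def percent_change(regular, virome):
    """Percent change of the virome-mode count relative to the regular-mode count."""
    return round((virome - regular) / regular * 100, 2)


# (regular, virome) counts copied from the table above.
rows = {
    "EP cat 1 complete": (1735, 5779),
    "WA cat 1 complete": (2163, 6128),
    "WA cat 3 complete": (145, 4238),
    "WA cat 1 prophage": (70, 0),
}

for label, (regular, virome) in rows.items():
    print(label, percent_change(regular, virome))
```

Note that a 233% increase corresponds to roughly 3.3 times as many contigs (5779 / 1735), not 2.3 times.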
Hi Jarrod, Thanks for the additional information :-) Here is how I typically look at these cases:
The file "VIRSorter_affi-contigs.tab" in the folder "Metric_files" has the gene-by-gene annotation that VirSorter uses to make its calls. Columns are: Gene ID, Start, Stop, Length, Strand, Hit in phage cluster database, score, e-value, Phage cluster category, Hit in pfam, score, e-value. You should be able to "grep" individual contigs from this file (adding "-gene" at the end of the contig name if needed). I like to look at this file because these are the exact data VirSorter looked at to calculate enrichment / depletion statistics, and so if a contig is almost entirely unknown, with e.g. only ~ 10% PFAM, but was not considered as "depleted in PFAM hits" in the regular mode, I know that I should use the virome decontamination mode.
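As an alternative to grepping contigs one at a time, the per-contig ratios can be tallied with a short script. This is a sketch under assumptions not stated above: that each contig starts with a `>` header line, that gene lines are `|`-delimited in the column order just listed, and that a missing hit is written as `-`. The demo lines are made up.

```python
def hit_fractions(lines, sep="|", no_hit="-"):
    """Per contig, return (fraction of genes with a phage cluster hit,
    fraction of genes with a PFAM hit).

    Assumptions: '>' starts a contig header line; gene lines have 12 fields
    in the documented order, so field 5 is the phage cluster hit and
    field 9 is the PFAM hit; `no_hit` marks an absent hit.
    """
    counts = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split(sep)[0]
            counts[current] = {"genes": 0, "phage": 0, "pfam": 0}
            continue
        fields = line.split(sep)
        if current is None or len(fields) < 12:
            continue  # skip anything that does not look like a gene line
        c = counts[current]
        c["genes"] += 1
        c["phage"] += int(fields[5] != no_hit)
        c["pfam"] += int(fields[9] != no_hit)
    return {
        name: (c["phage"] / c["genes"], c["pfam"] / c["genes"])
        for name, c in counts.items()
        if c["genes"] > 0
    }


# Hypothetical example: one contig with 4 genes.
demo = [
    ">contig_1|4|0",
    "contig_1-gene_1|1|300|300|+|Phage_cluster_12|80|1e-20|1|-|-|-",
    "contig_1-gene_2|350|900|551|+|Phage_cluster_7|65|1e-15|0|-|-|-",
    "contig_1-gene_3|950|1400|451|-|-|-|-|-|PF00001|50|1e-10",
    "contig_1-gene_4|1500|2100|601|+|Phage_cluster_3|90|1e-25|1|-|-|-",
]

fractions = hit_fractions(demo)
print(fractions)
```

A contig with a very low PFAM fraction that was still not flagged as "depleted in PFAM hits" in regular mode is the kind of case described above.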
In your case, based on the results you've seen here for category 1 sequences, it seems like you should use the virome decontamination mode. Basically, VirSorter uses 2 types of metrics: viral hallmark genes (which work the same in regular vs virome mode since it's simple presence/absence) and enrichment/depletion stats (which will be different between regular and virome modes). For a contig to be category 1, it needs to be significant in some enrichment stat + have ≥1 hallmark gene. The fact that you find "new" category 1 contigs means that there were sequences with a hallmark gene (so most are likely viral) that were not considered enriched in viral genes or depleted in PFAM in regular mode, but are considered enriched/depleted in virome mode. That suggests to me that the background stats computed from the whole dataset were too "stringent", i.e. the overall percentage of phage cluster & PFAM affiliation was too similar to a "normal" virus genome, so in the end there was very little significant enrichment/depletion.
Hopefully this is somewhat clear, please let me know if you have any question or if the data doesn't seem to match my assumptions :-) Btw, are these metagenomes anything special, e.g. samples filtered in a specific way ?
Hi Simon
Great explanation! I will look at those data today. And the EP assembly has a lot of viral hits, but it looks like it was a little below VirSorter's warning threshold. So I agree that decontamination for both sets is looking like the best way forward.
These are near shore marine samples, collected at 3-5m deep and filtered through 0.22 micro filters. Even Kaiju classification on the contigs and Kraken on the short reads (for each sample) shows a relatively high percentage of viral hits, like 15% of total datasets. I honestly don’t have enough experience with these kinds of samples to know if this is “normal” or not.
Hi Jarrod, "filtered through 0.22 micro filters" is the key thing here, and the results make a lot of sense. Metagenomes from cells collected on 0.22 micro filters look like "regular" metagenomes, but metagenomes from the filtrate below a 0.22 micro filter typically have tons of viruses and look more like a viral metagenome, even if no other steps are taken to enrich in viral particles. We've seen this in the past in e.g. rumen microbiome (https://www.nature.com/articles/s41564-018-0225-4), and we had to also use the virome decontamination mode on these size fractions :-)
Best, Simon
Hi Simon,
Firstly, thank you for this tool - it has been of great use in a viral identification pipeline. I had a question about the virome decontamination setting for the CyVerse implementation of VirSorter. I ran a metagenome generated from a sample, collected on a 0.1μm filter, that was not treated or processed in any way to specifically enrich for viruses. Out of curiosity, I ran VirSorter on this metagenome with and without the virome decontamination function. In the case of the former, cat-1 and cat-2 predictions were more than double those obtained if the virome decontamination setting was not applied. Could you please comment as to what may be going on here? I am not sure what approach is best for my data.
Thank you for your time!