replikation / What_the_Phage

WtP: Phage identification via nextflow and docker or singularity
https://mult1fractal.github.io/wtp-documentation/
GNU General Public License v3.0

VirSorter filtering #15

Closed hoelzer closed 4 years ago

hoelzer commented 4 years ago

I saw that you filter the VirSorter output to only collect phages from the files

cat !{results}/Predicted_viral_sequences/VIRSorter_cat-[1,2].fasta | grep ">" | sed -e s/\\>VIRSorter_//g | sed -e s/-cat_1//g |\
  sed -e s/-cat_2//g  > virsorter.txt

Why not also include the cat-3 phages? To reduce the number of false-positive hits? However, for a reasonably fair comparison with the other tools, it might be worth also including the cat-3 phages identified by VirSorter.

And are you not interested in prophages at all?
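To make the category question concrete, here is a minimal Python sketch of the same header clean-up with the kept categories as a parameter. The header layout (`>VIRSorter_<contig>-cat_<N>`) is only inferred from the sed calls above, and `contigs_by_category` is a hypothetical helper, not part of WtP.

```python
import re

# Assumed header layout, inferred from the sed calls in the thread:
#   >VIRSorter_<contig name>-cat_<N>
HEADER = re.compile(r"^>VIRSorter_(?P<contig>.+)-cat_(?P<cat>\d+)$")


def contigs_by_category(header_lines, keep=(1, 2)):
    """Return contig names whose VirSorter category is in `keep`."""
    names = []
    for line in header_lines:
        match = HEADER.match(line.strip())
        if match and int(match.group("cat")) in keep:
            names.append(match.group("contig"))
    return names


# Made-up headers for illustration only.
headers = [
    ">VIRSorter_contig_7-cat_1",
    ">VIRSorter_contig_12-cat_2",
    ">VIRSorter_contig_30-cat_3",
]
strict = contigs_by_category(headers)                   # categories 1 and 2 only
relaxed = contigs_by_category(headers, keep=(1, 2, 3))  # also category 3
```

Switching between the strict and the relaxed filter is then a one-argument change, which might make it easier to compare both settings against the other tools.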

replikation commented 4 years ago

@Stormrider935 please read up on what to include ;)

replikation commented 4 years ago

The current status is (I think) that filtering between these tools is still a work in progress. So if you have additional insights into what to include, feel free to add them ;)

mult1fractal commented 4 years ago

Yes, I will check later. The first versions of all filter scripts were included to generate some output for replikation's heatmap.

mult1fractal commented 4 years ago

Overview of phage tool parameters

| Tool | Criterion | Hit | Possible hit | Meh |
|---|---|---|---|---|
| MARVEL | % | 100-75 | 74.9-50 | 49.9-0 |
| VirFinder | score (the higher the score, the higher the possibility for a phage hit). A score of 1 represents perfect identification of all true viral contigs with no false positives, and a score of 0.5 represents a random classification | 0.999-0.75 | 0.749-0.5 | 0.499-0.0 |
| PPR-Meta | phage | phage | | |
| VirSorter | sure and somewhat sure (filters for prophage and complete phage contig) | sure, pvalue under 0.01 | somewhat sure | not so sure |
| MetaPhinder | classification: Phage | Phage | - | negative |
| DeepVirFinder | the higher the score, the higher the possibility for a phage hit | 0.999-0.75 | 0.749-0.5 | 0.499-0.0 |

include hit and possible hit
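As a rough illustration of the score bins in the overview, the following Python sketch maps a 0 to 1 score (as reported for VirFinder and DeepVirFinder) to the three labels and keeps "hit" and "possible hit". The cut-offs 0.75 and 0.5 come from the overview; the function names are hypothetical, and a percentage such as MARVEL's would first have to be divided by 100.

```python
def classify_score(score, hit_min=0.75, possible_min=0.5):
    """Map a 0..1 score to 'hit', 'possible hit' or 'meh'."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score >= hit_min:
        return "hit"
    if score >= possible_min:
        return "possible hit"
    return "meh"


def keep_contig(score):
    """Keep 'hit' and 'possible hit', drop 'meh'."""
    return classify_score(score) != "meh"


# Made-up scores for illustration only.
scores = {"contig_1": 0.93, "contig_2": 0.61, "contig_3": 0.12}
kept = sorted(name for name, s in scores.items() if keep_contig(s))
```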

MARVEL

quote from paper

A contig was considered viral if predicted in categories I and II for VirSorter, and if the q-value was less than or equal to 0.01 for VirFinder

Virfinder

quote from paper

A score of 1 represents perfect identification of all true viral contigs with no false positives, and a score of 0.5 represents a random classification.

PPR-Meta

quote from paper

difference between the phage score and chromosome score reveals the lifestyle of the phages (virulent or temperate), while the difference between the plasmid score and chromosome score reveals the transmissibility of the plasmids (transmissible or non-transmissible). the category with the highest score is selected as our prediction.

Unsure whether the phage score corresponds to the p-value.

Virsorter

quote from paper

3 parameters: sure, somewhat sure, not so sure

For prophages: maybe generate a special output for tools that recognize prophages

deepvirfinder

quote from paper

We used deep learning techniques and developed a powerful framework for predicting viral sequences. Given a query sequence, the framework gives a score between 0 and 1, and the larger score indicates the higher possibility of being a viral sequence.

Metaphinder

replikation commented 4 years ago

@hoelzer we were thinking about changing the heatmap to a two-color coding that shows "certain" and "uncertain" hits.

So the information is not completely lost.

hoelzer commented 4 years ago

@Stormrider935 cool, thanks for the summary! Looks reasonable to me. And I think it is important to tell users that they might see only a subset of the full output of all tools.

Changing the heat map to show "certain"/"uncertain" hits also sounds reasonable.

However, by thinking about the visualization of the results I remembered a cool plot I saw yesterday during a data visualization talk here at EBI (https://www.ebi.ac.uk/training/online/course/data-visualisation-101-practical-introduction-designing-scientific-figures)

[Figure: UpSetR example plot]

Code: https://github.com/hms-dbmi/UpSetR

Couldn't this be a nice (additional) visualisation? I mean in the end you want to know how many phages were discovered by which tools and what the overlap is. Are there phages discovered by all tools? Are there single tools that discovered phages no other tool has? From such a plot this would be easy to see. Instead of gene names, as shown in the plot above, you would write the tool names.

replikation commented 4 years ago

the figure is nice, but how and what data should be included for this? i mean what should the barplot represent?

replikation commented 4 years ago

but yeah we could plot p values and such.

hoelzer commented 4 years ago

I thought that you can throw in a FASTA file and then you get phage predictions with different tools. Let's say

VirSorter: 20 phages
VirFinder: 12 phages
MARVEL: 30 phages

These are then the blue bars in the figure.

Then all overlaps (venn-diagram-style) are calculated, let's say:

VirSorter-VirFinder: 10
VirSorter-MARVEL: 18
VirFinder-MARVEL: 8
VirSorter-VirFinder-MARVEL: 6

And these could then be the black bar plots.
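The counting described here can be sketched in a few lines of Python. This is only an illustration with made-up contig names: `overlap_counts` is a hypothetical helper, and it counts plain set intersections (venn-diagram-style, as in the comment), which is not necessarily how a plotting library groups the contigs.

```python
from itertools import combinations


def overlap_counts(predictions):
    """Count contigs per tool and per tool combination (plain set intersections)."""
    counts = {}
    tools = sorted(predictions)
    for size in range(1, len(tools) + 1):
        for combo in combinations(tools, size):
            shared = set.intersection(*(predictions[t] for t in combo))
            counts[combo] = len(shared)
    return counts


# Made-up predictions: tool name -> set of contig names called phage.
predictions = {
    "VirSorter": {"c1", "c2", "c3", "c4"},
    "VirFinder": {"c2", "c3", "c5"},
    "MARVEL": {"c3", "c4", "c5", "c6"},
}
counts = overlap_counts(predictions)
```

The single-tool entries of `counts` correspond to the per-tool totals, and the multi-tool entries to the overlaps.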

replikation commented 4 years ago

ah okay now i get it :) yeah with this plot it will be badass.

hoelzer commented 4 years ago

> ah okay now i get it :) yeah with this plot it will be badass.

yeap, totally badass :D it's a super nice visualization alternative to venn diagrams (which one should not use with more than 3 sets anyway)

hoelzer commented 4 years ago

I think one open question might be how to define if two tools find the same phage. But maybe it's simple enough to check

tool A found contig X to be a phage
tool B found contig X to be a phage

ok, both tools found a phage
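Under that assumption (two tools agree when they report exactly the same contig name), a contig-by-tool membership table is easy to build. The Python sketch below is illustrative only; `membership_rows` and the example names are hypothetical.

```python
def membership_rows(predictions):
    """One row per contig: name followed by 1/0 for each tool (tools sorted by name)."""
    tools = sorted(predictions)
    contigs = sorted(set().union(*predictions.values()))
    header = ["contig"] + tools
    rows = [[c] + [int(c in predictions[t]) for t in tools] for c in contigs]
    return header, rows


# Made-up predictions: tool name -> set of contig names called phage.
predictions = {
    "tool_A": {"contig_X", "contig_Y"},
    "tool_B": {"contig_X"},
}
header, rows = membership_rows(predictions)

# Contigs that every tool called a phage.
both = [row[0] for row in rows if all(row[1:])]
```

Such a 0/1 table keeps the contig names available even if the plot itself only shows counts.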

replikation commented 4 years ago

When I have time I'll try to implement something, and then we can take a look :)

replikation commented 4 years ago

my plan would be like this (x is a hit), not sure where to put the contig name.

                 x
                 xx
                 ____
    x metaphinderXO
      PPRmeta    OO
   xx Virsorter  XX

hoelzer commented 4 years ago

> my plan would be like this (x is a hit), not sure where to put the contig name.
>
>                  x
>                  xx
>                  ____
>     x metaphinderXO
>       PPRmeta    OO
>    xx Virsorter  XX

yeah, exactly what I thought. Unfortunately, I think the contig name cannot be included in this visual... it would then be more of a general overview of which tools find the most (common) phages

hoelzer commented 4 years ago

I added some example code for UpSetR here:

https://github.com/hoelzer/visual

maybe helpful if you want to add this to WtP at some point.

hoelzer commented 4 years ago

I fixed the docker for upsetr, it's working now and implemented in branch upsetr. Please feel free to test @replikation @Stormrider935 and if satisfied, feel free to merge the branch ;)

At the moment the implementation is 'basic', so if new tools are added, the upset.R script also needs to be updated. That could be enhanced with some R coding.

The upset plot shows the same content a venn diagram would show, simply as bars, which makes it easier to check specific sets.
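One way to avoid hard-coding the tool list is to derive it from whatever result files are present. The sketch below shows the idea in Python rather than R; it assumes one `<tool>.txt` file per tool with one contig name per line (only `virsorter.txt` appears in this thread, the rest of the layout is an assumption), and `load_predictions` is a hypothetical helper.

```python
from pathlib import Path
import tempfile


def load_predictions(result_dir):
    """Read every <tool>.txt in result_dir into {tool: set of contig names}."""
    predictions = {}
    for path in sorted(Path(result_dir).glob("*.txt")):
        with path.open() as handle:
            names = {line.strip() for line in handle if line.strip()}
        predictions[path.stem] = names
    return predictions


# Tiny demo with two made-up result files.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "virsorter.txt").write_text("contig_1\ncontig_2\n")
    Path(tmp, "metaphinder.txt").write_text("contig_2\n\ncontig_3\n")
    predictions = load_predictions(tmp)

tools = sorted(predictions)
```

With this layout, adding a new tool would only mean adding one more file, since the tool names are taken from the file names.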

replikation commented 4 years ago

Nice, I'll check that out.

replikation commented 4 years ago

closing this, see #27 for performance