soedinglab / WIsH

Predict prokaryotic host for phage metagenomic sequences
GNU General Public License v3.0
50 stars 10 forks source link

Advanced analysis sorting #8

Closed shiraz-shah closed 3 years ago

shiraz-shah commented 4 years ago

Hej Clovis, Why is the ll matrix first filtered by pvalue cutoff and then sorted by ll value to take the top 5 phages per model?

Wouldn't it make more sense to take the top5 lowest p-values per phage? I know the ll values are not comparable accross different models, but the p-values must be, right?

ClovisG commented 4 years ago

I've never benchmarked the difference between the two options. But it's a tricky choice. The likelihoods are actually comparable between models, it is the probability of emission of the phage contig by the associated host Markov chain. When taking the model best withthe best likelihood, we wanted to have a measure of how significant/good the hit was, and that's why we devised this pvalue. But if you take a model of a bacteria that has a very strong bias (let's say very high GC content), then it may be that the usual likelihood it gets on phage contigs is very low (i.e. low mean likelihood with little variance on contigs), but as soon as one contig will have a slight bias towards hogh GC, then the likelihood of the model will increase and the pvalue may change dramaticallly for this model. Meaning that it's an exceptional value for this model, even if there are models that fit better the contig with higher likelihood. On top of that, I'd consider the p-value as "approximative" since the parameters to compute it is inferred from noisy data (from contigs that are supposed not to infect the host). So it can be useful to assess the significance of a hit (i.e. binary decision), but i'd say not enough to sort the hits).

That's why in my opinion it is better to select the best likelihoods among those that make sense (i.e. having a significant pvalue for them).

A proper benchmarking of this would be needed to confirm one option or the other, I'll let you know if I find time to do that.

Best, Clovis.

shiraz-shah commented 4 years ago

I can help you benchmark, as I have a giant data set, and am comparing five different host prediction methods. But I'm no wiz at R, so I'll need you to write the code for cutting the matrices and generating tables etc.

Here are the top 20 host genera I get for a gut virome with the WIsH best prediction at a p <= 0.05 cutoff:

   1051 "Lachnoclostridium"
    772 "Faecalibacterium"
    602 "Clostridium"
    536 "Blautia"
    393 "Ruminococcus"
    373 "Bacteroides"
    355 "Paenibacillus"
    331 "Bifidobacterium"
    312 "Clostridioides"
    305 "Bacillus"
    292 NA
    277 "Escherichia"
    226 "Tumebacillus"
    208 "Flavonifractor"
    198 "Oscillibacter"
    164 "Roseburia"
    143 "Parabacteroides"
    141 "Eubacterium"
    141 "Akkermansia"
    140 "Prevotella"

And here's the same with top5 using your code and the lca program by P. Menzel:

   7591 NA
    327 Clostridium
    242 Bifidobacterium
    241 Escherichia
    156 Bacillus
    152 Bacteroides
     54 Clostridioides
     52 Paenibacillus
     43 Lactobacillus
     20 Veillonella
     16 Haemophilus
     16 Enterococcus
     13 Flavobacterium
     12 Porphyromonas
     11 Prevotella
     11 Lactococcus
     11 Citrobacter
     10 Neisseria
     10 Mannheimia
      9 Streptococcus

For me the former looks better than the latter because it fits our expectations about what bacteria we know are present in the gut. Also, the latter seems a tad too conservartive. Even at the order level, most phage are unassigned:

   5565 NA
   1815 Clostridiales
    466 Enterobacterales
    335 Bacteroidales
    330 Bacillales
    242 Bifidobacteriales
     87 Pasteurellales
     87 Lactobacillales
     61 Flavobacteriales
     20 Veillonellales
     12 Rickettsiales
     12 Pseudomonadales
     11 Burkholderiales
     10 Neisseriales
      8 Rhizobiales
      6 Micrococcales
      4 Eggerthellales
      4 Desulfurobacteriales
      2 Rhodobacterales
      2 Propionibacteriales

To me it looks like, using the top5 method, WIsH is good at predicting hosts for whom many genome representatives exist, while hosts that have too few genome representatives can't populate the top5 list, and will end up defaulting to a lower taxonomical hierarchy after lca.

I don't know if top3 could be a better choice, or maybe sorting by p-value at the phage level could make a difference. Or maybe one needs to look at the difference in ll value between the best and second-best match. If that difference is too high, only top1 should be taken, so top n for each phage or host depends on the development of the ranked ll values.

ClovisG commented 4 years ago

Interesting! The piece of code I provided is supposed to kick out all the insignificant hits and to perform lca on the 5 at most top hits, fewer if not enough hits are found. Maybe spurious hits with low pvalue make their way to this list and artificially increase the rank level.

Here is a script to help you in the filtering: number_of_top_pred = 5 pValThs = 0.001 deltaLikelihood = 0.01 # to be benchmarked

number_of_top_pred = 5 pValThs = 0.3 deltaLikelihood = 0.008 # to be benchmarked

library(dplyr) library(tidyr)

pvals = read.table("outputResultDir/pvalues.matrix") pvals$model = rownames(pvals) ll = read.table("outputResultDir/llikelihood.matrix") ll$model = rownames(ll)

pvalList = pvals %>% gather(phage,pval,-model) llList = ll %>% gather(phage,likelihood,-model) joinedList = pvalList %>% inner_join(llList, by=c("model","phage"))

select the best LL results among predictions with good pvalue

bestLLamongSignificant = as.matrix(joinedList %>% filter(pval<pValThs) %>% group_by(phage) %>% top_n(n=number_of_top_pred,wt=likelihood))

select the best pred according to pvalue

bestPvalHits = as.matrix(joinedList %>% group_by(phage) %>% top_n(n=number_of_top_pred,wt=-pval))

select the best preds until gap in LL, the gap is tuned with deltaLikelihood, which value is to be benchmarked

bestGapHits = joinedList %>% group_by(phage) %>% arrange(-likelihood) %>% mutate(rolldiff = cumsum(lag(likelihood,default=0)-likelihood>deltaLikelihood)) %>% filter(rolldiff<2)

the same but only among significant hits

bestGapSignificantHits = joinedList %>% filter(pval<pValThs) %>% group_by(phage) %>% arrange(-likelihood) %>% mutate(rolldiff = cumsum(lag(likelihood,default=0)-likelihood>deltaLikelihood)) %>% filter(rolldiff<2)

Do not hesitate to ask me further scripts!

Best, Clovis.

shiraz-shah commented 4 years ago

Thanks, Clovis! Will try this and get back to you.