Closed shiraz-shah closed 3 years ago
I've never benchmarked the difference between the two options. But it's a tricky choice. The likelihoods are actually comparable between models, it is the probability of emission of the phage contig by the associated host Markov chain. When taking the model best withthe best likelihood, we wanted to have a measure of how significant/good the hit was, and that's why we devised this pvalue. But if you take a model of a bacteria that has a very strong bias (let's say very high GC content), then it may be that the usual likelihood it gets on phage contigs is very low (i.e. low mean likelihood with little variance on contigs), but as soon as one contig will have a slight bias towards hogh GC, then the likelihood of the model will increase and the pvalue may change dramaticallly for this model. Meaning that it's an exceptional value for this model, even if there are models that fit better the contig with higher likelihood. On top of that, I'd consider the p-value as "approximative" since the parameters to compute it is inferred from noisy data (from contigs that are supposed not to infect the host). So it can be useful to assess the significance of a hit (i.e. binary decision), but i'd say not enough to sort the hits).
That's why in my opinion it is better to select the best likelihoods among those that make sense (i.e. having a significant pvalue for them).
A proper benchmarking of this would be needed to confirm one option or the other, I'll let you know if I find time to do that.
Best, Clovis.
I can help you benchmark, as I have a giant data set, and am comparing five different host prediction methods. But I'm no wiz at R, so I'll need you to write the code for cutting the matrices and generating tables etc.
Here are the top 20 host genera I get for a gut virome with the WIsH best prediction at a p <= 0.05 cutoff:
1051 "Lachnoclostridium"
772 "Faecalibacterium"
602 "Clostridium"
536 "Blautia"
393 "Ruminococcus"
373 "Bacteroides"
355 "Paenibacillus"
331 "Bifidobacterium"
312 "Clostridioides"
305 "Bacillus"
292 NA
277 "Escherichia"
226 "Tumebacillus"
208 "Flavonifractor"
198 "Oscillibacter"
164 "Roseburia"
143 "Parabacteroides"
141 "Eubacterium"
141 "Akkermansia"
140 "Prevotella"
And here's the same with top5 using your code and the lca program by P. Menzel:
7591 NA
327 Clostridium
242 Bifidobacterium
241 Escherichia
156 Bacillus
152 Bacteroides
54 Clostridioides
52 Paenibacillus
43 Lactobacillus
20 Veillonella
16 Haemophilus
16 Enterococcus
13 Flavobacterium
12 Porphyromonas
11 Prevotella
11 Lactococcus
11 Citrobacter
10 Neisseria
10 Mannheimia
9 Streptococcus
For me the former looks better than the latter because it fits our expectations about what bacteria we know are present in the gut. Also, the latter seems a tad too conservartive. Even at the order level, most phage are unassigned:
5565 NA
1815 Clostridiales
466 Enterobacterales
335 Bacteroidales
330 Bacillales
242 Bifidobacteriales
87 Pasteurellales
87 Lactobacillales
61 Flavobacteriales
20 Veillonellales
12 Rickettsiales
12 Pseudomonadales
11 Burkholderiales
10 Neisseriales
8 Rhizobiales
6 Micrococcales
4 Eggerthellales
4 Desulfurobacteriales
2 Rhodobacterales
2 Propionibacteriales
To me it looks like, using the top5 method, WIsH is good at predicting hosts for whom many genome representatives exist, while hosts that have too few genome representatives can't populate the top5 list, and will end up defaulting to a lower taxonomical hierarchy after lca.
I don't know if top3 could be a better choice, or maybe sorting by p-value at the phage level could make a difference. Or maybe one needs to look at the difference in ll value between the best and second-best match. If that difference is too high, only top1 should be taken, so top n for each phage or host depends on the development of the ranked ll values.
Interesting! The piece of code I provided is supposed to kick out all the insignificant hits and to perform lca on the 5 at most top hits, fewer if not enough hits are found. Maybe spurious hits with low pvalue make their way to this list and artificially increase the rank level.
Here is a script to help you in the filtering: number_of_top_pred = 5 pValThs = 0.001 deltaLikelihood = 0.01 # to be benchmarked
number_of_top_pred = 5 pValThs = 0.3 deltaLikelihood = 0.008 # to be benchmarked
library(dplyr) library(tidyr)
pvals = read.table("outputResultDir/pvalues.matrix") pvals$model = rownames(pvals) ll = read.table("outputResultDir/llikelihood.matrix") ll$model = rownames(ll)
pvalList = pvals %>% gather(phage,pval,-model) llList = ll %>% gather(phage,likelihood,-model) joinedList = pvalList %>% inner_join(llList, by=c("model","phage"))
bestLLamongSignificant = as.matrix(joinedList %>% filter(pval<pValThs) %>% group_by(phage) %>% top_n(n=number_of_top_pred,wt=likelihood))
bestPvalHits = as.matrix(joinedList %>% group_by(phage) %>% top_n(n=number_of_top_pred,wt=-pval))
bestGapHits = joinedList %>% group_by(phage) %>% arrange(-likelihood) %>% mutate(rolldiff = cumsum(lag(likelihood,default=0)-likelihood>deltaLikelihood)) %>% filter(rolldiff<2)
bestGapSignificantHits = joinedList %>% filter(pval<pValThs) %>% group_by(phage) %>% arrange(-likelihood) %>% mutate(rolldiff = cumsum(lag(likelihood,default=0)-likelihood>deltaLikelihood)) %>% filter(rolldiff<2)
Do not hesitate to ask me further scripts!
Best, Clovis.
Thanks, Clovis! Will try this and get back to you.
Hej Clovis, Why is the ll matrix first filtered by pvalue cutoff and then sorted by ll value to take the top 5 phages per model?
Wouldn't it make more sense to take the top5 lowest p-values per phage? I know the ll values are not comparable accross different models, but the p-values must be, right?