jessieren / VirFinder

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

Q value smaller than P value #6

Closed feargalr closed 6 years ago

feargalr commented 6 years ago

Hello,

I just ran the program for the first time on a set of just over 5000 contigs, most (or all) of which should be viral. However, after attaching q-values with the VF.qvalue function, many q-values are smaller than the original p-values; for example, a p-value of 0.98 became a q-value of 0.04. This seems like a bug, unless the q-value adjustment does something other than adjust p-values for multiple testing?

Thanks

jessieren commented 6 years ago

Hi there,

Firstly, thank you for using VirFinder :)

VirFinder uses the existing function "qvalue" from the R package "qvalue" by John D. Storey. It estimates a q-value for each p-value, that is, the smallest estimated false discovery rate at which that contig would still be predicted as viral.

So here is my understanding. Since most (or all) of your contigs are viral, the false discovery rate (i.e., the proportion of predicted viral contigs that are actually bacterial) can still be very low even with a p-value threshold as large as 0.98. The overall bacterial content is very low, so even if every contig is predicted as viral, only a very small proportion of those predictions are bacterial contigs wrongly called viral. Thus, the q-value of 0.04 may be a rough indicator of the fraction of bacterial contigs in your sample.
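To illustrate the point above, here is a minimal base-R sketch of a Storey-style q-value calculation on simulated p-values. Everything in it is an assumption made for illustration: the helper name `storey_q`, the fixed `lambda = 0.5` (the qvalue package by default smooths its estimate over a grid of lambda values), and the simulated mix of 4750 viral and 250 bacterial contigs. It is not the actual code of VirFinder or of the qvalue package.

```r
# Hypothetical helper: Storey-style q-values with a single fixed lambda.
storey_q <- function(p, lambda = 0.5) {
  m <- length(p)
  # Estimated fraction of true nulls (here: bacterial contigs).
  pi0 <- min(1, mean(p > lambda) / (1 - lambda))
  o  <- order(p, decreasing = TRUE)  # largest p-value first
  ro <- order(o)                     # permutation that restores the input order
  # Assumes no tied p-values, so the k-th largest p-value has rank m - k + 1.
  q <- pi0 * pmin(1, cummin(p[o] * m / (m:1)))[ro]
  list(pi0 = pi0, q = q)
}

set.seed(1)
# Simulated data (an assumption): 4750 viral contigs with small p-values,
# 250 bacterial contigs with uniformly distributed p-values.
p <- c(rbeta(4750, 1, 200), runif(250))
res <- storey_q(p)

worst <- which.max(p)
cat("estimated pi0:", res$pi0, "\n")
cat("largest p-value:", p[worst], " its q-value:", res$q[worst], "\n")
```

For the contig with the largest p-value, the q-value reduces to the estimated pi0 times that p-value. So when nearly all contigs are viral, a p-value close to 1 can carry a q-value close to the estimated bacterial fraction.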

Please let me know if my understanding is not correct. Thanks for your question.

Best wishes, Jessie


In reply to feargalr's message of Thursday, September 21, 2017 5:04:05 AM.

feargalr commented 6 years ago

Hi Jessie,

Thanks for the quick reply. I ran VirFinder on a shotgun metagenomic dataset with high levels of bacterial contamination, and the results seemed a bit more normal (not every contig has a q-value below 0.05). But it still seems odd to me.

Another case that seemed strange to me was a contig with an extremely low VirFinder score (7.176055e-09) and a p-value of 1 but a q-value of 0.0505.

Thanks, Feargal

jessieren commented 6 years ago

Hi Feargal,

Thanks for taking the time to test different cases. It is good that the q-values for the metagenomic dataset look more normal.

Note that the false discovery rate is defined as (# of bacterial contigs wrongly predicted as viral) / (# of contigs predicted as viral). The expected number of bacterial contigs wrongly predicted as viral = (the probability that a bacterial contig is wrongly predicted as viral) * (the number of bacterial contigs in the sample). Thus, the false discovery rate may not be that small even if that probability is small, because a sample with many bacterial contigs can still yield many false positives.
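The arithmetic above can be sketched in a few lines of R. The function name `expected_fdr`, the `power` input (the fraction of viral contigs that pass the cutoff), and all of the counts are made-up values for illustration, not numbers from either dataset in this thread.

```r
# Expected FDR among contigs called viral at a p-value cutoff `alpha`.
# `power` (fraction of viral contigs with p <= alpha) is a hypothetical input.
expected_fdr <- function(n_bact, n_viral, alpha, power) {
  false_pos <- alpha * n_bact    # bacterial contigs wrongly called viral
  true_pos  <- power * n_viral   # viral contigs correctly called viral
  false_pos / (false_pos + true_pos)
}

# Same cutoff and power, different sample composition (made-up numbers).
fdr_mostly_bact  <- expected_fdr(n_bact = 4900, n_viral = 100,  alpha = 0.01, power = 0.8)
fdr_mostly_viral <- expected_fdr(n_bact = 250,  n_viral = 4750, alpha = 0.01, power = 0.8)

print(c(mostly_bacterial = fdr_mostly_bact, mostly_viral = fdr_mostly_viral))
```

With these made-up numbers, the same 1% per-contig error rate gives an expected FDR of about 38% in the mostly bacterial sample and well under 1% in the mostly viral one.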

Another possibility is that the method does not provide good FDR estimates. Honestly, no FDR method is perfect, and each method only performs well when its underlying assumptions are satisfied. Since "VF.qvalue" is exactly the "qvalue" function from the qvalue package, you can call qvalue directly on the p-values, which gives you access to more of its options:

    library(qvalue)
    qobj <- qvalue(predResult$pvalue)  # q-values for the VirFinder p-values
    hist(qobj)                         # plot the p-value histogram

The qvalue function has several options, such as fdr.level and pfdr. You can try different settings on your data and see whether any of them give results that make more sense.
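As a rough sketch of what exploring these options could look like: the p-values below are simulated as a stand-in for `predResult$pvalue` (an assumption), and the calls are only run if the qvalue package from Bioconductor is installed. See `?qvalue` for the authoritative description of each argument.

```r
# Simulated p-values stand in for predResult$pvalue (an assumption).
set.seed(1)
p <- c(rbeta(4750, 1, 200), runif(250))

if (requireNamespace("qvalue", quietly = TRUE)) {
  # Default settings.
  qobj <- qvalue::qvalue(p)
  # fdr.level adds a logical `significant` vector to the result.
  qobj_fdr <- qvalue::qvalue(p, fdr.level = 0.05)
  # pfdr = TRUE requests the pFDR-based estimate (see ?qvalue).
  qobj_pfdr <- qvalue::qvalue(p, pfdr = TRUE)

  cat("estimated pi0:", qobj$pi0, "\n")
  cat("contigs with q <= 0.05:", sum(qobj_fdr$significant), "\n")
  print(summary(qobj_pfdr$qvalues))
}
```

The estimated `pi0` is worth checking first, since every q-value is scaled by it.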

On the other hand, although the method John Storey developed is one of the most widely used, there are other FDR-controlling methods, such as Benjamini & Hochberg. Trying a different method may be a good idea. For your reference, Korbinian Strimmer maintains a webpage listing FDR control packages: http://strimmerlab.org/notes/fdr.html.
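For example, the Benjamini & Hochberg adjustment is available in base R through `p.adjust`, so it can be tried without extra packages. The p-values below are simulated as a stand-in for the VirFinder output (an assumption). Unlike Storey's q-values, BH-adjusted p-values are never smaller than the raw p-values, because BH does not scale by an estimated null fraction.

```r
# Simulated p-values stand in for the VirFinder p-values (an assumption).
set.seed(1)
p <- c(rbeta(4750, 1, 200), runif(250))

# Benjamini-Hochberg adjusted p-values from base R (stats::p.adjust).
bh <- p.adjust(p, method = "BH")

# Check whether any adjusted value fell below its raw p-value.
cat("any BH value below its p-value:", any(bh < p), "\n")
cat("contigs with BH-adjusted p <= 0.05:", sum(bh <= 0.05), "\n")
```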

In addition, if q-values are not desirable, setting a threshold on the p-value is another simple way to make predictions.
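A minimal sketch of p-value thresholding follows. The mock `predResult` table, its column names and values, and the cutoff `p_cutoff` are all assumptions for illustration; substitute the real VirFinder output and a cutoff suited to your data.

```r
# Mock of a VirFinder result table; column names and values are assumptions.
predResult <- data.frame(
  name   = c("contig_1", "contig_2", "contig_3"),
  score  = c(0.98, 0.40, 0.001),
  pvalue = c(0.002, 0.20, 0.95),
  stringsAsFactors = FALSE
)

p_cutoff <- 0.01  # hypothetical threshold
# Keep only the contigs whose p-value is at or below the cutoff.
viral_calls <- predResult[predResult$pvalue <= p_cutoff, ]
print(viral_calls$name)
```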

Thank you for bringing up this discussion. Hope it helps! :)

Best wishes, Jessie

In reply to feargalr's message of Fri, Sep 22, 2017 at 8:14 AM.

-- Jie Jessie Ren Postdoc in Computational Biology and Bioinformatics University of Southern California 1050 Childs Way, RRI 201

feargalr commented 6 years ago

Thanks