jessieren / VirFinder

VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data

defining q value threshold #11

Closed jzrapp closed 4 years ago

jzrapp commented 5 years ago

Hi,

I'm using VirFinder for the first time and am trying to figure out where to set the threshold for separating "good" predictions from "bad" ones. It's not clear to me from the corresponding publications how you selected the threshold there. Could you help, @jessieren?

Thanks!

jessieren commented 5 years ago

Hi jzrapp,

Thank you for using VirFinder. In our paper, we chose to set a threshold on the q-value. For example, we selected the contigs with q-value < 15%, so that the false discovery rate was controlled at 15%.

Hope that helps.

Jessie

jzrapp commented 5 years ago

Thank you for the reply, Jessie! I found the threshold in your publication only by browsing through the supplementary tables. I cannot find an explanation in your methods for why you chose 15%. Was there a rationale behind it, or was 15% picked arbitrarily? I'm trying to figure out whether there is a "typical/standard" threshold one should use for this type of analysis.

Thanks, Josephine

jessieren commented 5 years ago

Hi Josephine,

We chose 15% first because we wanted to keep the false discovery rate below 15%. In addition, in that particular application, we wanted to make a fair comparison between VirFinder and VirSorter on their disease prediction accuracy using viruses (page 9, example application). VirSorter predicted 2657 viruses from the data, so we selected the same number of contigs with the highest VirFinder scores. The false discovery rate for those 2657 VirFinder-predicted viruses turned out to be 15%.
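The "match the number of predictions, then read off the false discovery rate" step can be sketched as follows. This is a minimal illustration, not the code used in the paper: the contig names, scores, and q-values are made up, and it assumes that q-values do not increase as the score increases, so the largest q-value among the top N contigs is the estimated false discovery rate for that set.

```python
# Hypothetical records: (contig name, VirFinder score, q-value).
# All values are invented for illustration only.
predictions = [
    ("contig_1", 0.99, 0.01),
    ("contig_2", 0.97, 0.03),
    ("contig_3", 0.90, 0.08),
    ("contig_4", 0.85, 0.15),
    ("contig_5", 0.60, 0.40),
    ("contig_6", 0.30, 0.75),
]


def top_n_with_fdr(records, n):
    """Return the n highest-scoring records and the largest q-value among them."""
    ranked = sorted(records, key=lambda r: r[1], reverse=True)
    top = ranked[:n]
    implied_fdr = max(q for _, _, q in top)
    return top, implied_fdr


# Suppose the other tool called 4 contigs viral; take the same number here.
top, implied_fdr = top_n_with_fdr(predictions, 4)
top_names = [name for name, _, _ in top]
print(top_names, implied_fdr)
```

With these made-up numbers, the four best-scoring contigs have an implied false discovery rate of 0.15, which mirrors how the 15% figure arose from matching VirSorter's 2657 predictions.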

In practice, we suggest that users choose a threshold based on how high a false discovery rate they can tolerate. Commonly used thresholds range from 5% to 15%. The choice is a trade-off between precision and recall: a stricter threshold yields fewer false calls but misses more true viruses.
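To see the trade-off concretely, one can count how many contigs survive at each candidate threshold. The sketch below assumes the q-values have already been computed and exported into a plain mapping; the contig names and values are hypothetical.

```python
def select_by_qvalue(qvalues, threshold):
    """Return the names of contigs whose q-value is strictly below the threshold."""
    return [name for name, q in qvalues.items() if q < threshold]


# Hypothetical q-values for five contigs.
qvalues = {
    "contig_1": 0.01,
    "contig_2": 0.04,
    "contig_3": 0.08,
    "contig_4": 0.12,
    "contig_5": 0.40,
}

# A looser threshold keeps more contigs (higher recall) at the cost of
# a larger expected fraction of false calls (lower precision).
counts = {}
for threshold in (0.05, 0.10, 0.15):
    counts[threshold] = len(select_by_qvalue(qvalues, threshold))
    print(f"q < {threshold:.2f}: {counts[threshold]} contigs kept")
```

On real data, the jump in the number of kept contigs between 5% and 15% gives a feel for how much recall each extra point of tolerated false discovery rate buys.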

Thank you!

Best wishes, Jessie

473021677 commented 5 years ago

Hi. As discussed above, commonly used q-value thresholds are 5%-15% when using VirFinder to identify viral scaffolds in assembled metagenomic datasets. But I am not sure what thresholds to choose for the score and the p-value. Maybe the p-value threshold should be set to 0.05? Could you help me?

jessieren commented 5 years ago

Hi there, I would suggest using the q-value rather than the p-value, because the q-value controls the false discovery rate. The p-value relates to the false positive rate, i.e. it tells you, if your input were all bacteria, what percentage of the contigs would be wrongly predicted as viruses. The q-value tells you, among all the contigs you predicted as viruses, what fraction are expected to be wrong calls.
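A small arithmetic sketch shows why a p-value cutoff alone says little about how many of the called viruses are wrong. The sample composition and recall below are hypothetical, and the calculation assumes that a fraction equal to the p-value cutoff of the non-viral contigs passes the cutoff.

```python
def expected_fdr(n_bacterial, n_viral, p_threshold, recall):
    """Expected fraction of false calls among all contigs called viral.

    Assumes a fraction p_threshold of the bacterial contigs passes the cutoff
    and a fraction recall of the viral contigs is detected.
    """
    false_positives = n_bacterial * p_threshold
    true_positives = n_viral * recall
    return false_positives / (false_positives + true_positives)


# Hypothetical sample: 10,000 bacterial contigs, 500 viral contigs,
# p-value cutoff 0.05, and 80% of the viral contigs detected.
fdr = expected_fdr(n_bacterial=10_000, n_viral=500, p_threshold=0.05, recall=0.8)
print(f"expected false discovery rate: {fdr:.2f}")
```

Here roughly 500 bacterial contigs pass the cutoff next to about 400 true viruses, so more than half of the predictions would be wrong even though the false positive rate is only 5%. The rarer viruses are in the sample, the larger this gap becomes.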

473021677 commented 5 years ago

Thank you very much for your reply. So the P-value and the Q-value are both false positive rates: the P-value applies to the prokaryotic contigs and the Q-value to the viral contigs. I used VirFinder to identify viral contigs from the assembled metagenomic datasets at a Q-value threshold of 15%. I also used VirSorter to identify viral contigs. But there is no overlap between the VirFinder and VirSorter predictions. That seems almost impossible, so maybe there is something wrong with my analysis. Could you help me?

jessieren commented 5 years ago

Hi there,

Sorry for my late reply.

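The P-value measures the false positive rate (the fraction of non-viral contigs that are wrongly called viral), and the Q-value measures the false discovery rate (the fraction of contigs called viral that are actually non-viral). They are different by definition.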
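To make the difference tangible, here is a minimal sketch that converts a list of p-values into q-values with the Benjamini-Hochberg procedure. This is a simple, well-known stand-in and is not necessarily the estimator VirFinder uses; the p-values are invented.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values never decrease
    # as p-values increase.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values for five contigs.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = bh_qvalues(pvals)
for p, q in zip(pvals, qvals):
    print(f"p = {p:.3f}  q = {q:.5f}")
```

In this toy example the third and fourth contigs have p-values below 0.05 but q-values slightly above it, so a cutoff of 0.05 selects different sets depending on which quantity it is applied to.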

I do not know the details of the problem you are working on. VirSorter targets viruses that contain known genes, while VirFinder is more general: it does not rely on genes and can potentially find new viruses. The two methods have different focuses.

Hope that helps!

Jessie
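When two tools show no overlap at all, one thing worth ruling out before drawing biological conclusions is a bookkeeping mismatch, such as the two outputs naming the same contig differently. Whether that applies here is unknown. The sketch below only illustrates the check: the prefix and suffix are hypothetical placeholders, not the actual naming scheme of either tool.

```python
def normalize_id(name, prefix="toolB_", suffix="-tagged"):
    """Strip a hypothetical prefix and suffix that one tool adds to contig names."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


def overlap(ids_a, ids_b, normalize=lambda s: s):
    """Return the set of contig names shared by two prediction lists."""
    return {normalize(x) for x in ids_a} & {normalize(x) for x in ids_b}


# Hypothetical outputs of two tools run on the same assembly.
tool_a = ["contig_1", "contig_2", "contig_3"]
tool_b = ["toolB_contig_2-tagged", "toolB_contig_9"]

raw_shared = overlap(tool_a, tool_b)
normalized_shared = overlap(tool_a, tool_b, normalize=normalize_id)
print(sorted(raw_shared), sorted(normalized_shared))
```

If the overlap stays empty after the names have been made comparable, the difference in what the two methods target becomes the more likely explanation.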

473021677 commented 5 years ago

Thanks for your reply. I will try to identify the problem and will ask for your advice if I run into further issues. Thanks.