im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

Didn't get pglobal_cv and qglobal_cv columns in the output file from the 1st step #67

Closed RupalHatkar closed 3 years ago

RupalHatkar commented 3 years ago

Hello,

I downloaded the package and ran the first set of mutations but didn't get pglobal_cv and qglobal_cv columns. Any suggestions? I didn't get any errors either so not sure what happened.

Thank you, Rupal

im3sanger commented 3 years ago

Hi Rupal,

Thanks for your interest in the package.

pglobal_cv and qglobal_cv are the joint p-values and q-values that combine information from substitutions and indels. If your mutation table contains no coding indels (or fewer than 5 by default, see ?dndscv), the indel model is not run and these columns are not produced. In that case, you can use qallsubs_cv as the "global" q-value. See also issue https://github.com/im3sanger/dndscv/issues/12 for more information.
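As a minimal sketch of that fallback, the snippet below picks the q-value column depending on whether the indel model was run. The sel_cv table here is a made-up toy data frame standing in for dndsout$sel_cv, the helper pick_qvalue_column is hypothetical (it is not part of dndscv), and the 0.1 cutoff is only an illustrative choice.

```r
# Toy stand-in for dndsout$sel_cv from a run without the indel model
# (hypothetical values; a real table has more columns).
sel_cv <- data.frame(
  gene_name   = c("GENE_A", "GENE_B", "GENE_C"),
  qallsubs_cv = c(0.001, 0.20, 0.85),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prefer qglobal_cv when present, else qallsubs_cv.
pick_qvalue_column <- function(tab) {
  if ("qglobal_cv" %in% colnames(tab)) "qglobal_cv" else "qallsubs_cv"
}

qcol <- pick_qvalue_column(sel_cv)

# Genes below the (illustrative) significance cutoff.
signif_genes <- sel_cv[sel_cv[[qcol]] < 0.1, c("gene_name", qcol)]
print(signif_genes)
```

With a real run, the same helper could be applied to dndsout$sel_cv directly.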

Could this explain your result?

Best wishes, Inigo

RupalHatkar commented 3 years ago

Hello Inigo,

Thank you so much! It worked after I included indels. A quick question: I was using this tool to find possible driver mutations, but I only got 3 genes at the significant-gene step. It didn't report TP53 or HRAS, which are mutated in my dataset and are known, previously reported drivers. Is it normal for it to miss known drivers? Do you have any recommendations?

Thank you!! Rupal

im3sanger commented 3 years ago

Hi Rupal,

It depends on how frequent the TP53 and HRAS mutations are in your dataset. If, for example, they occur only once or twice, they may not reach significance. Remember that, by default, dNdScv does not have any prior knowledge of the disease.

One solution is to use the q-values to report genes reaching significance, but also report mutations likely to be drivers based on prior information. You can see the full list of annotated mutations in the dNdScv output: dndsout$annotmuts.
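As a rough sketch of how to check this, the snippet below filters an annotated mutation table for genes of interest and counts their non-synonymous mutations. The annotmuts data frame here is a made-up toy table standing in for dndsout$annotmuts; it assumes the gene and impact columns and the "Synonymous" impact label, so please check these against your own output.

```r
# Toy stand-in for dndsout$annotmuts (hypothetical rows; only the columns used here).
annotmuts <- data.frame(
  sampleID = c("S1", "S2", "S3", "S4", "S5"),
  gene     = c("TP53", "TP53", "HRAS", "GENE_X", "TP53"),
  impact   = c("Missense", "Nonsense", "Missense", "Missense", "Synonymous"),
  stringsAsFactors = FALSE
)

genes_of_interest <- c("TP53", "HRAS")

# Keep non-synonymous coding mutations in the genes of interest.
hits <- annotmuts[annotmuts$gene %in% genes_of_interest &
                    annotmuts$impact != "Synonymous", ]

# Number of such mutations per gene; very low counts are unlikely
# to reach significance on their own.
counts <- table(hits$gene)
print(counts)
```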

There are also statistical ways to use prior information to increase your sensitivity to known drivers. The simplest option is restricted hypothesis testing (RHT); see Lawrence et al., 2014 for more details. You can do this yourself with the dndsout$sel_cv output table by restricting it to a list of known cancer genes and recalculating the q-values. As an example, the code below calculates RHT q-values (qval_RHT) using the version of the Cancer Gene Census provided in the package (v81, 603 cancer genes) as the prior list. It is essential that this list is fixed in advance, before seeing any of the data.

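library(dndscv)
data("dataset_simbreast", package="dndscv")  # loads the example table 'mutations'
data("cancergenes_cgc81", package="dndscv")  # loads the vector 'known_cancergenes'
dndsout = dndscv(mutations)
# Restrict the results to the a priori list of known cancer genes
sel_RHT = dndsout$sel_cv[which(dndsout$sel_cv$gene_name %in% known_cancergenes), ]
# Recompute Benjamini-Hochberg q-values over the restricted list only
sel_RHT$qval_RHT = p.adjust(sel_RHT$pglobal_cv, method="BH")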

Because the multiple-testing correction is applied to far fewer genes, the RHT q-values are more sensitive to positive selection in the list of known cancer genes.
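To illustrate the effect without running the package, here is a small self-contained sketch using only base R. All gene names and p-values are made up: G1 plays the role of a known driver with a modest p-value, and prior_genes is a hypothetical list of 20 genes chosen in advance.

```r
# Toy stand-in for dndsout$sel_cv: 1000 genes with made-up p-values.
sel_cv <- data.frame(
  gene_name  = paste0("G", 1:1000),
  pglobal_cv = c(5e-4, seq(0.01, 1, length.out = 999)),
  stringsAsFactors = FALSE
)

# Hypothetical prior list of 20 genes, fixed before looking at the data.
prior_genes <- paste0("G", 1:20)

# Benjamini-Hochberg correction over all 1000 genes.
sel_cv$qval_all <- p.adjust(sel_cv$pglobal_cv, method = "BH")

# Restricted hypothesis testing: BH correction over the prior list only.
sel_RHT <- sel_cv[sel_cv$gene_name %in% prior_genes, ]
sel_RHT$qval_RHT <- p.adjust(sel_RHT$pglobal_cv, method = "BH")

print(sel_RHT[sel_RHT$gene_name == "G1", c("gene_name", "qval_all", "qval_RHT")])
```

In this toy table, G1 has the same p-value in both analyses, but its q-value drops from 0.5 across all genes to 0.01 within the restricted list.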

Best wishes, Inigo