im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0

significant Q value with zero mutations #28

Closed emham closed 5 years ago

emham commented 5 years ago

Hi,

I am using dndscv with targeted sequencing data and finding that some genes have significant Q values despite having zeros in all mutation count columns (n_syn, n_mis, n_non, n_spl).

> print(sel_cv_noMuts[sel_cv_noMuts$qallsubs_cv < .05, ])

  gene_name n_syn n_mis n_non n_spl wmis_cv wnon_cv wspl_cv      pmis_cv
5    RNF213     0     0     0     0       0       0       0 1.125159e-06
7    DNAH17     0     0     0     0       0       0       0 2.880643e-06
  ptrunc_cv  pallsubs_cv      qmis_cv qtrunc_cv  qallsubs_cv
5 0.1637897 2.706072e-06 0.0000993890 0.8239136 0.0001434218
7 0.1571009 6.470959e-06 0.0001908426 0.8239136 0.0002449720

I find this counterintuitive, so I am wondering whether this is a bug or whether there is a rationale for it.

Thanks, Emily

im3sanger commented 5 years ago

Hi Emily,

This behaviour arises because dNdScv uses a two-sided statistical test that detects any significant deviation from neutrality, so a gene can be significant because of either positive or negative selection. Given their size and sequence, these genes would be expected to carry several non-synonymous mutations under neutrality, yet they have none. dNdScv interprets this as evidence of negative selection (notice that the columns with the dN/dS ratios, wmis_cv, wnon_cv and wspl_cv, are zero).
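To illustrate why a gene with zero mutations can still reach a very small p-value, here is a minimal sketch using a plain Poisson model. This is a simplification for intuition only, not the test dNdScv actually performs, and the expected counts are hypothetical values chosen for the example.

```r
# Hypothetical expected numbers of non-synonymous mutations under neutrality
# (illustrative values, not taken from any real dNdScv run).
expected <- c(2, 5, 10, 15)

# Under a simple Poisson model, the probability of observing zero mutations
# when `expected` are predicted is exp(-expected).
p_zero <- dpois(0, lambda = expected)

# The larger the expected count, the more surprising an empty gene becomes.
print(data.frame(expected = expected, p_zero = p_zero))
```

The probability of seeing no mutations falls off exponentially with the expected count, which is why large genes with no observed mutations can look strongly depleted.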

Given what we know of negative selection in cancer, it is very unlikely that this is a real result. My suspicion is that either you do not have good coverage in these genes or the mutations in them were filtered out during variant calling. Ideally, dNdScv should be run only on genes with good coverage (using the gene_list argument).
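A minimal sketch of how a list of well-covered genes might be built before calling dNdScv. The coverage table, its column names, the gene names and the 0.9 threshold are all hypothetical; only the gene_list argument comes from the discussion above.

```r
# Hypothetical per-gene coverage summary from a targeted panel
# (gene names and fractions are made up for illustration).
coverage <- data.frame(
  gene_name    = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  frac_covered = c(0.98, 0.12, 0.08, 0.95),
  stringsAsFactors = FALSE
)

# Assumed threshold: keep genes with at least 90% of coding bases covered.
min_frac <- 0.9
covered_genes <- coverage$gene_name[coverage$frac_covered >= min_frac]

# The restricted list would then be passed to dNdScv, for example:
# dndsout <- dndscv(mutations, gene_list = covered_genes)
print(covered_genes)
```

The appropriate threshold depends on the panel design and the variant caller, so it is best treated as a parameter to tune, not a fixed rule.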

You could make the test one-sided by only using the p-values of genes where dN/dS (w) is >1 (in which case you could also use pval/2 as the one-sided p-value). However, my advice would be to understand why you are missing mutations in some genes and to run dndscv again once the problem is solved. This will likely increase the power by increasing the value of the theta parameter.
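The one-sided conversion described above could be sketched as follows. The data frame is a toy stand-in for dndsout$sel_cv with made-up values. Setting the p-value to 1 for genes with w <= 1 and re-adjusting with Benjamini-Hochberg are assumptions of this sketch, not something prescribed in the thread.

```r
# Toy stand-in for dndsout$sel_cv (values are illustrative, not real results).
sel_cv <- data.frame(
  gene_name = c("GENE_A", "GENE_B", "GENE_C"),
  wmis_cv   = c(0, 3.2, 1.4),
  pmis_cv   = c(1e-6, 2e-4, 0.30),
  stringsAsFactors = FALSE
)

# One-sided p-value for positive selection: halve the two-sided p-value
# where dN/dS > 1; elsewhere set it to 1 (an assumption of this sketch).
sel_cv$pmis_pos <- ifelse(sel_cv$wmis_cv > 1, sel_cv$pmis_cv / 2, 1)

# Recompute q-values on the one-sided p-values (Benjamini-Hochberg assumed).
sel_cv$qmis_pos <- p.adjust(sel_cv$pmis_pos, method = "BH")

# Genes significant for positive selection only.
pos_hits <- sel_cv[sel_cv$qmis_pos < 0.05, "gene_name"]
print(pos_hits)
```

With this filter, a gene such as GENE_A, which has w = 0 and a tiny two-sided p-value, no longer appears among the hits.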

Best wishes, Inigo

emham commented 5 years ago

Thanks, Inigo. That makes sense. I thought maybe synonymous mutations had to be present in order for negative selection to be detected. I'll try your suggestions.

im3sanger commented 5 years ago

Thanks Emily. The dNdSloc model (see dndsout$sel_loc), which is a more traditional dN/dS model, does require synonymous mutations to be present to detect negative selection. In contrast, dNdScv uses all mutations available across genes to estimate the background rate expected for each gene (taking into account its sequence, the mutational signatures and covariates), and so does not solely rely on synonymous mutations in a particular gene.
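As a rough illustration of why a gene-local model needs synonymous mutations, consider a simplified dN/dS ratio in which the neutral rate is estimated only from the gene's own synonymous mutations. This is not the dNdSloc implementation; the function name and the expected counts are hypothetical.

```r
# Simplified gene-local dN/dS: observed/expected non-synonymous mutations
# divided by observed/expected synonymous mutations in the same gene.
# (Hypothetical helper for illustration; not part of the dndscv package.)
local_dnds <- function(n_non, n_syn, exp_non, exp_syn) {
  (n_non / exp_non) / (n_syn / exp_syn)
}

# With no synonymous mutations the local background rate is zero,
# so the ratio is undefined.
w_no_syn <- local_dnds(n_non = 0, n_syn = 0, exp_non = 12, exp_syn = 4)

# With some synonymous mutations, a depletion of non-synonymous ones is visible.
w_with_syn <- local_dnds(n_non = 0, n_syn = 3, exp_non = 12, exp_syn = 4)

print(c(w_no_syn = w_no_syn, w_with_syn = w_with_syn))
```

In the first case the local ratio carries no information, whereas a background rate borrowed from other genes, as dNdScv does, still gives a usable expectation.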

Best, Inigo