im3sanger / dndscv

dN/dS methods to quantify selection in cancer and somatic evolution
GNU General Public License v3.0
212 stars 48 forks source link

Whole genome vs exome differences #20

Closed fpbarthel closed 5 years ago

fpbarthel commented 5 years ago

Hi @im3sanger, I recently noticed that dNdSCV on whole genomes gives very different results from dNdSCV on whole exomes.

For example, from these 23 TCGA samples with both whole genome and whole exome, the results are very different. Should you use different parameters here? Remove non-coding variants before running?

Genome (n=23)

> print(dnds_all$globaldnds)
     name      mle     cilow   cihigh
wmis wmis 1.089097 0.9772747 1.213714
wnon wnon 1.138530 0.8753310 1.480868
wspl wspl 1.188996 0.8521644 1.658966
wtru wtru 1.156912 0.9331895 1.434270
wall wall 1.093992 0.9829469 1.217583
> print(dnds_all$nbreg$theta)
[1] 0.9770172

Exome (n=23)

> print(dnds_all$globaldnds)
     name      mle     cilow   cihigh
wmis wmis 1.187634 1.0674560 1.321341
wnon wnon 1.032641 0.7953789 1.340678
wspl wspl 1.460439 1.0803810 1.974194
wtru wtru 1.182657 0.9626238 1.452984
wall wall 1.183988 1.0655856 1.315548
> print(dnds_all$nbreg$theta)
[1] 11.52464

While limiting the dNdSCV analysis by setting gene_list = known_cancergenes greatly improves the theta, the results stay the same:

Genome (n=23)

> print(dnds_all$globaldnds)
     name       mle     cilow   cihigh
wmis wmis 1.0636743 0.8803616 1.285157
wnon wnon 1.5195347 0.9428170 2.449029
wspl wspl 0.7307095 0.4085577 1.306881
wtru wtru 1.0845740 0.7374624 1.595065
wall wall 1.0665454 0.8845950 1.285921
> print(dnds_all$nbreg$theta)
[1] 2464.472

Exome (n=23)

> print(dnds_all$globaldnds)
     name      mle     cilow   cihigh
wmis wmis 1.258691 1.0613330 1.492749
wnon wnon 1.968530 1.3019635 2.976359
wspl wspl 1.180850 0.7593793 1.836246
wtru wtru 1.519065 1.1057168 2.086935
wall wall 1.280506 1.0821167 1.515268
> print(dnds_all$nbreg$theta)
[1] 3160.932
im3sanger commented 5 years ago

Hello,

dNdScv only considers coding mutations. If you feed whole-genome data, dNdScv will simply discard the mutations that do not occur within coding exons or splice sites before doing anything else. Thus, if you have whole-genome data and feed all mutations or only coding mutations, the results will be the same. This means that the differences that you are seeing between exomes and genomes are not due to dNdScv, but to what must be substantial differences in the coding mutations called in these samples by genome and exome sequencing. You may want to compare the coding mutations called in these 23 samples using dndsout$annotmuts.

In the second part of your question you say that restricting the analysis to cancer genes greatly improves theta. Just keep in mind that a higher theta is not necessarily better. The very high theta values that you get on cancer genes probably reflects that the dataset is too small (too few synonymous mutations per gene in known cancer genes) to estimate theta accurately.

Have a look at the coding mutations found in the 23 samples by genome and exome sequencing, but it feels like the calls from the genome data may have some issues.

Inigo

fpbarthel commented 5 years ago

Hi @im3sanger ,

They are very similar I would say, take a look:

> dim(dnds_wgs$annotmuts)
[1] 2184   17
> dim(dnds_wxs$annotmuts)
[1] 2394   17
> dim(inner_join(dnds_wgs$annotmuts, dnds_wxs$annotmuts))
[1] 1378   17

Amongst the roughly 2200-2400 mutations in each, 1400 are present in both WGS and WXS. Do you think the 800 mutations in the WGS that are not in the WXS are causing this big shift?

The samples in the WGS and WXS are for the most part identical:

> n_distinct(dnds_wgs$annotmuts$sampleID)
[1] 21
> n_distinct(dnds_wxs$annotmuts$sampleID)
[1] 23
> n_distinct(intersect(dnds_wgs$annotmuts$sampleID, dnds_wxs$annotmuts$sampleID))
[1] 20

I did notice there's a pretty big difference in the model outputs, with the WXS model not using any covariates? Why does it chose to ignore them?

> summary(dnds_wxs$nbreg)

Call:
MASS::glm.nb(formula = n_syn ~ offset(log(exp_syn)) - 1, data = nbrdf, 
    init.theta = 11.52463756, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8285  -0.2469  -0.1892  -0.1424   4.0459  

No Coefficients

(Dispersion parameter for Negative Binomial(11.5246) family taken to be 1)

    Null deviance: 3483.9  on 20090  degrees of freedom
Residual deviance: 3483.9  on 20090  degrees of freedom
AIC: 4528.8

Number of Fisher Scoring iterations: 0

              Theta:  11.5 
          Std. Err.:  42.3 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -4526.767 
> summary(dnds_wgs$nbreg)

Call:
MASS::glm.nb(formula = n_syn ~ offset(log(exp_syn)) + ., data = nbrdf, 
    init.theta = 0.9770171551, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.1613  -0.2390  -0.1777  -0.1288   5.5053  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.100661   0.061591  -1.634 0.102186    
exp_syn      0.484643   0.748659   0.647 0.517406    
PC1          0.012614   0.004383   2.878 0.004004 ** 
PC2          0.034402   0.005985   5.748 9.06e-09 ***
PC3          0.025411   0.008510   2.986 0.002827 ** 
PC4          0.040367   0.011673   3.458 0.000544 ***
PC5         -0.042842   0.012824  -3.341 0.000836 ***
PC6         -0.034235   0.016207  -2.112 0.034656 *  
PC7         -0.070732   0.018629  -3.797 0.000147 ***
PC8          0.033492   0.023842   1.405 0.160086    
PC9         -0.011490   0.024742  -0.464 0.642356    
PC10        -0.026777   0.027525  -0.973 0.330639    
PC11        -0.026584   0.032752  -0.812 0.416987    
PC12         0.021707   0.031055   0.699 0.484574    
PC13        -0.028620   0.031472  -0.909 0.363152    
PC14         0.006419   0.033855   0.190 0.849621    
PC15        -0.013977   0.035549  -0.393 0.694192    
PC16         0.029865   0.035469   0.842 0.399799    
PC17        -0.021073   0.036355  -0.580 0.562160    
PC18         0.051828   0.040047   1.294 0.195601    
PC19        -0.023557   0.042186  -0.558 0.576565    
PC20         0.010463   0.043566   0.240 0.810204    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for Negative Binomial(0.977) family taken to be 1)

    Null deviance: 3061.2  on 20089  degrees of freedom
Residual deviance: 2940.4  on 20068  degrees of freedom
AIC: 4329.4

Number of Fisher Scoring iterations: 1

              Theta:  0.977 
          Std. Err.:  0.375 

 2 x log-likelihood:  -4283.367 

Floris

im3sanger commented 5 years ago

Thanks Floris. It is hard for me to know exactly what is happening without having the data. But, yes, dndscv will output the same results given the same set of coding mutations, even if you also add loads of noncoding mutations as these will be excluded. So, the reason for these differences has to be in the 800+ mutations that are different between both datasets.

In general, low theta values reflect large unexplained variation of the number of synonymous across genes, so you could look for genes with an apparent excess of synonymous mutations in the WGS dataset. Or even in the 800+ mutations that are unique to the WGS dataset. There may be some SNP contamination or recurrent artefacts.

Regarding your question about covariates, dNdScv will first try to use covariates using the negative binomial framework (glm.nb). But sometimes this can fail or issue a warning (for example when failing to converge), and dNdScv will then resort to running without covariates. Since theta is still very high in the WXS dataset without covariates, this was clearly not a problem in your case.

Best, Inigo