Help troubleshooting global dn/ds

njbernstein commented 4 years ago

Hi there,

I'm working with your package and the results seem to broadly follow what I was expecting, e.g. some known driver genes are close to the top. But the global dn/ds estimate for most types of mutations is <<1.

> print(dndsout$globaldnds)
     name         mle       cilow      cihigh
wmis wmis 0.605290545 0.601694036 0.608908552
wnon wnon 1.027952516 1.013248787 1.042869617
wspl wspl 0.007154394 0.005980972 0.008558032
wtru wtru 0.610550564 0.602069469 0.619151128
wall wall 0.608759395 0.605190342 0.612349496

This is on exome data of about 50,000 samples with relatively low coverage, 30X. I've pretty strictly filtered out variants whose allele fraction appears germline using a binomial test and removed genes with low coverage. So I don't think I have germline contamination.

Anything, in particular, I can look into to troubleshoot this problem, e.g. identify problematic genes?

im3sanger commented 4 years ago

Hello,

It is difficult for me to give a good answer based on this information alone. Overall, these ratios look very strange. Are these cancer exomes without a matched normal?

If this is cancer data, the low dN/dS ratios for missense mutations are worrying and could suggest SNP contamination. I would recommend checking whether your mutations overlap known SNP sites (e.g. gnomAD). This, however, does not explain the high dN/dS ratios for nonsense mutations, which I cannot explain. Could you provide more information on this dataset?

The dN/dS ratios for essential splice site mutations are near zero, suggesting that the mutation calls have been filtered out to regions overlapping exonic sequences. I wonder whether the depletion of missense mutations may also be due to some filtering of the data?

The simplest explanation I can think of is some unusual filtering of the variants based on their functional impact and their overlap with coding sequences.

Inigo

njbernstein commented 4 years ago

Hi there,

Thanks for the quick response. This exome data is from PBMCs of a healthy cohort. FASTQs were aligned to grch38. Variants were called using Mutect2 and the default Mutect2 filters were applied, i.e. only variants with a PASS filter. The only additional filters I have added manually are requiring the alternate allele to be supported by 4 alleles, and the p-value for a binomial test to <.001 with the null being a rate of .5 .

I am using grch38.p12 assembly from https://github.com/im3sanger/dndscv_data/blob/master/data/RefCDS_human_GRCh38.p12.rda for annotation.

Nothing is popping out in terms of filtering that would be causing these issues. For the splice sites is it possible these are underrepresented due to the underlying bait design, e.g. the edges of exons were not baited well? My understanding is most splice-sites occur at the edges of exons, but let me know if this is incorrect.

I think I will go back to the unfiltered data and try to track down what filtration steps are causing this.

Not sure if this will be helpful but according to snpEFF these are the most common variant types in my filtered data:

missense_variant                                                                    776607
synonymous_variant                                                                  412591
downstream_gene_variant                                                              90769
3_prime_UTR_variant                                                                  88956
upstream_gene_variant                                                                74423
5_prime_UTR_variant                                                                  51459
stop_gained                                                                          41103
frameshift_variant                                                                   40737
disruptive_inframe_deletion                                                          31601
non_coding_transcript_exon_variant                                                   28822
intron_variant                                                                       26894
missense_variant&splice_region_variant                                               20407
structural_interaction_variant                                                       19256
intergenic_region                                                                    13047
splice_region_variant&synonymous_variant                                             12135
TF_binding_site_variant                                                              11538
splice_region_variant&intron_variant                                                 10610
conservative_inframe_deletion                                                        10599
sequence_feature                                                                      7741
5_prime_UTR_premature_start_codon_gain_variant                                        6214
disruptive_inframe_insertion                                                          5805
splice_acceptor_variant&splice_region_variant&5_prime_UTR_variant&intron_variant      5112
start_lost                                                                            4573
splice_donor_variant&intron_variant                                                   3935
conservative_inframe_insertion                                                        3660
splice_region_variant&non_coding_transcript_exon_variant                              2243
disruptive_inframe_deletion&splice_region_variant                                     1631
stop_lost                                                                             1505
splice_region_variant                                                                 1264
frameshift_variant&splice_acceptor_variant&splice_region_variant&intron_variant       1216
TFBS_ablation                                                                         1072
frameshift_variant&stop_gained                                                         765
protein_protein_contact                                                                750
frameshift_variant&splice_donor_variant&splice_region_variant&intron_variant           740
stop_retained_variant                                                                  714
splice_acceptor_variant&intron_variant                                                 681
stop_gained&splice_region_variant                                                      650
frameshift_variant&splice_region_variant                                               578
initiator_codon_variant                                                                285
frameshift_variant&stop_lost                                                           220

im3sanger commented 4 years ago

I see. Some depletion is possible at the edges of exons due to bait capture, although coverage in exome sequencing normally extends into the introns. In my experience, there is little depletion of splice sites from experimental reasons in exome sequencing (e.g. TCGA). The depletion in your data is almost complete, suggesting that splice sites may have been considered "non-coding" and excluded from the output.

Given the calling strategy that you described, I would recommend checking for overlap with known population SNPs (gnomAD, 1000 genomes...). With a SNP rate on the order of 1 SNP per 1000 sites (1e-3), even if the assumptions of the binomial test held perfectly (e.g. no overdispersion in read counts or copy number changes), the test with p<0.001 (without multiple testing) may still falsely reject the null in 0.001 of cases, giving you an apparent somatic rate on the order of 1e-6 (higher than the somatic mutation rate in PBMCs).

Apologies if the considerations above do not make sense, as I only have a brief description of the methods. But I would recommend checking overlap with gnomAD and potentially filtering out known SNP sites. Filtering out common SNPs should remove >99% of SNPs, depending on the genetic background of the samples.

njbernstein commented 4 years ago

@im3sanger That makes sense.

I do filter sites against 1000 genomes within Mutect2, providing 1000 genomes as a panel of normals. I use gnomAD as a germline-resource input into Mutect2 which it uses in its calculation for determining how likely the site is to be a somatic mutation. I'll try filtering out sites with any representation in gnomAD to be more explicit though.

The binomial test is just trying to get rid of Mutect2 being overly ambitious with its ability to differentiate somatic and germline variants. When I look at the distribution of allele frequencies across all variants it looked like so:

After filtering using the binomial test:

im3sanger commented 4 years ago

Interesting... Out of curiosity, you may want to run dNdScv on VAF<15% and VAF>20%. The second "peak" with higher VAFs after the binomial test may represent residual SNP contamination. Even filtering out common SNP sites (~99% of the SNPs in an average patient), the SNP density will be ~1e-5 SNPs/bp. With a hypothetical false rejection of the binomial null ~1e-3 and 50,000 whole exomes (3e7 bp * 5e4 samples), one could still expect a considerable number of SNPs leaking through. Those would be rarer SNPs, expected to have higher dN/dS ratios than common SNPs, but still potentially confounding the analyses.

Since the frequency of clonal haem is well understood at different ages, it may be helpful to show that your calls follow the known age incidence.

njbernstein commented 4 years ago

Results: Variants with AF > .2

     name        mle      cilow    cihigh
wmis wmis 0.77463849 0.76690798 0.7824469
wnon wnon 0.54932681 0.53243997 0.5667492
wspl wspl 0.01348636 0.01060938 0.0171435
wtru wtru 0.35113873 0.34056354 0.3620423
wall wall 0.74370912 0.73635669 0.7511350

variants with AF < .15:

wmis wmis 0.5414898 0.537422692 0.545587607
wnon wnon 1.1919382 1.171761926 1.212461848
wspl wspl 0.0059115 0.004667881 0.007486446
wtru wtru 0.6855784 0.674366141 0.696977039
wall wall 0.5534463 0.549350084 0.557573070

pre-filtering AF per variant type:

post-filtering AF per variant type:

All the CHIP and clonal haem results have been as expected thus far in terms of incidence with age and an increased hazard for a variety of indications.

I'm going back to the completely unfiltered results to see if I can track down these issues. Thanks for your suggestions thus far!

njbernstein commented 4 years ago

The solution ended up being removing variants that occurred in too many samples. I was hesitant to over filter these initially as there are known hotspots in clonal hematopoiesis, e.g. occur in 100 samples out of 50,000, but these can be dealt with after the dnds analysis.

im3sanger / dndscv

Help troubleshooting global dn/ds #51