Closed njbernstein closed 4 years ago
Hello,
It is difficult for me to give a good answer based on this information alone. Overall, these ratios look very strange. Are these cancer exomes without a matched normal?
If this is cancer data, the low dN/dS ratios for missense mutations are worrying and could suggest SNP contamination. I would recommend checking whether your mutations overlap known SNP sites (e.g. gnomAD). This, however, does not explain the high dN/dS ratios for nonsense mutations, which I cannot explain. Could you provide more information on this dataset?
The dN/dS ratios for essential splice site mutations are near zero, suggesting that the mutation calls have been filtered out to regions overlapping exonic sequences. I wonder whether the depletion of missense mutations may also be due to some filtering of the data?
The simplest explanation I can think of is some unusual filtering of the variants based on their functional impact and their overlap with coding sequences.
Inigo
Hi there,
Thanks for the quick response. This exome data is from PBMCs of a healthy cohort. FASTQs were aligned to grch38. Variants were called using Mutect2 and the default Mutect2 filters were applied, i.e. only variants with a PASS filter. The only additional filters I have added manually are requiring the alternate allele to be supported by 4 alleles, and the p-value for a binomial test to <.001 with the null being a rate of .5 .
I am using grch38.p12 assembly from https://github.com/im3sanger/dndscv_data/blob/master/data/RefCDS_human_GRCh38.p12.rda for annotation.
Nothing is popping out in terms of filtering that would be causing these issues. For the splice sites is it possible these are underrepresented due to the underlying bait design, e.g. the edges of exons were not baited well? My understanding is most splice-sites occur at the edges of exons, but let me know if this is incorrect.
I think I will go back to the unfiltered data and try to track down what filtration steps are causing this.
Not sure if this will be helpful but according to snpEFF these are the most common variant types in my filtered data:
missense_variant 776607
synonymous_variant 412591
downstream_gene_variant 90769
3_prime_UTR_variant 88956
upstream_gene_variant 74423
5_prime_UTR_variant 51459
stop_gained 41103
frameshift_variant 40737
disruptive_inframe_deletion 31601
non_coding_transcript_exon_variant 28822
intron_variant 26894
missense_variant&splice_region_variant 20407
structural_interaction_variant 19256
intergenic_region 13047
splice_region_variant&synonymous_variant 12135
TF_binding_site_variant 11538
splice_region_variant&intron_variant 10610
conservative_inframe_deletion 10599
sequence_feature 7741
5_prime_UTR_premature_start_codon_gain_variant 6214
disruptive_inframe_insertion 5805
splice_acceptor_variant&splice_region_variant&5_prime_UTR_variant&intron_variant 5112
start_lost 4573
splice_donor_variant&intron_variant 3935
conservative_inframe_insertion 3660
splice_region_variant&non_coding_transcript_exon_variant 2243
disruptive_inframe_deletion&splice_region_variant 1631
stop_lost 1505
splice_region_variant 1264
frameshift_variant&splice_acceptor_variant&splice_region_variant&intron_variant 1216
TFBS_ablation 1072
frameshift_variant&stop_gained 765
protein_protein_contact 750
frameshift_variant&splice_donor_variant&splice_region_variant&intron_variant 740
stop_retained_variant 714
splice_acceptor_variant&intron_variant 681
stop_gained&splice_region_variant 650
frameshift_variant&splice_region_variant 578
initiator_codon_variant 285
frameshift_variant&stop_lost 220
I see. Some depletion is possible at the edges of exons due to bait capture, although coverage in exome sequencing normally extends into the introns. In my experience, there is little depletion of splice sites from experimental reasons in exome sequencing (e.g. TCGA). The depletion in your data is almost complete, suggesting that splice sites may have been considered "non-coding" and excluded from the output.
Given the calling strategy that you described, I would recommend checking for overlap with known population SNPs (gnomAD, 1000 genomes...). With a SNP rate on the order of 1 SNP per 1000 sites (1e-3), even if the assumptions of the binomial test held perfectly (e.g. no overdispersion in read counts or copy number changes), the test with p<0.001 (without multiple testing) may still falsely reject the null in 0.001 of cases, giving you an apparent somatic rate on the order of 1e-6 (higher than the somatic mutation rate in PBMCs).
Apologies if the considerations above do not make sense, as I only have a brief description of the methods. But I would recommend checking overlap with gnomAD and potentially filtering out known SNP sites. Filtering out common SNPs should remove >99% of SNPs, depending on the genetic background of the samples.
@im3sanger That makes sense.
I do filter sites against 1000 genomes within Mutect2, providing 1000 genomes as a panel of normals. I use gnomAD as a germline-resource input into Mutect2 which it uses in its calculation for determining how likely the site is to be a somatic mutation. I'll try filtering out sites with any representation in gnomAD to be more explicit though.
The binomial test is just trying to get rid of Mutect2 being overly ambitious with its ability to differentiate somatic and germline variants. When I look at the distribution of allele frequencies across all variants it looked like so:
After filtering using the binomial test:
Interesting... Out of curiosity, you may want to run dNdScv on VAF<15% and VAF>20%. The second "peak" with higher VAFs after the binomial test may represent residual SNP contamination. Even filtering out common SNP sites (~99% of the SNPs in an average patient), the SNP density will be ~1e-5 SNPs/bp. With a hypothetical false rejection of the binomial null ~1e-3 and 50,000 whole exomes (3e7 bp * 5e4 samples), one could still expect a considerable number of SNPs leaking through. Those would be rarer SNPs, expected to have higher dN/dS ratios than common SNPs, but still potentially confounding the analyses.
Since the frequency of clonal haem is well understood at different ages, it may be helpful to show that your calls follow the known age incidence.
Results: Variants with AF > .2
name mle cilow cihigh
wmis wmis 0.77463849 0.76690798 0.7824469
wnon wnon 0.54932681 0.53243997 0.5667492
wspl wspl 0.01348636 0.01060938 0.0171435
wtru wtru 0.35113873 0.34056354 0.3620423
wall wall 0.74370912 0.73635669 0.7511350
variants with AF < .15:
wmis wmis 0.5414898 0.537422692 0.545587607
wnon wnon 1.1919382 1.171761926 1.212461848
wspl wspl 0.0059115 0.004667881 0.007486446
wtru wtru 0.6855784 0.674366141 0.696977039
wall wall 0.5534463 0.549350084 0.557573070
pre-filtering AF per variant type:
post-filtering AF per variant type:
All the CHIP and clonal haem results have been as expected thus far in terms of incidence with age and an increased hazard for a variety of indications.
I'm going back to the completely unfiltered results to see if I can track down these issues. Thanks for your suggestions thus far!
The solution ended up being removing variants that occurred in too many samples. I was hesitant to over filter these initially as there are known hotspots in clonal hematopoiesis, e.g. occur in 100 samples out of 50,000, but these can be dealt with after the dnds analysis.
Hi there,
I'm working with your package and the results seem to broadly follow what I was expecting, e.g. some known driver genes are close to the top. But the global dn/ds estimate for most types of mutations is <<1.
This is on exome data of about 50,000 samples with relatively low coverage, 30X. I've pretty strictly filtered out variants whose allele fraction appears germline using a binomial test and removed genes with low coverage. So I don't think I have germline contamination.
Anything, in particular, I can look into to troubleshoot this problem, e.g. identify problematic genes?