hputnam / Meth_Compare

5 stars 2 forks source link

histograms of percent methylation for CpGs that overlap between methods #98

Closed shellywanamaker closed 3 years ago

shellywanamaker commented 4 years ago

These plots aim to address Katie's comments about the bias in MBD data and look at percent methylation of CpGs uniquely and commonly identified across methods.

I made the following plots using the 5xCpG union bedgraph for P. acuta (https://gannet.fish.washington.edu/metacarcinus/FROGER_meth_compare/20200424/10-unionbedg/Pact_union_5x.bedgraph) using this Rmarkdown file (https://github.com/hputnam/Meth_Compare/blob/master/code/5xUnionBedCpGs_meth_hist.Rmd).

The colors are based off the Upset plot color scheme (red = 1 method, blue = 2 methods, and black = all three methods), and similarly to the upset plot each CpG locus is only counted once (e.g. if a CpG is in the MBD, RRBS group then it is not the MBD group or the RRBS group).

Here's the histograms:

And here is the same data as stacked bar plots:

shellywanamaker commented 4 years ago

Here are some alternatives with the common loci split out by method and colored by method

shellywanamaker commented 4 years ago

updated with median methylation for all loci. Now the number of loci in each facet match the set sizes in the upset plot.

Requiring all three samples within a method to cover the loci. The above plot only requires one sample to cover the loci.

mgavery commented 4 years ago

I really like these!!

On Friday we talked about using these plots to answer the question: Are there any methylated CpGs that are missing from the MBDBS dataset? in the theoretical plot that Katie drew, her concern were the loci that were being called >20% methylated by WGBS but not analyzed in MBDBS.

we thought we could check this by looking at WGBS only (or maybe WGBS only and WGBS/RRBS?) and seeing if there are any "methylated" sites (maybe this is greater than 20 or 25% methylation?). Is it possible to either zoom in on those areas of the stacked bar or extract the number of CpGs from those regions of the stacked bar?

shellywanamaker commented 4 years ago

Updated versions with zoomed in plots for RRBS and WGBS stacked bars (B and C).

Median methylation is shown for loci covered by any sample(s) within a method. This means if only one sample in a method covers the loci, the % methylation from only that sample is plotted, where as if two samples or all three samples in a method cover a loci the median % methylation is calculated.

Median methylation is shown for loci covered by all three samples within a method.

code here: 5xUnionBedCpGs_meth_hist_v2.Rmd

mgavery commented 4 years ago

Yes! That's awesome. So there are certainly some methylated sites that are not captured by MBD, but not a whole lot. <2%

shellywanamaker commented 4 years ago

Correct, I edited the plots to show the number of loci.

DrK-Lo commented 4 years ago

Can you plot (MBD total # reads) vs. methylation level from WGBS? This would more directly get at my question. Analyses like Grove Dixons that don't do the BS conversion on the MBD effectively assume that the read depth of the MBD data is proportional to methylation (if I understand their analysis correctly). I'd like to know how safe this assumption is.

DrK-Lo commented 4 years ago

When we are doing MBD-BSseq, I worry that if one set of samples in one treatment is methylated and samples from the other treatment is unmethylated, that this site would be filtered out before we even do the analysis because half of the samples would be (in theory) missing reads because they were not captured with MBD. Therefore, I'd like to understand if there are no reads at a CpG after MBD enrichment, how safe it would be to assume it is unmethylated. The plot of (MBD total # reads) vs. (methylation level from WGBS) would also get at this.

DrK-Lo commented 4 years ago

In other words, its not only about whether the methylated sites are captured, it also about how often the unmethylated sites are not captured.

shellywanamaker commented 4 years ago

@DrK-Lo that makes sense. I am working on plotting MBD # reads (median) x WGBS % methylation (median) for all loci covered by WGBS without any read thresholding. I'm attempting a heatmap using ggtile with binned axes and the number of loci shown as the heat. I will post once as soon as I get the plot generated.

shellywanamaker commented 4 years ago

code here: cov_x_meth_Pact.Rmd

shellywanamaker commented 4 years ago

Here are some plots with different read thresholds for the WGBS data:

DrK-Lo commented 4 years ago

That is great. A few interesting things are going on here, which may also tie into the other thread. (Welcome thoughts on these conclusions). 1) Quite often unmeythlated sites are covered by MBD (left side of graph). (2) Number of MBD reads is not a good approximation for methylation (by lack of a more 1:1 relationaship in the heatmap) - you all already know this because you already use the BS treatment, but might be good to mention again in the discussion since Dixon's work is highly cited and used this approach.

DrK-Lo commented 4 years ago

(3) Does this graph tell us that we can assume that if a site is not covered by MBD that it is unmethylated or lowly methylated? (Back to my concern about bias of how to filter data if a site is methylated in one treatment and not in the other). I think so. For sites that are not covered by MBD (e.g. less than 5 reads), for the most part these are lowly methylated sites. (Would be interested on getting other's thoughts on whether they would approach data filtering differently, given these results? I think I would. Would be happy to write a paragraph for discussion if you want.)

DrK-Lo commented 4 years ago

4) Does it surprise anyone that as the WGBS filter gets more stringent, fewer sites have high % methylation? I'm trying to wrap my head around this.

shellywanamaker commented 3 years ago

@DrK-Lo

1. Quite often unmeythlated sites are covered by MBD (left side of graph) <
  • Yes:
  • for WGBS_1x loci, MBD covered 52.9% of [0-10%] methylated WGBS loci
  • for WGBS_5x loci, MBD covered 60.5% of [0-10%] methylated WGBS loci
  • for WGBS_10x loci, MBD covered 73.4% of [0-10%] methylated WGBS loci
  • for WGBS_20x-100x loci, MBD covered > 90% of [0-10%] methylated WGBS loci

    1. Number of MBD reads is not a good approximation for methylation (by lack of a more 1:1 relationaship in the heatmap) <
  • Totally agree and one explanation is PCR bias for particular genomic regions that would skew the quantitative ability of MBD reads to correlate with read depth

    1. Does this graph tell us that we can assume that if a site is not covered by MBD that it is unmethylated or lowly methylated? <
  • I think this is generally the case.
  • For re-considering the filtering are you suggesting to use a looser filter because at 5x we are limiting detection of the lowly methylated MBD-captured loci?

    1. Does it surprise anyone that as the WGBS filter gets more stringent, fewer sites have high % methylation? In general there seems to be lower coverage of high % methylation loci across all methods. NOTE: these loci include all those detected by a method and are not restricted to those overlapping between methods.

Although I'm not sure why it's more extreme in WGBS, could it be because the other methods are both indeed enriching for methylated DNA more than WGBS?

shellywanamaker commented 3 years ago

we expect the majority of the genome to have low methylation and a small portion to have more coverage because of copy number variation and/or PCR bias. The heat trend still in WGBS is still similar to MBD AND RRBS, but at loci =< 100x coverage.

DrK-Lo commented 3 years ago

thanks @shellytrigg

For re-considering the filtering are you suggesting to use a looser filter because at 5x we are limiting detection of the lowly methylated MBD-captured loci?

(Keep in mind the following is a proposal up for discussion:)

For a DML analysis with MBD data: Let's say we have 5 individuals from Treatment A (methylated at CpG) and Treatment B (unmethylated at CpG). Instead of filtering by 15x coverage across all 10 individuals, I would filter by 15x coverage for individuals within any one treatment. Then for individuals with low read count (less than 5 reads), I would assume they are lowly methylated. (Could debate on whether to assign a numerical value or category)

For correlating gene expression with methylation (also MBD data): Again, I worry about bias introduced by MBD and filtering by some threshold. Let's say a gene has 100 CpGs, but only 10 of them have a 100% methylation level near the start codon and are captured by MBD, resulting in >10x coverage in the data. The remaining 90 CpGs are unmethylated and not captured by MBD, resulting in 0x coverage in the data. If we filter data by those that have 10x coverage, we would estimate the methylation level for this gene as 100%. However if we had reads for every CpG from WGBS, we would correctly estimate the methylation level of this gene to be very low (100%10Cpgs + 0%90Cpgs)/100Cpgs = 10%. So again, I would propose that while MBD read depth is not a good estimate of methylation level, lack of MBD reads is a good estimate of unmethylated sites (at least based on the above graphs). Happy to chat more!

hputnam commented 3 years ago

@DrK-Lo This also brings up the point that to test for DMGs it would be interesting to have an additional metric that is accounting for methylation level and proportion of CpGs covered... Or to generate another cutoff for DMG analysis that requires a minimum for proportion of CpGs with data within a gene