laurahspencer / DuMOAR

Decide on samples to remove for MBD-BS analysis #17

Closed sr320 closed 1 year ago

sr320 commented 1 year ago

Based on https://github.com/laurahspencer/DuMOAR/issues/15, I think we should remove some samples from the analysis.

sr320 commented 1 year ago

Further, based on https://github.com/laurahspencer/DuMOAR/issues/18

I say we remove samples with low %mCpG, as described and shown here: https://d.pr/v/iIu0N7

laurahspencer commented 1 year ago

Before we fully commit to discarding samples, see Mac's last comment on issue #15. Her option 2 seems plausible, and we might be able to analyze only the loci that are common among all samples (I still need to investigate the data on a per-locus basis).

sr320 commented 1 year ago

I saw that but disagree with the logic 😃. My concerns are 1) how to explain it and, more importantly, 2) what else might be different. This would appear to amount to analyzing WGBS and MBD data together. If you want to see how many common loci have 10x coverage, and how much data would need to be discarded, that would be OK. Then we can decide whether it is better to throw out xx loci versus x samples.

What would the treatment sample size look like if those weird samples were removed?

laurahspencer commented 1 year ago

If we throw out just the worst offenders, we'd have 9 OA samples and 6 ambient samples. If we are more conservative, we'd have 7 OA and 5 ambient.

These figures illustrate how the samples group: they show % methylation ~ number of loci in each sample meeting a minimum of 10x (top) and 15x (bottom) coverage.

image

%meth ~ coverage

A note on what I'm using for sample IDs: here are a few MBDSeq data files, with the sample #'s that I'm using in figures, etc. shown in bold and italics. In these examples they are S2, S10, and S17.

sr320 commented 1 year ago

Thanks, this is useful. I say we need to drop 5-7 samples.

laurahspencer commented 1 year ago

To do: make correlation plots in methylKit after filtering for 10x coverage across ALL samples (n = ~750 common loci).

laurahspencer commented 1 year ago

To see if the "weird" samples still are weird if we only look at commonly sequenced loci, I filtered for loci that have 10x coverage across all 20 samples, resulting in 745 loci (not many). Here is a PCA using those 745 loci:

image

And here is a link to a pairwise correlation plot among samples (@mgavery): pairwise-correlation-all10x.pdf

Both figures suggest that those weird samples (5, 6, 9, 13, 14, 16, 19) are still weird.
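For anyone following along, the "10x coverage across all samples" filter described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the methylKit code used for the figures; the function name `loci_covered_in_all`, the data layout, and the toy coverage values are all assumptions made up for the example.

```python
from typing import Dict, Set, Tuple

# A locus is identified here by (contig, position); this layout is an assumption.
Locus = Tuple[str, int]


def loci_covered_in_all(
    coverage: Dict[str, Dict[Locus, int]], min_cov: int = 10
) -> Set[Locus]:
    """Return loci with read depth >= min_cov in every sample."""
    # For each sample, collect the loci that pass the coverage threshold.
    per_sample = [
        {locus for locus, depth in loci.items() if depth >= min_cov}
        for loci in coverage.values()
    ]
    if not per_sample:
        return set()
    # A locus is kept only if it passes in all samples (set intersection).
    return set.intersection(*per_sample)


# Toy, made-up coverage values for three hypothetical samples.
toy_coverage = {
    "S2": {("contig1", 100): 12, ("contig1", 250): 30, ("contig2", 40): 9},
    "S10": {("contig1", 100): 15, ("contig1", 250): 11, ("contig2", 40): 22},
    "S17": {("contig1", 100): 10, ("contig1", 250): 4},
}

common = loci_covered_in_all(toy_coverage, min_cov=10)
print(sorted(common))
```

Because the result is an intersection, a single low-coverage sample can shrink the shared set sharply, which is consistent with only 745 loci surviving across all 20 samples.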

mgavery commented 1 year ago

Thanks, Laura. Sample 19 looks kind of weirdo too when I look at the correlation plot. No second bump of high meth in the histogram and low correlation with other "good" samples.
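As a rough illustration of what the pairwise correlation plot summarizes, here is a minimal Python sketch that computes Pearson correlations of per-locus % methylation between every pair of samples. It assumes each sample has values at the same shared loci in the same order; the sample names and numbers are invented, and this is not the methylKit routine that produced the PDF.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(
    meth: Dict[str, List[float]]
) -> Dict[Tuple[str, str], float]:
    """Correlate % methylation at shared loci for every pair of samples."""
    return {
        (a, b): pearson(meth[a], meth[b])
        for a, b in combinations(sorted(meth), 2)
    }


# Toy, made-up % methylation at four shared loci (hypothetical samples).
toy_meth = {
    "sample_A": [90.0, 10.0, 80.0, 20.0],
    "sample_B": [70.0, 30.0, 65.0, 35.0],  # tracks sample_A closely
    "sample_C": [50.0, 55.0, 45.0, 60.0],  # does not track sample_A
}

correlations = pairwise_correlations(toy_meth)
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

A sample whose correlations with the "good" samples are consistently low, like the toy `sample_C` here, is the kind of pattern being described for sample 19.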

sr320 commented 1 year ago

So we remove 7 (9,5,13,14,6,16,19)?

laurahspencer commented 1 year ago

Yep, I will not include those 7 samples in the analyses. That leaves us with 5 samples from control crabs and 7 samples from high-pCO2 crabs.

laurahspencer commented 1 year ago

I'm writing up our sample removal justification and running some methylKit analyses, and sample 8 is also a bit suspect: it has pretty low % CpG methylation, too (54% across all sequenced loci), and it is an outlier in the PCA constructed from loci meeting 10x coverage across all "good" samples. If we also removed sample 8, all the remaining samples would have a minimum % CpG methylation of 66% across all sequenced loci (range 66%-75%), and we'd have 6 control & 6 high-pCO2 samples.

image
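The per-sample % CpG methylation screen discussed above could be sketched as follows. This assumes the percentage is pooled across loci (total methylated reads divided by total reads), which may differ from how the 54% and 66%-75% figures were actually calculated; the 60% cutoff, function names, and read counts are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Each locus is represented as (methylated_reads, unmethylated_reads).
Counts = List[Tuple[int, int]]


def percent_cpg_methylation(counts: Counts) -> float:
    """Pooled % methylation: methylated reads / all reads, across all loci."""
    methylated = sum(m for m, _ in counts)
    total = sum(m + u for m, u in counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return 100.0 * methylated / total


def flag_low_methylation(samples: Dict[str, Counts], cutoff: float) -> List[str]:
    """Return names of samples whose pooled % methylation is below cutoff."""
    return sorted(
        name
        for name, counts in samples.items()
        if percent_cpg_methylation(counts) < cutoff
    )


# Toy, made-up read counts for three hypothetical samples.
toy_samples = {
    "sample_A": [(8, 2), (7, 3), (6, 4)],
    "sample_B": [(5, 5), (6, 4), (5, 5)],
    "sample_C": [(9, 1), (6, 4), (6, 4)],
}

HYPOTHETICAL_CUTOFF = 60.0  # assumption for illustration, not a value from this thread
flagged = flag_low_methylation(toy_samples, HYPOTHETICAL_CUTOFF)
print(flagged)
```

Writing the cutoff down explicitly like this makes the removal rule easy to state in the methods, instead of relying on visual inspection of the plots alone.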