Figure out statistical method to compare CpGs in various categories

yaaminiv commented 4 years ago

Chi-squared with some sort of post-hoc test...?

@shellytrigg which lines of your geoduck code should I look at

shellywanamaker commented 4 years ago

@yaaminiv lines 483-769 of this code: https://github.com/hputnam/Geoduck_Meth/blob/master/RAnalysis/Scripts/DMR_stats_and_anno.Rmd. I realize in the slack message it wasn't easy to see the lines reference at the end of the link, sorry about that!

yaaminiv commented 4 years ago

I was able to translate your code to look at sig diff between CpG types, so I think I have my method. Thanks again @shellytrigg (also broom::tidy is amazing)!

Screen Shot 2020-05-12 at 3 52 57 PM

yaaminiv commented 4 years ago

Pairwise chi-sq tests use count data, but we really want to compare proportions between genome/methods. Need to think of new method

shellywanamaker commented 4 years ago

@yaaminiv can you post a link to your code?

yaaminiv commented 4 years ago

@shellytrigg https://github.com/hputnam/Meth_Compare/blob/master/scripts/Characterizing-CpG-Methylation-5x-Union-Summary-Plots.Rmd

shellywanamaker commented 4 years ago

found this info about chisq test : "Because of how the Chi-Square value is calculated, it is extremely sensitive to sample size – when the sample size is too large (~500), almost any small difference will appear statistically significant. " Based on that it makes sense that everything is significant because we are using the number of CpGs as input.

@yaaminiv What about using the percent data in the chisq test so that counts are converted to proportions out of 100? e.g. if you want to compare 17% strongly methylated CpGs to 8% strongly methylated CpGs you would input matrix(c(17,83,8,92), ncol =2) into chisq.test. That way, we'd be testing the difference in the proportions that we are looking at in the plots themselves. Does that make sense?

In looking over things I also noticed a discrepancy in the CpG proportions plots where WGBS shows the largest proportion of strongly methylated CpGs, but the CpG type tables (e.g. https://github.com/hputnam/Meth_Compare/blob/master/analyses/Characterizing-CpG-Methylation-5x-Union/Mcap/Mcap_union-CpG-Type.txt) show MBD as having the largest proportion of strongly methylated CpGs based on your jupyter notebook. I'm not sure I'm interpretting this correctly, but may be worth a second look.

yaaminiv commented 4 years ago

I'm considering using an ANOSIM or perMANOVA to test for differences in methylation status (high, moderate, low) or CpG genomic location (CDS, intron, upstream flank, downstream flank, intergenic) between methods. I think we'd have more statistical power if I did this using information from individual samples instead of union bedgraphs. That way, we'd also have the ability to look at between-sample variation if we were interested.

@shellytrigg @hputnam @mgavery thoughts?

yaaminiv commented 4 years ago

In looking over things I also noticed a discrepancy in the CpG proportions plots where WGBS shows the largest proportion of strongly methylated CpGs, but the CpG type tables (e.g. https://github.com/hputnam/Meth_Compare/blob/master/analyses/Characterizing-CpG-Methylation-5x-Union/Mcap/Mcap_union-CpG-Type.txt) show MBD as having the largest proportion of strongly methylated CpGs based on your jupyter notebook. I'm not sure I'm interpretting this correctly, but may be worth a second look.

Oh snap thanks for pointing that out! My fault for not labelling the row names and then not selecting the correct rows for making the barplots. Fixing that now!

hputnam commented 4 years ago

ordinal logistic regression Ho: there is equal probability of being high mod or low across method and species

e.g., https://github.com/njsilbiger/UnproReviewsInSTEM/blob/master/UnproReviews_scripts.md#logisitic-regression-on-probability-of-recieving-an-unpro-review

For CpG by feature multi-nominal logistic regression

yaaminiv commented 4 years ago

Received a suggestion to use both a multivariate and generalized linear model approach with individual sample-level data, with the caveat that the multivariate method may have the most power.

GLM approach:

I have a feeling that you will be a bit underpowered in the linear models but I think you can run three models (one for each methylation status) and use a p-value correction if you think it applies here to determine which of the methods are different from each other within methylation status. I’d caution against trying to run an ideal full model (including all statuses and methods) since that would then violate independence assumptions (since they are all proportions and add to 1). In these models, since you are interested using the percent/proportion as your response value, you can use a glm with a beta error distribution (which runs on proportions).

Multivariate approach:

As for the multivariate methods, I feel like compositional multivariate methods may work well here. This frame work is what I have been most recently using for bacterial metagenomic datasets that I’ve been working with (see https://www.sciencedirect.com/science/article/pii/S0048969720325985 for an example of where I’ve used it). I think this would be ideal for your questions since you are interested in the proportions (ie the composition) of CpG loci that fall into the different methylation categories and if that varies by sequencing method and not let it be influenced by the external factors that influence CpG reads. You transform the data using a centered log-ratio transform and then calculate the Euclidean distance matrix of the transformed data and then proceed normally. These actions leave you in the ‘Aitchison’ framework for multivariate compositional analysis.

yaaminiv commented 4 years ago

Tried multivariate methods with centered log-ratio transformation in this script. Running into errors that suggest I have insufficient data when conducting pairwise ANOSIMs between methods, but I was able to conduct a global ANOSIM without any problem. I am asking the person that suggested this method what they recommend.

yaaminiv commented 4 years ago

Revised statistical analysis in this script. For each dataset (CpG methylation status or CpG genomic location) I:

Performed PCoA and global perMANOVA to understand differences based on sequencing method. Conducted follow-up pairwise perMANOVAs as well.
Used GLM to look at method differences for individual methylation statuses or genomic locations (we do not have the power to run a full model for either of those categories). Corrected p-values for statistical output.

Questions:

I think Meth 8 may be an outlier. I can test and confirm, but do we want to remove outliers?
The models are set up to compare RRBS and MBD-BS to WGBS. I assume we also want to compare RRBS and MBD-BS?

yaaminiv commented 4 years ago

Finalized statistical analysis in this script. For each dataset (CpG methylation status or CpG genomic location) I:

Performed PCoA and global perMANOVA to understand differences based on sequencing method. Conducted follow-up pairwise perMANOVAs as well.
Used GLM to look at method differences for individual methylation statuses or genomic locations (we do not have the power to run a full model for either of those categories). Corrected p-values for statistical output.
Ran post-hoc pairwise estimated marginal means tests on GLM output to understand differences between all three methods

I still think Meth 8 may be an outlier, but I'm not sure what to do with that.

yaaminiv commented 4 years ago

Revised methods and results with statistical approach and output

yaaminiv commented 3 years ago

@DrK-Lo Here's our previous thread where we discussed which analytical method to use when comparing methylation categories and overlaps with genome features between the three methods.

You asked why we didn't use some form of contingency test. Based on my understanding of chi-squared tests, they are sensitive to to sample size so any large sample size could yield statistically significant differences, and I can't use them with proportion data (which is really want we want to test). That's why I ended up going with the PCoAs/PERMANOVAs/glms in the final script (found here).

Do you have any suggestions for a better statistical test? Did I misinterpret the limits of contingency tests?

hputnam / Meth_Compare

Figure out statistical method to compare CpGs in various categories #68