Closed Ge0rges closed 8 months ago
Hello,
Hello @Ge0rges,
Sorry about the delay. Thanks for bringing an error in the documentation to my attention. `theta_a`, `theta_b`, and `theta_a+b` are all calculated the same way: each is the posterior distribution given the data X under a Jeffreys prior. Calling these parameters the MAP is inaccurate, so I've edited the documentation on the scoring details. As for the score, I've excerpted the parts of the code that perform the calculation for the 2-way (methylated/canonical) case below. For multiple modifications, the calculation is essentially the same.
```rust
// control_methyls and exp_methyls are the counts of methylated bases
// in the first sample (called the control in the code) and the second sample
// (called the experiment in the code), respectively.
// Similarly, control_canonicals and exp_canonicals are the counts of
// canonical (unmodified) bases in each sample. `beta_llk` calculates the
// log marginal likelihood of the data under the posterior distribution.
let llk_control = beta_llk(control_methyls, control_canonicals);
let llk_exp = beta_llk(exp_methyls, exp_canonicals);
let llk_same = beta_llk(
    exp_methyls + control_methyls,
    exp_canonicals + control_canonicals,
);

// llk_control, llk_exp, and llk_same are the log marginal likelihoods of the data.
// The score is calculated in log-space; note that the actual code (linked below)
// just returns this value.
let score = llk_control + llk_exp - llk_same;
```
You can find this code here, and the rest of the DMR calculation code in that source file.
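For readers who want to poke at the score outside of Rust, here is a minimal Python sketch of the same calculation. The names `beta_llk` and `dmr_score` mirror the excerpt above but are illustrative, not modkit's API; the sketch assumes a Jeffreys Beta(0.5, 0.5) prior and omits the binomial coefficients for simplicity.

```python
from scipy.special import betaln

def beta_llk(methyls, canonicals, a=0.5, b=0.5):
    # Log marginal likelihood of binomial counts under a Beta(a, b) prior;
    # a = b = 0.5 is the Jeffreys prior. The binomial coefficient is
    # omitted in this sketch.
    return betaln(methyls + a, canonicals + b) - betaln(a, b)

def dmr_score(control_methyls, control_canonicals, exp_methyls, exp_canonicals):
    # Essentially a log Bayes factor (up to combinatorial constants)
    # comparing "two different methylation rates" against "one shared rate":
    # positive favors a difference, negative favors the pooled model.
    llk_control = beta_llk(control_methyls, control_canonicals)
    llk_exp = beta_llk(exp_methyls, exp_canonicals)
    llk_same = beta_llk(exp_methyls + control_methyls,
                        exp_canonicals + control_canonicals)
    return llk_control + llk_exp - llk_same
```

Note that under this formulation the score can go negative: when the two samples look alike, the pooled ("same") model explains the data better and `llk_same` dominates.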
Thank you. I was also wondering: have you considered implementing an exact test, such as Fisher's, for cases where coverage is low?
For reference, in case others stumble upon this, I am currently determining statistical significance by applying a chi-square test after a coverage filter (thanks to the new dmr flag). The code is below (Python):
```python
from scipy.stats import chi2

def get_statistically_significant_scores(df, alpha=0.05):
    """
    Identify statistically significant scores in a DataFrame using a chi-square test.

    Parameters:
        df (pandas.DataFrame): DataFrame containing a column 'score' with modkit scores.
        alpha (float): Significance level for the statistical test (default is 0.05).

    Returns:
        pandas.DataFrame: The significant rows of `df`, with an additional
        boolean column 'is_significant'.
    """
    # Calculate test statistics
    test_statistics = 2 * df["score"]
    df_chi2 = 2  # Degrees of freedom for the chi-square distribution

    # Adjust significance level for multiple tests (Bonferroni correction)
    num_tests = len(df)
    adjusted_significance_level = alpha / num_tests

    # Critical value from the chi-square distribution
    critical_value = chi2.ppf(1 - adjusted_significance_level, df_chi2)

    # Determine significant results and update the DataFrame
    df['is_significant'] = test_statistics > critical_value
    return df[df['is_significant']]
```
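The thresholding above is equivalent to computing a p-value directly from each score: under the snippet's assumption that 2 × score follows a chi-square distribution with 2 degrees of freedom, the survival function gives the p-value. A small self-contained check (the score value is made up):

```python
import math
from scipy.stats import chi2

score = 12.0  # a hypothetical modkit dmr score
p_value = chi2.sf(2 * score, df=2)

# With df = 2 the chi-square distribution is exponential with mean 2,
# so the survival function has the closed form exp(-x / 2).
assert math.isclose(p_value, math.exp(-12.0))
```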
I appreciate that you're trying to find a reliable decision function, and that the scoring method is probably not overly helpful in that sense.
Yes, I considered Fisher's and chi-squared. I didn't add these to mainline because I wanted a metric that is stable to changes in region size (from thousands of CpGs down to a single CpG) and coverage. In my experiments, this scoring scheme seemed to work well across a variety of use cases while also giving good resolution (e.g. not putting everything into 2 bins).
Of course, you can still calculate another test statistic after parsing the output table because all of the counts are in there. Now that the "per-base" option is available, maybe I'll reconsider adding these contingency tests with suitable warnings around coverage, etc.
I think the scoring is useful! I've just been trying to complement it with methods to get statistical significance.
I'm curious how you've seen chi-squared perform as a function of region size and coverage. I've been thinking about that as well, hoping that the region size (in my case, genes) is on average the same, and that applying a coverage filter gives me at least some confidence there.
> I think the scoring is useful!

As always, glad to hear it.

> I've just been trying to complement it with methods to get statistical significance.

If you're willing, I'd be interested in hearing what methods you end up using and the use case.

> I'm curious how you've seen chi-squared perform as a function of region size and coverage.

I'll scrounge up some examples showing the difference. It would be good to have in the docs or a blog post.
I'm currently experimenting with chi-squared to determine the set of genes that are differentially methylated in a statistically significant way, and I'm working on seeing how the type I error rate and test power fluctuate using a simulated binomial. At the moment it seems like low coverage is going to push me away from chi-squared and towards an exact test (my coverage is ~5, which is quite low). I'm trying to figure out how best to implement that with the modkit output, however, since an exact test requires the two groups, and getting that data back out of `dmr` is not immediately apparent to me; I only recently started thinking about this.
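As a sketch of the exact-test direction: if the per-site methylated/canonical counts for each sample can be recovered from the `dmr` output, Fisher's exact test is a one-liner with scipy. The counts below are invented for illustration, not from any real output:

```python
from scipy.stats import fisher_exact

# Hypothetical per-site counts: rows are samples,
# columns are (methylated, canonical) counts.
table = [[18, 2],   # sample A: 18 methylated, 2 canonical
         [4, 16]]   # sample B: 4 methylated, 16 canonical
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```

Unlike the chi-square approximation, the exact test stays valid at low coverage, at the cost of running one test per site (or per pooled region) rather than working from the score alone.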
I've got a couple ideas to try, let me do some experiments on my side and circle back.
> I'm trying to figure out how best to implement that with the modkit output, however, since an exact test requires the two groups, and getting that data back out of `dmr` is not immediately apparent to me

What do you need exactly? You should be able to get the two groups out of the `dmr` output?
You're right, I guess it should be sufficient to use the set of `samplea_fractions` and `sampleb_fractions`.
Would it be correct to say that for a given modification type subtracting one fraction from the other would give the methylation difference?
> Would it be correct to say that for a given modification type subtracting one fraction from the other would give the methylation difference?
Yes, as long as you're comparing only one methylation state at that base. For example, if your input data has only 5mC at cytosine bases and 6mA at adenine bases, then subtracting the two values will give you the difference in methylation between the samples. If your data has more than one methylation state, for example 5hmC and 5mC at cytosine bases, then, as you probably guessed, subtracting will give you the difference for that modification only. To get the difference in overall methylation levels in the latter case, sum the fractions before taking the difference.
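To make the two cases concrete, a toy calculation (the fraction values are invented, not from any real output):

```python
# Hypothetical per-site fractions for two samples at a cytosine
# that carries both 5mC and 5hmC calls.
a_5mC, a_5hmC = 0.60, 0.10   # sample A
b_5mC, b_5hmC = 0.30, 0.05   # sample B

# Per-modification difference: valid for that modification only.
diff_5mC = a_5mC - b_5mC                          # 0.30

# Overall methylation difference: sum the states first, then subtract.
diff_total = (a_5mC + a_5hmC) - (b_5mC + b_5hmC)  # 0.35
```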
Hey @ArtRand, I'm circling back to this and was wondering if you've found any better statistical approaches to picking out the relevant (confident) set of DMR scores? For now I'm tempted to stick with my previously outlined method.
Hey @Ge0rges,
Thanks for checking in. I haven't come up with anything I'm happy with yet, but I'm actively working on it.
Hello @Ge0rges,
The latest version of modkit (v0.2.5) has a MAP-based p-value; feel free to give it a try. It should run a lot faster, too.
Hi Arthur,
I was talking to some statistics PhDs about different methods I could use on my end to filter out statistically significant data from the `dmr` scores. One of them wrote this comment that I wanted to pass by you:

> In reference to the `dmr` score, why is it that you can have negative scores? Is this because you mix Bayesian MAP estimates and MLE?

Hope this is relevant (and perhaps even helpful). Thanks!