nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Question about `dmr` score #93

Closed Ge0rges closed 8 months ago

Ge0rges commented 11 months ago

Hi Arthur,

I was talking to some statistics PhDs about different methods I could use on my end to identify statistically significant results from the dmr scores. One of them wrote this comment, which I wanted to pass by you:

One note we have is that more complex models (i.e. the alternative model) cannot have a smaller likelihood than less complex models (i.e. the null model) when evaluated at the maximum likelihood. This is because the parameter space for the null model is a subset of the parameter space for the alternative model, so the maximizer for the null model must also be an option for the alternative model. Hence, the score as defined by modkit must be positive. However, we see a negative score value in the documentation, which raises questions for us about whether there is a coding error or a piece of information missing from the documentation.

In reference to the dmr score, why is it that you can have negative scores? Is this because you mix Bayesian MAP estimates and MLE? Hope this is relevant (and perhaps even helpful). Thanks!
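In symbols, the argument above: for nested parameter spaces $\Theta_0 \subseteq \Theta$,

$$\max_{\theta \in \Theta_0} L(\theta \mid X) \;\le\; \max_{\theta \in \Theta} L(\theta \mid X),$$

so a score defined as a difference of maximized log-likelihoods could never be negative.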

ArtRand commented 11 months ago

Hello @Ge0rges,

Sorry about the delay, and thanks for bringing an error in the documentation to my attention. theta_a, theta_b, and theta_a+b are all calculated the same way: each is the posterior distribution given the data X under a Jeffreys prior. Calling these parameters the MAP is inaccurate, so I've edited the documentation on the scoring details. As for the score, I've excerpted the parts of the code that perform the calculation for the two-way (methylated/canonical) case below. For multiple modifications, the calculation is essentially the same.

// control_methyls and exp_methyls are the counts of methylated bases
// in the first sample (called the control in the code) and the second sample
// (called the experiment in the code), respectively.
// Similarly, control_canonicals and exp_canonicals are the counts of
// canonical, unmodified, bases, respectively. `beta_llk` calculates the log
// marginal likelihood of the data under the posterior distribution.
let llk_control = beta_llk(control_methyls, control_canonicals);
let llk_exp = beta_llk(exp_methyls, exp_canonicals);
let llk_same = beta_llk(
    exp_methyls + control_methyls,
    exp_canonicals + control_canonicals,
);
// llk_control, llk_exp, and llk_same are the log marginal likelihoods of the
// data. The score is calculated in log-space; note that the actual code
// (linked below) just returns this value.
let score = llk_control + llk_exp - llk_same;

You can find this code here, and the rest of the DMR calculation code in that source file.
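For anyone who wants to reproduce the arithmetic outside of Rust, here is a minimal Python sketch. It assumes beta_llk is the Beta-Binomial log marginal likelihood of the counts under a Jeffreys Beta(1/2, 1/2) prior, which matches the description above; check the linked source for the exact form modkit uses.

from scipy.special import betaln

def beta_llk(methyls, canonicals):
    # Log marginal likelihood of the observed counts under a
    # Beta-Binomial model with a Jeffreys Beta(0.5, 0.5) prior.
    return betaln(methyls + 0.5, canonicals + 0.5) - betaln(0.5, 0.5)

def dmr_score(control_methyls, control_canonicals, exp_methyls, exp_canonicals):
    llk_control = beta_llk(control_methyls, control_canonicals)
    llk_exp = beta_llk(exp_methyls, exp_canonicals)
    llk_same = beta_llk(
        exp_methyls + control_methyls, exp_canonicals + control_canonicals
    )
    return llk_control + llk_exp - llk_same

print(dmr_score(5, 95, 6, 94))   # similar samples: negative (about -2)
print(dmr_score(5, 95, 60, 40))  # very different samples: strongly positive

Because the score compares marginal (integrated) likelihoods rather than maximized likelihoods, the nesting argument does not apply: a negative value simply means the data favor the pooled, "same" model.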

Ge0rges commented 11 months ago

Thank you. I was also wondering if you've considered implementing an exact test, such as Fisher's, for cases where coverage is low?

For reference, in case others stumble upon this: I am currently determining statistical significance by applying a chi-square test after applying a coverage filter, which the new dmr flag makes possible. The code is below (Python):

from scipy.stats import chi2

def get_statistically_significant_scores(df, alpha=0.05):
    """
    Identify statistically significant scores in a DataFrame using a chi-square test.

    Parameters:
    df (pandas.DataFrame): DataFrame containing a column 'score' with Modkit scores.
    alpha (float): Significance level for the statistical test (default is 0.05).

    Returns:
    pandas.DataFrame: The significant rows, with an additional column 'is_significant'.
    """
    # Treat 2 * score as a likelihood-ratio-style test statistic
    test_statistics = 2 * df["score"]
    df_chi2 = 2  # Degrees of freedom for the chi-square distribution

    # Adjust significance level for multiple tests (Bonferroni correction)
    num_tests = len(df)
    adjusted_significance_level = alpha / num_tests

    # Critical value from the chi-square distribution
    critical_value = chi2.ppf(1 - adjusted_significance_level, df_chi2)

    # Flag significant results and return only those rows
    df['is_significant'] = test_statistics > critical_value

    return df[df['is_significant']]
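For the exact test mentioned above, here is a minimal sketch using scipy.stats.fisher_exact. It assumes you have parsed the per-sample methylated/canonical counts out of the dmr output; the argument names here are hypothetical placeholders, not modkit column names.

from scipy.stats import fisher_exact

def fisher_pvalue(a_methyls, a_canonicals, b_methyls, b_canonicals):
    # 2x2 contingency table: rows are the two samples,
    # columns are methylated vs. canonical counts.
    table = [[a_methyls, a_canonicals], [b_methyls, b_canonicals]]
    _, p_value = fisher_exact(table, alternative="two-sided")
    return p_value

print(fisher_pvalue(4, 1, 1, 4))  # exact p-value, valid even at ~5x coverage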
ArtRand commented 11 months ago

I appreciate that you're trying to find a reliable decision function, and the scoring method on its own probably isn't all that helpful in that sense.

Yes, I considered Fisher's and chi-squared. I didn't add these to the mainline because I wanted a metric that is stable across changes in region size (from thousands of CpGs down to a single CpG) and coverage. In my experiments, this scoring scheme seemed to work well across a variety of use cases while also giving good resolution (e.g. not putting everything into 2 bins).

Of course, you can still calculate another test statistic after parsing the output table because all of the counts are in there. Now that the "per-base" option is available, maybe I'll reconsider adding these contingency tests with suitable warnings around coverage, etc.

Ge0rges commented 11 months ago

I think the scoring is useful! I've just been trying to complement it with methods to get statistical significance.

I'm curious how you've seen chi-squared perform as a function of region size and coverage. I've been thinking about that as well; I'm hoping that because the region sizes (in my case, genes) are on average the same, applying a coverage filter gives me at least some confidence there.

ArtRand commented 11 months ago

I think the scoring is useful!

As always, glad to hear it.

I've just been trying to complement it with methods to get statistical significance.

If you're willing, I'd be interested in hearing what methods you end up using and the use case.

I'm curious how you've seen chi-squared perform as a function of region size and coverage.

I'll scrounge up some examples showing the difference. It would be good to have on the docs or a blog.

Ge0rges commented 11 months ago

I'm currently experimenting with chi-squared to determine the set of genes that are differentially methylated in a statistically significant way, and I'm working on seeing how the type I error rate and test power fluctuate using simulated binomial data. Currently it seems like low coverage is going to push me away from chi-squared and towards an exact test (my coverage is ~5x, which is quite low). I'm trying to figure out how best to implement that with the modkit output, however, since an exact test requires the two groups' counts, and getting that data back out of dmr is not immediately apparent to me; I only recently started thinking about this. A rough sketch of the simulation I have in mind is below.
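This sketch assumes two groups with the same true methylation fraction (the null hypothesis) and low per-site coverage; 2x2 tables with an empty column margin are skipped because the chi-square test is undefined for them.

import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)

def simulated_type_i_error(n_sites=10_000, coverage=5, frac=0.3, alpha=0.05):
    # Simulate methylated counts for two groups under the null hypothesis
    # (identical true fraction) and test each site with chi-squared.
    false_positives = 0
    tested = 0
    for _ in range(n_sites):
        a = rng.binomial(coverage, frac)
        b = rng.binomial(coverage, frac)
        table = np.array([[a, coverage - a], [b, coverage - b]])
        if (table.sum(axis=0) == 0).any():
            continue  # an all-zero column makes the test undefined
        _, p, _, _ = chi2_contingency(table)
        tested += 1
        false_positives += p < alpha
    return false_positives / tested

print(simulated_type_i_error())  # with counts this small, expect deviation from alpha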

ArtRand commented 11 months ago

I've got a couple ideas to try, let me do some experiments on my side and circle back.

I'm trying to figure how to best implement that with the modkit output however since it requires two groups and getting that data back out from dmr is not immediately apparent to me

What do you need exactly? You should be able to get the two groups out of the dmr output?

Ge0rges commented 11 months ago

You're right, I suppose it should be sufficient to use the sets of samplea_fractions and sampleb_fractions. Would it be correct to say that, for a given modification type, subtracting one fraction from the other gives the methylation difference?

ArtRand commented 11 months ago

Would it be correct to say that, for a given modification type, subtracting one fraction from the other gives the methylation difference?

Yes, as long as you're comparing only one methylation state at that base. For example, if your input data has only 5mC at cytosine bases and 6mA at adenine bases, then subtracting the two values will give you the difference in methylation between the samples. If your data has more than one methylation state, for example 5hmC and 5mC at cytosine bases, then, as you probably guessed, subtracting will give you the difference for that specific modification. To get the difference in overall methylation level in the latter case, sum the fractions within each sample before taking the difference.
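To make the two cases concrete, here is a small illustration; the modification labels and values are hypothetical stand-ins for fractions parsed from the dmr output.

# Hypothetical per-position fractions for two samples with two
# methylation states at cytosine (values are made up).
sample_a = {"5mC": 0.40, "5hmC": 0.10}
sample_b = {"5mC": 0.25, "5hmC": 0.05}

# Difference for one specific modification (here, 5mC):
diff_5mc = sample_a["5mC"] - sample_b["5mC"]  # 0.15

# Overall methylation difference: sum the states within each
# sample first, then subtract.
diff_total = sum(sample_a.values()) - sum(sample_b.values())  # 0.20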

Ge0rges commented 9 months ago

Hey @ArtRand, I'm circling back to this and was wondering if you've found any better statistical approaches for picking out the relevant (confident) set of DMR scores? For now I'm tempted to stick with my previously outlined method.

ArtRand commented 9 months ago

Hey @Ge0rges,

Thanks for checking in. I haven't come up with anything I'm happy with yet, but I'm actively working on it.

ArtRand commented 8 months ago

Hello @Ge0rges,

The latest version of modkit (v0.2.5) has a MAP-based p-value; feel free to give it a try. It should run a lot faster, too.

Ge0rges commented 5 months ago

Dropping some related reading centered on the statistical methods used by bisulfite data analysis tools, such as this and this.