p values using regions option

lucy924 commented 4 months ago

Hi, I'm looking for some advice. I've been using dmr multi (version 0.2.8) with CpG islands as the regions. This does not output the MAP-based p-value; as I understand it, that is because the comparison is over regions rather than single sites. I'm not a statistician. I have looked at issues #93 and #122, but I was wondering if you had any advice on assessing the significance of the score values when using the regions option? My experimental setup is the same as the one @EpiAllele described in #122. Most of my scores are <20, but in a single paired test I have a few between 20 and 40, one at 68 and one at 282. The other pairs have a similar spread, although none are higher than 100. I appreciate the great package! Thank you!

ArtRand commented 4 months ago

Hello @lucy924,

Significance tests over regions are tricky to get right in a general way. As the regions get larger or the modified bases within them get denser, the number of regions that end up being "significant" increases simply because the test becomes overpowered. I appreciate that experimenters often want some kind of decision function with which to say "these are differentially methylated regions". Here are a few ideas:

- It sounds like most of the CpG islands have low scores and a few are outliers. You could look at the distribution of scores among pairwise comparisons within the same condition (i.e. control vs control), then decide on a percentile threshold. Scores above, say, the 95th or 98th percentile could be considered different.
- You could qualify a region as different if some fraction of the individual modified positions have a MAP-based p-value below a threshold (granted, now you have to decide on two thresholds).
- You could run the program with --segment and decide a region is different if the whole region, or some proportion of it, is labeled as "different".
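The first two ideas can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of modkit: the function names and the default cutoffs are hypothetical, and the inputs are plain dictionaries and lists that you would fill from your own parsing of the dmr output (no column layout is assumed here).

```python
from statistics import quantiles


def percentile_threshold(null_scores, pct=95):
    """Return the pct-th percentile of scores from same-condition
    comparisons (e.g. control vs control), used as an empirical cutoff.

    null_scores: iterable of region scores (at least two values).
    pct: integer percentile between 1 and 99.
    """
    if not 1 <= pct <= 99:
        raise ValueError("pct must be an integer between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the
    # pct-th percentile. "inclusive" treats the data as the full
    # observed range rather than a sample from a wider population.
    cuts = quantiles(null_scores, n=100, method="inclusive")
    return cuts[pct - 1]


def flag_by_score(region_scores, threshold):
    """Keep regions whose score exceeds the empirical threshold.

    region_scores: dict mapping a region name to its score.
    """
    return {name: s for name, s in region_scores.items() if s > threshold}


def flag_by_site_fraction(site_pvalues, p_cutoff=0.05, min_fraction=0.5):
    """Keep regions where at least min_fraction of the single-site
    p-values fall below p_cutoff. Both cutoffs are arbitrary defaults.

    site_pvalues: dict mapping a region name to a list of per-site
    p-values for the positions inside that region.
    Returns a dict mapping each flagged region to its fraction.
    """
    flagged = {}
    for name, pvals in site_pvalues.items():
        if not pvals:
            continue  # no sites with a p-value in this region
        frac = sum(p < p_cutoff for p in pvals) / len(pvals)
        if frac >= min_fraction:
            flagged[name] = frac
    return flagged
```

The percentile cutoff is only as good as the control-vs-control comparisons behind it, so with few replicates the resulting threshold will be noisy.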

One thing I'd like to add soon is to emit the posterior probabilities in the segmentation output, so you could say that "this region is labeled as 'different' with X probability". I just haven't gotten around to implementing that yet.
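In the meantime, the --segment idea can be approximated by measuring how much of a region is covered by segments labeled "different". The sketch below is a hypothetical helper, not a modkit function. It assumes you have already parsed the segmentation output into (start, end, label) tuples for a single contig, with half-open coordinates and non-overlapping segments; no particular file layout is assumed.

```python
def different_fraction(region, segments, label="different"):
    """Fraction of a region's length covered by segments with `label`.

    region: (start, end) tuple, half-open, on one contig.
    segments: iterable of (start, end, label) tuples on the same
        contig, assumed not to overlap one another.
    """
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    covered = 0
    for s_start, s_end, s_label in segments:
        if s_label != label:
            continue
        # Length of the intersection between the segment and the region.
        overlap = min(r_end, s_end) - max(r_start, s_start)
        if overlap > 0:
            covered += overlap
    return covered / (r_end - r_start)
```

Note that this measures coverage in base pairs rather than in number of modified positions, so a region with unevenly spaced CpGs may warrant counting sites instead. The proportion at which a region counts as different is still a threshold you have to choose.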

Sorry I don't have a more concrete suggestion. Let me think about it a little more.

lucy924 commented 4 months ago

Thank you for your advice! I'll explore more with these in mind.

ArtRand commented 4 months ago

Great, feel free to re-open this issue if you have any additional questions. I'll ping here if I think of some additional advice.

nanoporetech / modkit

p values using regions option #188