nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

p values using regions option #188

Closed lucy924 closed 4 months ago

lucy924 commented 4 months ago

Hi, I'm looking for some advice. I've been using dmr multi (version 0.2.8) with CpG islands as the regions. This does not output the MAP-based p-value which, as I understand it, is because the comparison is over regions rather than single sites. I'm not a statistician. I have looked at issues #93 and #122, but I was wondering if you had any advice on assessing the significance of the score values when using the regions option?

My experimental setup is the same as the one @EpiAllele described in #122. Most of my scores are <20, but in a single paired test I have a few between 20 and 40, one at 68 and one at 282. The other pairs show a similar spread, although none has a score above 100.

I appreciate the great package! Thank you!

ArtRand commented 4 months ago

Hello @lucy924,

Significance tests over regions are tricky to get right in a general way. As the regions get larger or the modified bases become denser, the number of regions that end up being "significant" increases simply because the test becomes overpowered: with enough observations, even a very small difference yields a tiny p-value. I appreciate that experimenters often want some kind of decision function with which to say "these are differentially methylated regions". Here are a couple of ideas:

One thing I'd like to add soon is to emit the posterior probabilities in the segmentation output, so you could say that "this region is labeled as 'different' with X probability". I just haven't gotten around to implementing that yet.

Sorry I don't have a more concrete suggestion, let me think about it a little more.
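
To illustrate the overpowering effect described above, here is a minimal Python sketch. It is not how modkit computes its score. It assumes a plain pooled two-proportion z-test, and the function name two_proportion_p_value, the 50% vs 52% methylation levels, and the call counts are all hypothetical values chosen for illustration.

```python
import math


def two_proportion_p_value(mod_a, total_a, mod_b, total_b):
    """Two-sided p-value of a pooled two-proportion z-test.

    Illustration only (not modkit's method). Assumes the pooled
    proportion is strictly between 0 and 1.
    """
    p_a = mod_a / total_a
    p_b = mod_b / total_b
    pooled = (mod_a + mod_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of a standard normal
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical region: the methylation difference is held at 2 points
# (50% vs 52%) while the number of modification calls per sample grows.
p_values = []
for n in (100, 1_000, 10_000, 100_000):
    p = two_proportion_p_value(round(0.50 * n), n, round(0.52 * n), n)
    p_values.append(p)
    print(f"calls per sample: {n:>7}  p-value: {p:.3g}")
```

The effect size never changes, yet the p-value keeps shrinking as the number of calls grows, so a fixed p-value cutoff ends up flagging more and more large or dense regions.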

lucy924 commented 4 months ago

Thank you for your advice! I'll explore more with these in mind.

ArtRand commented 4 months ago

Great, feel free to re-open this issue if you have any additional questions. I'll ping here if I think of some additional advice.