nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/
Other
140 stars 8 forks source link

Normalized the DMR scores #261

Open Congnguyenn opened 1 month ago

Congnguyenn commented 1 month ago

Hello @ArtRand, Thank you for your tool, It helps me a lot, I have a couple of questions related to the DMR score when I am trying to find the DMR (TSS regions provided by a bed file) in 2 samples with a fixed size (2kbps/region). The command I used is as follows

modkit dmr pair \ -a \${methylbed[i]} \ -b \${methylbed[j]} \ --regions-bed \${filename_i}_\${filename_j}.${bedtype}.intersected \ --min-valid-coverage ${cov_cutoff} \ --ref ${reference} \ --missing quiet \ --base C \ --threads ${task.cpus} \ --header \ --log-filepath \${filename_i}_\${filename_j}.dmr.log \ --out-path \${filename_i}_\${filename_j}.dmr \ --force

So my questions are:

Congnguyenn commented 1 month ago

Hello @ArtRand It would be nice to have your inputs on the above queries.

Thanks

ArtRand commented 1 month ago

Hello @Congnguyenn,

I apologize for the delay in getting back to you.

Should I normalize the score with the number of CpG sites in a certain TSS and the depth at each TSS (or whatever regions), therefore we will have a more standardized score, If yes, is there any available or suggested formula?

Looking at your first set of plots, are the low-%change/high-scoring (circled) points particularly long TSS? Did you perform 5hmC calling? To answer your question, I don't have a suggested formula. Normalizing by the number of CpGs is a reasonable idea. One potential draw back, however, is there may be TSSs that have very dense GpGs, then compared to a TSS with lower density and the same count, there will be less variability in the dense CpG TSS since correlation of neighboring CpGs tends to be high.

Should I consider the indeed DMR regions based on both score (from Modkit) and difference_pct_modified (As I calculated above), rather than solely based on the score? if yes, can you suggest a suitable cut-off? for example: score >= 4 and difference_pct_modified >=50? as illustrate in this image

I'm working on a method that will let users "count" the number of DMRs in regions, but I don't have anything yet. Your decision function seems reasonable to me, you could also try to make a null distribution for your problem. The latter is closer to the route I've been going down.

Congnguyenn commented 1 month ago

Thank @ArtRand for your response!

Looking at your first set of plots, are the low-%change/high-scoring (circled) points particularly long TSS? Did you perform 5hmC calling?

No, I set a fixed TSS size of 2000 bps, but the number of CpG sites within each TSS can vary. I only analyzed 5mC using the command modkit adjust-mods to combine 5hmC and 5mC. Additionally, I want to understand if the number of reads in a region (depth) affects the score value. For example, consider two TSS regions (TSS1 and TSS2). TSS1 has lower depth in sample A and sample B (10X and 12X, respectively), while TSS2 has higher depth in sample A and sample B (50X and 60X, respectively). Does the different score value get affected by the different depth?

To my knowledge, higher depth increases confidence in probability (due to more observations lead to smaller variance in the posterior) and does not affect the score. Only the number of CpG sites and the ratio between methylated and unmethylated sites affect the score.

I also want to know if are you working on functional annotation, the tool needs to consider the methylation status (methyl or unmethyl), I believe this would be a useful feature because it helps us to interpret the different

ArtRand commented 1 month ago

Hello @Congnguyenn,

I want to understand if the number of reads in a region (depth) affects the score value. For example, consider two TSS regions (TSS1 and TSS2). TSS1 has lower depth in sample A and sample B (10X and 12X, respectively), while TSS2 has higher depth in sample A and sample B (50X and 60X, respectively). Does the different score value get affected by the different depth?

It depends. Higher depth will allow the model to be more confident that the observed methylation distribution is close to the true (latent) methylation distribution. The effect of higher coverage is you'll get higher scores at smaller effect sizes (differences in methylation). If the effect size is very small or zero, increasing coverage will not increase the score. Practically speaking however, you'll probably observe higher scores in regions with higher coverage, but I don't think it makes these scores "incorrect". I could see how you might want to sub-sample the regions of high coverage to be equivalent to the regions of lower coverage, this functionality isn't available in modkit dmr pair with --regions right now. Let me think about it and do some experiments and maybe I can add it into the next release (should be pretty easy).