entropy vs sequencing depth/CG content

ywang285 commented 1 week ago

Hi, thank you for providing this great tool! I am interested in calculating CpG entropy, and I am wondering how sequencing depth (for example, 10X vs. 30X coverage) and CG content (for example, reads from a high-CG region vs. a low-CG region that still has >=4 CGs within the window size) affect the entropy value. Thank you!

ArtRand commented 1 week ago

Hello @ywang285,

If you don't have deep enough coverage to sample all of the $\textbf{N}$ possible patterns (referring to the documentation on the calculation), the calculated entropy will be biased low. In the extreme case of coverage 1, only a single pattern is observed, so the entropy will always be 0.0. On the other hand, as coverage increases and you sample all of the methylation patterns, I would expect the calculation to reach a value close to the "true value". The entropy may continue to grow as coverage increases because reads with methylation call errors accumulate, but I would expect this growth to be small since the $Pr(n_i)$ of each erroneous pattern will be small.
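To make the coverage effect concrete, here is a minimal simulation sketch. It assumes a plain Shannon entropy over the observed patterns and does not reproduce any normalization modkit applies, so only the trend is meaningful, not the absolute values. The pattern distribution `TRUE_PROBS` and the function names are made up for illustration.

```python
import math
import random
from collections import Counter


def pattern_entropy(patterns):
    """Shannon entropy (bits) of the empirical distribution of patterns."""
    counts = Counter(patterns)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def mean_entropy_at_coverage(true_probs, coverage, trials=200, seed=0):
    """Average empirical entropy when `coverage` reads are drawn per window."""
    rng = random.Random(seed)
    patterns = list(true_probs)
    weights = [true_probs[p] for p in patterns]
    total = 0.0
    for _ in range(trials):
        reads = rng.choices(patterns, weights=weights, k=coverage)
        total += pattern_entropy(reads)
    return total / trials


# Hypothetical window with 4 CpGs: '1' = modified, '0' = unmodified.
# These probabilities are an assumption, not real data.
TRUE_PROBS = {"0000": 0.4, "1111": 0.4, "0011": 0.1, "1100": 0.1}

if __name__ == "__main__":
    true_entropy = sum(-p * math.log2(p) for p in TRUE_PROBS.values())
    print(f"entropy of the assumed true distribution: {true_entropy:.3f} bits")
    for cov in (1, 5, 10, 30, 100):
        mean = mean_entropy_at_coverage(TRUE_PROBS, cov)
        print(f"coverage {cov:>3}: mean empirical entropy {mean:.3f} bits")
```

At coverage 1 every window yields exactly 0.0, and the mean should climb toward the entropy of the assumed distribution as coverage grows. Call errors are not modeled here.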

I have less of an intuition for how CpG density will change entropy. Given a constant number of CpGs (--num-positions) within a smaller interval of the genome (--window-size), you may expect the entropy to be lower since CpGs tend to be spatially correlated (for example, all modified or all unmodified). As you increase the interval size (keeping the number of CpGs constant), you have more opportunity to sample different biological processes. On the other hand, I could also see dense regions of CpGs with a lot of competition for binding having very high entropy.

So in general, I would recommend trying to get as high coverage as possible (up to a point, about 30X per haplotype), and trying multiple settings for --num-positions and --window-size. Let me know if this isn't clear; I'm happy to answer any follow-up questions.

ywang285 commented 1 week ago

Thank you so much for your reply! I plan to compare the entropy of genomic regions without amplification (10X-20X depth) vs. amplicon regions (>100X depth). However, it sounds like the comparison won't be fair unless I subsample the amplicon regions. Is that the case, or do you have any suggestions?

ArtRand commented 4 days ago

Hello @ywang285,

I don't think you'll need to sub-sample the amplified reads since they should have very low modification rates. If there is high native methylation entropy in a genomic region it should be noticeable above an amplified background.

nanoporetech / modkit

entropy vs sequencing depth/CG content #214