streetslab / dimelo

python package for analysis of dimelo-seq & nanopore modified base data
MIT License
3 stars 5 forks source link

Setting threshA/C = 0 in dm.plot_enrichment_profile() function plots fluctuations instead of flat lines #38

Closed MichalRozenwald closed 6 months ago

MichalRozenwald commented 1 year ago

When setting the threshA and threshC parameters to 0 in the dm.plot_enrichment_profile() function from the the dimelo package using the dimilo test ctcf demo data from github in the rolling avg plots we get lines that are fluctuating (CTCF_demo_A+CG_sm_rolling_avg.pdf). The expectation was to see flat horizontal y-lines corresponding to the fractions equal to 1 instead of fluctuations as all the regions are supposed to be considered methylated when setting the thresholds to be equal to zero. What could be the reason for that?

Here is the code:

import dimelo as dm
bam = "/content/dimelo/dimelo/test/data/ctcf_demo.sorted.bam"
sampleName = "CTCF_demo"
outDir = "/content/dimelo_ctcf_test_out"
bedPeak = "/content/dimelo/dimelo/test/data/ctcf_demo_peak.bed"
bedNotPeak = "/content/dimelo/dimelo/test/data/ctcf_demo_not_peak.bed"

dm.plot_enrichment_profile(bam, sampleName,  bedPeak,   "A+CG",
    outDir + "/plot_enrichment_profile_threshAC0",
    threshA=0, threshC=0)

And here is a Google Colab Notebook installing dimelo package and reproducing these results and figures: https://colab.research.google.com/drive/1ajMwrk98OVuT176DafGKxEuaH7U55-OV?usp=sharing

OberonDixon commented 6 months ago

My guess here is that this is because, for a single region, the behavior used to be that the rolling average (default size 25 or something) would get zeros for places with no valid or modified bases in the pileup. This is now changed with the modkit_parsing_beta branch, with the implementation of JIRA issue DM-1 with this commit.

However, we also do not allow thresholds of zero with this version; dimelo.parse_bam.pileup and dimelo_parse_bam.extract will raise errors. A threshold of zero is a meaningless analysis as it will by definition be 100% everywhere where the motif is present.