smithlabcode / methpipe

A pipeline for analyzing DNA methylation data from bisulfite sequencing.
http://smithlabresearch.org/methpipe
67 stars 27 forks source link

Radmeth reporting whole genome as DMR #221

Closed guilhermesena1 closed 2 years ago

guilhermesena1 commented 2 years ago

Datasets from MethBase: Gao-Human-2015/*Blood*.meth Roadmap-Human-2015/*Esophagus*.meth all from hg38

design-matrix.txt:

  is_blood
Human_BloodALLL1_R1 1
Human_BloodALLL1_R2 1
Human_BloodALLL2_R1 1
Human_BloodALLL2_R2 1
Human_BloodALLL2_R3 1
Human_BloodHealthy_R1 1
Human_BloodHealthy_R3 1
Human_BloodHealthy_R4 1
Human_BloodHealthy_R5 1
Human_Esophagus_N11 0
Human_Esophagus_N12 0
Human_Esophagus_N13 0
Human_Esophagus_N2  0
Human_Esophagus_N6  0
Human_Esophagus_N7  0
Human_Esophagus_N8  0

Commands:

radmeth regression -o tmp.txt -v -f is_blood design.txt input.meth
radmeth adjust -bins 1:200:1 tmp1.txt >tmp-adjusted.txt
awk '$7 <= 0.01 && 7 != "-1-1"' tmp-adjusted.txt >tmp-significant.txt
radmeth merge -p 0.01 tmp-significant.txt >out.bed
awk '{sum += $3 - $2} END {print sum}' out.bed 

result: 2803905886 (which is almost the size of the human genome).

Probably not the desired behavior since many CpGs are expected to be constant and "break" the genomic regions.

guilhermesena1 commented 2 years ago

That was my mistake. We don't need the awk step in the middle, and this kind of filtering results in only keeping significant CpGs, which merge then just combines together into a whole chromosome (which is expected behavior)