guoweilong / cgmaptools

toolbox for analysing BS-seq data, advanced features in SNV, ASM and DMR
https://cgmaptools.github.io

Can I run cgmaptools in parallel? #35

Open Kennyluo4 opened 4 years ago

Kennyluo4 commented 4 years ago

Hello, I'm trying to use cgmaptools to analyze my methylation data. My data is huge (~60 Gb BAM file per sample; 9 samples, with 2 sets of treatment and 1 set of control). I failed to call methylation using the CGmapFromBAM -O option (to remove overlap). This issue has already been reported, and I hope it can be solved soon.

I finally used MethylDackel to call methylation from the BAM files and converted the results to CGmap files, but each CGmap is still ~40 Gb in size. I tried to run methylKit and DSS for the analysis, but R is terrible at processing data this large, so I turned to cgmaptools. I haven't finished the analysis yet, but processing these samples with CGmapStatCov still seems to take a while (although it is not as RAM-consuming as R). Can I run my samples through cgmaptools in parallel by allocating more CPUs?

BTW, it seems that I can only merge my replicates to do the differential methylation analysis in cgmaptools. Why not use the replicate information for the statistical inference? Is there an evaluation or comparison of the different statistical models used for DM analysis? Thanks, Ziliang

guoweilong commented 4 years ago

Hi Ziliang,

I haven't had time to fix the problem with CGmapFromBAM -O, so for now the workable approach is to run it without removing overlap.

For your second question, running cgmaptools in parallel by submitting multiple tasks will be fine.
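As a rough illustration of "submitting multiple tasks", here is a minimal Python sketch that launches one independent process per sample and waits for all of them. The helper name `run_parallel` and the sample names are hypothetical, and the commands shown are placeholders rather than real cgmaptools invocations; substitute the actual command line you run for each sample.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands, max_workers=3):
    """Run each command (a list of argv strings) as its own process.

    Returns the exit codes in the same order as `commands`.
    """
    def run_one(argv):
        # Each thread only waits on its child process; the real CPU work
        # happens in the subprocesses, so threads are sufficient here.
        result = subprocess.run(
            argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        return result.returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Hypothetical sample names; each placeholder command stands in for
    # the real per-sample cgmaptools call.
    samples = ["ctrl_1", "treatA_1", "treatB_1"]
    commands = [
        [sys.executable, "-c", f"print('processing {name}')"]
        for name in samples
    ]
    print(run_parallel(commands, max_workers=3))
```

`max_workers` caps how many samples run at once, so it can be matched to the CPUs and memory available. On a cluster, submitting one scheduler job per sample achieves the same thing.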

For the DMR question, your suggestion is valuable. However, there are still open issues, including balancing the coverage (detectable regions) among samples for poorly mapped datasets. If you would like to use the replicate information, I suggest running cgmaptools mbin and then writing your own code to construct a DMR model, such as a t test.
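One possible shape for such code is sketched below, using only the Python standard library. It assumes the per-bin methylation levels have already been parsed into plain lists of floats, one value per replicate; the bin labels, values, and the `welch_t` helper are hypothetical, and parsing the actual cgmaptools mbin output is not shown. It uses Welch's unequal-variance t test as one choice of t test, and skips bins where any replicate lacks coverage, which is a simple way to handle the coverage-balancing problem mentioned above.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for Welch's t test."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two replicates")
    # Squared standard error of each group mean.
    se2_a = variance(group_a) / na
    se2_b = variance(group_b) / nb
    if se2_a + se2_b == 0:
        raise ValueError("both groups have zero variance")
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1)
    )
    return t, df


if __name__ == "__main__":
    # Hypothetical per-bin methylation levels: (treatment, control).
    # None marks a replicate without enough coverage in that bin.
    bins = {
        "chr1:1-5000": ([0.80, 0.90, 1.00], [0.20, 0.30, 0.40]),
        "chr1:5001-10000": ([0.50, None, 0.55], [0.52, 0.49, 0.51]),
    }
    for label, (treatment, control) in bins.items():
        if None in treatment or None in control:
            continue  # bin not detectable in every replicate
        t, df = welch_t(treatment, control)
        print(label, round(t, 3), round(df, 2))
```

A p-value can then be obtained from a t-distribution with `df` degrees of freedom (for example with a statistics package), followed by multiple-testing correction across bins.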

Best, Weilong