extract_CpG_data.py issue

jsh58 / DMRfinder

Identifying differentially methylated regions from MethylC-seq (bisulfite-sequencing) data

MIT License

26 stars 8 forks source link

extract_CpG_data.py issue #2

Closed jsh58 closed 7 years ago

jsh58 commented 7 years ago

memory-saving version of the script, when analyzing a sorted SAM

jsh58 commented 7 years ago

Sergei (@svrever),

I have now revised extract_CpG_data.py with a memory-saving option -b. It is available here. It requires that the SAM be coordinate-sorted in advance.

Please let me know if you encounter any difficulties with it.

John Gaspar

svrever commented 7 years ago

Hi John, thanks for the updated version of the script. I started it on the 35G .bam file with the -b option. It is running for over a day now and so far is at ~7%, and the memory usage is really low. I will let you know when it finishes.

Sergei

jsh58 commented 7 years ago

Sergei,

OK, glad to hear the memory usage is low, but it seems to be running very slowly. It should be producing output as it runs, can you check that and see if it makes sense?

John Gaspar

svrever commented 7 years ago

The output looks fine, although occasionally you can find what looks like an unmerged CpG: . . . . . Scaffold69178 70 71 100.000000 1 0 Scaffold69178 92 93 100.000000 1 0 Scaffold69178 166 167 5.555556 1 17 Scaffold69178 186 187 5.555556 2 34 Scaffold69178 197 198 0.000000 0 41 Scaffold69178 203 204 9.302326 4 39 Scaffold69178 205 206 0.000000 0 1 Scaffold69178 229 230 0.000000 0 2 Scaffold69178 283 284 15.789474 6 32 Scaffold69178 285 286 21.621622 8 29 Scaffold69178 298 299 9.375000 3 29 Scaffold69178 303 304 7.407407 2 25 Scaffold69178 307 308 9.090909 2 20 . . . . . but the occurrence is fairly low. Multi-threading option would help to speed things up?

Sergei

jsh58 commented 7 years ago

Sergei,

I do not think those are unmerged CpGs. If you check the genome sequence, it will probably be 'CGCG' at those positions.

I tested the script (with -b) on a 32Gb BAM file and it finished in about 12 hours. This was using samtools 1.5 and python 2.7.12, maybe your versions are different.

John Gaspar

svrever commented 7 years ago

You are right, there are GCGC's. And slower speed is maybe due to the multiple scaffolds, or the hardware? I am using the same python version as you and samtools 1.3.1

Sergei

jsh58 commented 7 years ago

Sergei,

The multiple scaffolds will cause a little slow down, since each read's reference sequence is verified by the script. But that shouldn't take much time, even over millions of reads. It is probably a hardware issue. Hopefully it will finish in a day or two.

John Gaspar

svrever commented 7 years ago

Hi John,

the script is running for a week now, and it is about 30% through. Probably, another 2-3 weeks and it will be done.
:-)

Sergei

jsh58 commented 7 years ago

Sergei,

OK, that is rather absurdly slow. I cannot even guess what may be going on with your datasets and your computer. Even if it were running entirely in swap memory, it shouldn't be that slow.

John Gaspar

svrever commented 7 years ago

Computer is dell precision 7910, 8core 2.4Ghz, Oracle VM - it usually runs pretty well almost everything, some scripts do crash it though. 4-5G of RAM and 100% of one CPU core is used, and swap memory is not used at all. Well, I guess if it is running on your setup, something should be wrong with mine.

Sergei

jsh58 commented 7 years ago

Sergei,

Well, it could be the VM, but I'm not in a position to judge that. Regardless, I have released a new version of DMRfinder (0.3) here on GitHub, with the memory-saving options. I do not think there is more I can do here, so I'll close this issue.

Good luck,

John Gaspar