nanoporetech / modkit

A bioinformatics tool for working with modified bases
https://nanoporetech.com/

Modkit DMR Pair: Process Killed, No Output #103

Open handoko12u opened 9 months ago

handoko12u commented 9 months ago

dmr.log

(modkit) sysadmin@sysadmin:/var/lib/minknow/data/20230907_LSK114_KNF_DNA_Long/4562949_ReRun1/modkit/dmr$ modkit dmr pair \
    -a /var/lib/minknow/data/20230907_LSK114_KNF_DNA_Long/4562949_ReRun1/modkit/bedmethyl.bed.gz \
    --index-a /var/lib/minknow/data/20230907_LSK114_KNF_DNA_Long/4562949_ReRun1/modkit/bedmethyl.bed.gz.tbi \
    -b /var/lib/minknow/data/20231122_LSK114_KNF_DNA_Short/4597620/modkit/methyl.bed.gz \
    --index-b /var/lib/minknow/data/20231122_LSK114_KNF_DNA_Short/4597620/modkit/methyl.bed.gz.tbi \
    -o dmr.bed \
    --ref /home/sysadmin/Downloads/hg38.chromFa/reference_ucsc_complete/hg38.fa \
    --base C \
    --log-filepath dmr.log

creating directory at ""
loading sites from input 'a' bedMethyl
Killed

I ran modkit dmr pair with a command that I believe is correct, but it generated no output. When I inspected the log file, it looked like the process was terminated, but it is unclear why. Can anyone help?

Thank you

ArtRand commented 9 months ago

Hello @handoko12u,

The most likely cause is you're running out of memory. Does the infrastructure you're running on have a hard limit? Maybe you could increase it. If you can give me an idea of how much it's using, I can look into ways to decrease the memory usage for modkit dmr.

handoko12u commented 9 months ago

These are my computer's limits:

sysadmin@sysadmin:~$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 255884
max locked memory       (kbytes, -l) 65536
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 255884
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

How can I check how much memory the modkit dmr process requires?

ArtRand commented 9 months ago

@handoko12u,

I don't have a simple way to determine a priori how much memory the analysis will take. Could you tell me how much RAM your machine has with free -h? You could run ps on the same machine to keep track of how much memory the process is using. Or use something like top or htop and look to see if the memory "balloons" until the process is killed.
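As a rough illustration of the "keep track of how much memory the process is using" suggestion above, here is a minimal Python sketch that polls a process's resident memory on Linux by reading /proc/<pid>/status. The function names `parse_vmrss_kib` and `watch_rss` are made up for this example and are not part of modkit; the sketch assumes a Linux machine and that you already know the PID (for example from ps or top).

```python
import time


def parse_vmrss_kib(status_text):
    """Return the VmRSS value (in KiB) from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:     123456 kB"
            return int(line.split()[1])
    return None


def watch_rss(pid, interval_s=5.0, max_samples=None):
    """Poll a process's resident memory until it exits; return the peak in KiB."""
    peak = 0
    samples = 0
    while max_samples is None or samples < max_samples:
        try:
            with open(f"/proc/{pid}/status") as fh:
                rss = parse_vmrss_kib(fh.read())
        except (FileNotFoundError, ProcessLookupError):
            break  # the process has exited (or was killed)
        if rss is not None:
            peak = max(peak, rss)
            # KiB / 1024**2 gives GiB
            print(f"pid {pid}: RSS {rss / 1024 ** 2:.2f} GiB "
                  f"(peak {peak / 1024 ** 2:.2f} GiB)")
        samples += 1
        time.sleep(interval_s)
    return peak
```

Calling `watch_rss(pid)` in a second terminal while modkit is running would print one line per interval, so the last lines before the process disappears show roughly how far memory grew before it was killed.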

I'll look into how much memory modkit dmr uses. I have not seen this be a problem on a server with >100GB RAM doing whole genome CpG differential analysis, but I'll take a closer look.

handoko12u commented 9 months ago

Ok thanks, I re-tried it, and please find below the memory utilization. I only have 64 GB RAM.

This is before running any process:

pre-process

This is after I started the modkit process:

first_minute

Memory usage went above 90%, and shortly after that the process failed:

fail

Please find the log files here:

dmrmulti.log

Thank you, and please advise.

ArtRand commented 8 months ago

Hello @handoko12u,

Sorry for being slow to get back to you. I would recommend sharding the input bedMethyl by genomic region (e.g. by chromosome or by total length) and running each shard separately. I have some ideas for reducing the memory required for this kind of analysis, but they won't be available until the next release. I'll let you know as soon as I have a build ready.
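One way the per-chromosome sharding suggested above might look is sketched below in Python. This is only an illustration under stated assumptions: it works on an uncompressed bedMethyl (like the bedmethyl.bed mentioned elsewhere in this thread), takes the chromosome from the first whitespace-separated column, and the function name `shard_bedmethyl_by_chrom` is hypothetical. Since the original command passes .bed.gz files with .tbi indexes, each shard would presumably need to be compressed and indexed again before being given to modkit dmr pair.

```python
import os


def shard_bedmethyl_by_chrom(bed_path, out_dir):
    """Split a plain-text bedMethyl into one file per chromosome (column 1).

    Only one output file is open at a time. A chromosome seen for the first
    time is opened in write mode; if its rows reappear later (unsorted input),
    the shard is reopened in append mode so no rows are lost.
    Returns a dict mapping chromosome name to shard path.
    """
    os.makedirs(out_dir, exist_ok=True)
    shards = {}
    current, out = None, None
    try:
        with open(bed_path) as src:
            for line in src:
                if not line.strip() or line.startswith("#"):
                    continue  # skip blank lines and header/comment lines
                chrom = line.split(None, 1)[0]
                if chrom != current:
                    if out is not None:
                        out.close()
                    mode = "a" if chrom in shards else "w"
                    shards.setdefault(chrom, os.path.join(out_dir, f"{chrom}.bed"))
                    out = open(shards[chrom], mode)
                    current = chrom
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return shards
```

Running the same split on both samples gives matching pairs of shards (for example the two chr1 files) that can be compared one pair at a time, so that only one chromosome's sites have to be held in memory per run.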

handoko12u commented 8 months ago

OK, thank you @ArtRand, I will wait for the next release.

ArtRand commented 8 months ago

Hello @handoko12u,

Could you tell me the number of rows and/or the size of the bedMethyl files you're using (/var/lib/minknow/data/20230907_LSK114_KNF_DNA_Long/4562949_ReRun1/modkit/bedmethyl.bed.gz and /var/lib/minknow/data/20231122_LSK114_KNF_DNA_Short/4597620/modkit/methyl.bed.gz), as well as the size of the reference genome (or a link)? Thanks, I'm hoping to have a solution for you soon.

handoko12u commented 8 months ago

Hello @ArtRand

My number of rows:

sysadmin@sysadmin:/var/lib/minknow/data/20230907_LSK114_KNF_DNA_Long/4562949_ReRun1/modkit$ wc -l bedmethyl.bed
143127464 bedmethyl.bed --> 11.3 GB

sysadmin@sysadmin:/var/lib/minknow/data/20231122_LSK114_KNF_DNA_Short/4597620/modkit$ wc -l methyl.bed
141931406 methyl.bed --> size 11.2 GB

Reference genome: https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz

Thank you

ArtRand commented 7 months ago

@handoko12u,

The latest version of modkit should use less memory when doing single-site DMR. Please give it a try at your convenience.