refresh-bio / KMC

Fast and frugal disk based k-mer counter

Resource specifications for stage 2 #123

Closed KamilSJaron closed 5 years ago

KamilSJaron commented 5 years ago

Hi there, thanks for making KMC, it's an unbelievably useful and convenient tool.

However, I am having trouble running it on a cluster. I need to specify the memory that the process will need, and the threshold I pass to KMC via the -m parameter does not appear to be respected at all. When I run with a 48G memory limit (leaving 12G of memory margin)

kmc -k21 -t32 -m36 -ci1 -cs100000 @filelist my_cool_database tmpdir

The job gets killed: mem 50337216kb exceeded limit 50331648kb. The process runs for a while, and it reports Stage 2: 0% before it gets killed. Is there a way to figure out how much memory stage 2 will take?
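To put the scheduler message into the same units as the -m value, the figures can be converted as in the sketch below. It assumes the scheduler's "kb" means KiB (1024 bytes), which is consistent with 50331648 being exactly 48 * 1024 * 1024. The helper name `kib_to_gib` is made up for illustration, and the exact unit KMC uses for -m is not stated in this thread, so the headroom figure is approximate.

```python
# Minimal sketch: interpret the scheduler's kill message from the report above.
# Assumption: "kb" in the message means KiB (1024 bytes).

KIB_PER_GIB = 1024 * 1024


def kib_to_gib(kib: int) -> float:
    """Convert a KiB count to GiB."""
    return kib / KIB_PER_GIB


limit_kib = 50331648  # limit reported by the scheduler
used_kib = 50337216   # usage at the moment the job was killed

limit_gib = kib_to_gib(limit_kib)
overshoot_kib = used_kib - limit_kib
headroom_gib = limit_gib - 36  # margin above -m36, treating -m as roughly GiB

print(f"limit: {limit_gib:.1f} GiB")
print(f"overshoot when killed: {overshoot_kib} KiB")
print(f"headroom above -m36: about {headroom_gib:.0f} GiB")
```

The overshoot at kill time only shows that the limit was crossed; it says nothing about how much memory stage 2 would eventually have needed.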

Cheers, Kamil

marekkokot commented 5 years ago

Hi,

Thanks for using KMC. I suspect your input is really large, is that the case? -m is a strong suggestion for KMC. In the case of large inputs, however, KMC may need to use more memory. The largest dataset we used in experiments was 736.4 Gbases (614.1 GB gzipped fastq) and KMC3 needed 33GB of RAM. Is your input of similar size? Nevertheless, one may force KMC to always follow the memory limit with the -sm (strict memory) parameter. Let me know if it helps. In general, there is a way to figure out the maximum memory requirement for stage two, but it is not straightforward, and to print it you would need to recompile KMC with a preprocessor flag enabled. I hope -sm will solve your problem. -sm is not enabled by default because in most cases it is not needed and it may slow down computations a bit. From your log, it seems that KMC would need more than your 48GB of memory, which I have never seen before, so if your input is not that huge, maybe there is a bug. I would really appreciate it if you could tell me something about your input data.

Thanks again for using KMC :)

KamilSJaron commented 5 years ago

Oh yes, it's quite a dataset, approximately 1.2 Tbases, so I suspect it's not a bug. Maybe I should have mentioned that ^^. I will try to use -sm and report back on how it goes. Thanks for such a prompt and friendly response.

To be honest, I am not just a regular user of KMC. It is even used within the tool I am co-developing, called smudgeplot. It's a k-mer spectra approach to guess the ploidy and visualize genome structure. Hence, I really do appreciate KMC (and I thought you might be interested to know what KMC is used for). Seriously, thanks for developing it!

marekkokot commented 5 years ago

Wow, it is a really big dataset; I hope -sm will be able to handle it efficiently. If any problem occurs, please let me know, as I would like to fix possible issues. There may also be another option to handle such a big database through the cooperation of kmc and kmc_tools, but it would probably be less performant, so I will not describe it for now. Out of curiosity, have you checked whether Jellyfish is able to handle your big database?

I am really glad that KMC is used in such a project. I can see that kmc_tools is also used; it is always nice to see that your tools are used in some interesting projects. If you have any suggestions for enhancements to KMC or kmc_tools, please let me know (though I will probably not be able to implement them quickly due to a bunch of other work).

KamilSJaron commented 5 years ago

Somehow even with -sm it goes over the specified threshold. I wrote -sm 60 but the job was killed for using 67108864kb. Hmmm, I am wondering whether it would be easier just to ask for lots and lots of memory. How much do you think would be safe?

marekkokot commented 5 years ago

Could you send me the whole command line you used? It should be something like:

kmc -k21 -t32 -m36 -sm -ci1 -cs100000 @filelist my_cool_database tmpdir

-sm tells kmc to treat the limit specified with -m as a strict limit, not a suggestion. Independently, I will check whether -sm works on a smaller dataset with a smaller memory limit; I will let you know if there is an issue and try to fix it.

In general, the total memory needed in the second stage is known only after the first stage. Let's suppose KMC knows it will need 10GB of memory for the second stage and the user specifies -m6; then KMC will ignore -m and allocate 10GB. On the other hand, if the user specified -m36, KMC would be able to process more data at once, and thus produce results faster. So, in this case, asking for a lot of memory will not be helpful :(

Is your dataset publicly available? Could you send me URLs to download it?
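The behaviour described above can be written down as a toy model. This is only a sketch of the description in this comment, not KMC's actual allocation logic: the function name and the idea of a single "required" figure are assumptions for illustration, and the real requirement is known only after stage 1.

```python
# Toy model of the stage 2 memory behaviour described in this thread.
# Assumption: one number summarises what stage 2 needs; KMC's real logic
# is more involved and is not shown here.


def stage2_memory_bound_gb(required_gb: float, m_gb: float, strict: bool) -> float:
    """Return a rough upper bound on stage 2 memory under this toy model."""
    if strict:
        # With -sm, the value given to -m is kept as a hard limit.
        return m_gb
    # Without -sm, -m is a suggestion: KMC takes what stage 2 needs if that
    # is more than -m, and may use up to -m otherwise to process more at once.
    return max(required_gb, m_gb)


# The two cases from the comment, plus the strict variant of the first one.
print(stage2_memory_bound_gb(10, 6, strict=False))   # -m6 is ignored
print(stage2_memory_bound_gb(10, 36, strict=False))  # up to -m36 may be used
print(stage2_memory_bound_gb(10, 6, strict=True))    # -m6 is enforced
```

Under this model the scheduler request has to cover either the -m value (with -sm) or an unknown stage 2 requirement (without it), which is why -sm makes the job size predictable.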

UPDATE: I have checked -sm on a smaller dataset with a smaller amount of specified memory, and it works.

KamilSJaron commented 5 years ago

Ah, I misunderstood the syntax. I executed

kmc -k21 -t32 -sm60 -ci1 -cs100000 @files out tmpdir

I will try to fix it.
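The mix-up is easy to catch before submitting a job. The sketch below rewrites an argument such as -sm60 into the form suggested earlier in the thread (-m60 -sm). It relies only on what this thread says about the two options; the function name is hypothetical, and how KMC itself interprets -sm60 is not covered here.

```python
# Minimal sketch: rewrite "-sm<N>" into "-m<N> -sm" in a kmc argument list.
# Assumption (from this thread only): -m takes the limit, -sm takes no value.
import re
from typing import List

_SM_WITH_VALUE = re.compile(r"^-sm(\d+)$")


def split_strict_memory(args: List[str]) -> List[str]:
    """Return a copy of args with any '-sm<N>' replaced by '-m<N>', '-sm'."""
    fixed: List[str] = []
    for arg in args:
        match = _SM_WITH_VALUE.match(arg)
        if match:
            fixed.extend([f"-m{match.group(1)}", "-sm"])
        else:
            fixed.append(arg)
    return fixed


original = "kmc -k21 -t32 -sm60 -ci1 -cs100000 @files out tmpdir".split()
print(" ".join(split_strict_memory(original)))
```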

Unfortunately the dataset is not public, and I am not even sure how much I can talk about it (it's not my project, I am just doing genome profiling for them). Sorry, I was in a bit of a hurry when I was reporting the failure; I should have included these details right away.

KamilSJaron commented 5 years ago

With the corrected syntax the job has finished :-). Thanks a lot for your help.

marekkokot commented 5 years ago

Thanks for the feedback, I am glad it works :) Out of curiosity, what were the times (stage 1, 2 and 3) reported by kmc?

KamilSJaron commented 5 years ago

Not terribly much :-) (cores -t32, memory: -m60 -sm)

1st stage: 22819.5s
2nd stage: 24585.8s
3rd stage: 278.353s

marekkokot commented 5 years ago

Ok, thanks. In case of any further doubts don't hesitate to ask :) Thanks again for using KMC.