marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

How can I optimize Canu v2.2 for the SGE grid? Canu v2.2 crashes when there are multiple hosts on SGE #2281

Closed alvizain786 closed 10 months ago

alvizain786 commented 11 months ago

Hi All,

I would like to thank the team for the previous help I received on my other question concerning duplications. That really helped a lot.

I have an unrelated question concerning Canu. I am using Canu on SGE, where I have two distinct nodes.

There is one node that has one host with plenty of cores, but it has lower clock speeds and more RAM per core. I am able to assemble genomes without a problem on this host. But it takes a long time when I have to do multiple de novo assemblies, i.e. over 2 months for some PacBio 2 HiFi data sets [3 GB (60X theoretical coverage) with slightly more resources and 1.3 GB (25X theoretical coverage) with less resources].

I have another node that has 15 hosts with higher CPU speeds but slightly less RAM. This is where things crash all the time.

My first question:

When I give Canu 6 threads and plenty of RAM, I still get something like this for grid resources. How can I make use of more CPUs and RAM for each step? Is there an option I should add? How can I optimize Canu to use more resources and complete the de novo assemblies faster?

Command:

```
useGrid=true GridEngineResourceOption=-pe $node.name 6 -l mem_free=60G gridEngineArrayOption=-t ARRAY_JOBS -tc 6
```

```
-- Grid:  meryl    12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap       8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   6.000 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    4.000 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl    4.000 GB    4 CPUs  (overlap detection)
-- Grid:  cor       -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb       4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red      16.000 GB    4 CPUs  (read error detection)
-- Grid:  oea       8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      16.000 GB    4 CPUs  (contig construction with bogart)
-- Grid:  cns       -.--- GB    4 CPUs  (consensus)
```

Question 2:

When I try the node with multiple hosts, it always crashes at the meryl step. Any thoughts on what I can do to make it work?

Error:

```
Failed to submit compute jobs.

Failed at /path/to/Canu_v2/bin/../lib/site_perl/canu/Execution.pm line 1259.
CRASH: canu::Execution::submitOrRunParallelJob('sample.hifireads', 'meryl', 'unitigging/0-mercounts', 'meryl-count', 1) called at /path/to/Canu_v2/bin/../lib/site_perl/canu/Meryl.pm line 847
CRASH: canu::Meryl::merylCountCheck('sample.hifireads', 'utg') called at /path/to/Canu_v2/bin/canu line 1117
CRASH:
CRASH: Last 50 lines of the relevant log file (unitigging/0-mercounts/meryl-count.jobSubmit-01.out):
CRASH:
CRASH: Unable to run job: denied: host "actual.local.host.name.and.number" is no submit host.
```

Also, if I would like to use Canu on an AWS server, what is the best strategy for eukaryotic and microbial assemblies? How many resources should we provide, in your experience? Additionally, what settings should we use?

Thank you in advance for the help.

brianwalenz commented 11 months ago

Hi.

  1. When run on a grid, both Canu and the grid need to know memory and CPU limits. There is no explicit link between the two - for example, you can submit a job to the grid requesting 6 CPUs (via `-pe $node.name 6`) but then run the command with fewer or more compute threads - the grid has no way of enforcing that the command use 6 CPUs.

    In your log, canu has decided to use 4 CPUs and between 4 and 16 GB memory for each job (based on genome size and available hosts in the grid). However, by using (the rather low-level option) GridEngineResourceOption you've explicitly told the grid that your jobs will use 6 CPUs and need 60 GB free memory. With the default value of GridEngineResourceOption, canu would itself fill in the resources required for each job. And so, the way to increase the number of CPUs is, for example, ovlThreads=8, to request that the overlap jobs use 8 compute threads.

    Read through https://canu.readthedocs.io/en/latest/tutorial.html, the second/third section discusses this.

    It is also possible to adjust the job sizes to get more jobs (instead of just using more CPUs for each job). Overlaps are usually the slowest step, and the primary option for fiddling with overlap job sizes is ovlRefBlockSize.

  2. Unable to run job: denied: host "actual.local.host.name.and.number" is no submit host. This is an SGE config issue. Canu requires that execution hosts be able to submit jobs to the grid. Here's a link that should help: https://docs.oracle.com/cd/E19957-01/820-0698/eqqis/index.html

  3. For microbial on AWS, I'd just grab a medium size node -- 8-16 CPUs 16-24GB memory, not that much disk -- and run Canu as a single job. For eukaryotic though, you'll most likely need to setup an SGE or Slurm cluster on AWS. Sadly, I don't know how to do that; searching for 'slurm aws' gave this link: https://docs.aws.amazon.com/parallelcluster/latest/ug/slurm-workload-manager-v3.html
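The advice in point 1 can be sketched as a command line. This is a hedged example, not taken from the thread: the file names, output directory, and genome size are placeholders, and the options shown (`ovlThreads`, `maxThreads`, `maxMemory`) are the ones described in the Canu parameter reference.

```shell
# Sketch: let Canu compute its own grid resource requests (leave
# gridEngineResourceOption at its default) and raise per-step thread
# counts instead. All names and sizes below are placeholders.
canu -p sample -d sample-asm genomeSize=3g \
     useGrid=true \
     ovlThreads=8 \
     maxThreads=16 maxMemory=60g \
     -pacbio-hifi sample.hifireads.fastq.gz
```

With the default `gridEngineResourceOption`, Canu translates each step's computed memory/CPU needs into the grid request itself, so the submitted request and the threads actually used stay in sync.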
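For the submit-host error in point 2, the usual fix is to register each execution host as an SGE submit host. A minimal sketch, assuming SGE admin rights; the host name is a placeholder:

```shell
# List the current submit hosts.
qconf -ss

# Add an execution host to the submit-host list (run as an SGE admin).
qconf -as compute-node-01.local

# Repeat qconf -as for every execution host that will run Canu jobs,
# since Canu jobs submit the next stage from the host they run on.
```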

alvizain786 commented 11 months ago

Hi Brian,

Thank you for the insight. I appreciate the help. Would something like this override all the settings to use a specific number of cores and threads?

`gridOptions <string=unset>`

```
useGrid=true GridEngineResourceOption=-pe $node.name 6 -l mem_free=60G gridEngineArrayOption=-t ARRAY_JOBS -tc 6 gridOptions=-pe $node.name 6 -l mem_free=60G -tc 6
```

This is based on:

https://canu.readthedocs.io/en/latest/parameter-reference.html

Thank you for the SGE information. I will take a deeper look into it, and hopefully it will work.

I am more familiar with SGE than Slurm, but I will take a look into it as well. Hopefully I can get something set up.

skoren commented 10 months ago

No, the resource options aren't connected to memory/threads actually used, as @brianwalenz said above. I don't see the difference between the initial command and what you posted above. Generally, you don't want to overwrite any of the grid request options and let Canu handle it for you.

You can set the Threads option for each step; the parameter reference lists all the available options. You can also set minThreads=6. In general, on the grid, the number of cores per job doesn't matter as much as the total available CPUs. Canu will submit multiple jobs, each using 4 cores (with the default config above), so multiple jobs can share a node. I wouldn't set memory, as it is determined by data structures, and committing more memory won't speed anything up.
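As a sketch of the above (hedged; the file names and genome size are placeholders, not from this thread), setting a thread floor while leaving the rest of the resource sizing to Canu might look like:

```shell
# Sketch: guarantee at least 6 threads per job via minThreads, but let
# Canu pick memory and per-step defaults itself. Placeholders throughout.
canu -p sample -d sample-asm genomeSize=1.3g \
     useGrid=true minThreads=6 \
     -pacbio-hifi sample.hifireads.fastq.gz
```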

skoren commented 10 months ago

Idle, answered.