marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Canu failed #538

Closed: moudixtc closed this issue 7 years ago

moudixtc commented 7 years ago

Hi, I just started using canu, and I'm sorry if this is something obvious, but please help me understand what went wrong.

I'm running canu v1.5 on a grid setup using SGE, which is bootstrapped by cfncluster on AWS.

I got the following error when running this command:

canu -p F_vert2 -d F_vert2_auto genomeSize=50m -pacbio-raw /shared/filtered_subreads.fasta gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe make THREADS"

-- Canu release v1.5
-- Detected Java(TM) Runtime Environment '1.8.0_131' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 8 CPUs and 31 gigabytes of memory.
-- Detected Sun Grid Engine in '/opt/sge/default'.
-- User supplied Grid Engine environment '-pe make THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY'.
-- 
-- Found   1 host  with   2 cores and    7 GB memory under Sun Grid Engine control.
-- Found   2 hosts with   8 cores and   31 GB memory under Sun Grid Engine control.
--
-- Run under grid control using    7 GB and   2 CPUs for stage 'meryl'.
-- Run under grid control using   13 GB and   8 CPUs for stage 'mhap (cor)'.
-- Run under grid control using    7 GB and   2 CPUs for stage 'overlapper (obt)'.
-- Run under grid control using    7 GB and   2 CPUs for stage 'overlapper (utg)'.
-- Run under grid control using    7 GB and   2 CPUs for stage 'falcon_sense'.
-- Run under grid control using    3 GB and   1 CPU  for stage 'ovStore bucketizer'.
-- Run under grid control using    8 GB and   1 CPU  for stage 'ovStore sorting'.
-- Run under grid control using    6 GB and   2 CPUs for stage 'read error detection'.
-- Run under grid control using    2 GB and   1 CPU  for stage 'overlap error adjustment'.
-- Run under grid control using   31 GB and   8 CPUs for stage 'bogart'.
-- Run under grid control using    4 GB and   2 CPUs for stage 'GFA alignment and processing'.
-- Run under grid control using   31 GB and   8 CPUs for stage 'consensus'.
--
-- Generating assembly 'F_vert2' in '/shared/F_vert2_auto'
--
-- Parameters:
--
--  genomeSize        50000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0450 (  4.50%)
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
-- Finished stage 'merylCheck', reset canuIteration.
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
Use of uninitialized value $error[0] in join or string at /usr/share/perl/5.18/Carp.pm line 301.
================================================================================
Please panic.  Canu failed, and it shouldn't have.

Stack trace:

 at /home/ubuntu/canu-1.5/Linux-amd64/bin/lib/canu/Meryl.pm line 668.
    canu::Meryl::merylProcess('F_vert2', 'cor') called at /home/ubuntu/canu-1.5/Linux-amd64/bin/canu line 536

Last few lines of the relevant log file (correction/0-mercounts/F_vert2.ms16.histogram.info):

merylStreamReader()-- ERROR: ./F_vert2.ms16.mcidx is not a merylStream index file!
merylStreamReader()-- ERROR: ./F_vert2.ms16.mcdat is not a merylStream data file!

Canu release v1.5 failed with:
  didn't find any mers?

Here is the output of correction/0-mercounts/F_vert2.ms16.histogram.info:

merylStreamReader()-- ERROR: ./F_vert2.ms16.mcidx is not a merylStream index file!
merylStreamReader()-- ERROR: ./F_vert2.ms16.mcdat is not a merylStream data file!

Found some errors in correction/0-mercounts/meryl.1.out

Computing 2 segments using 2 threads and 5734MB memory (3593MB if in one batch).
  numMersActual      = 5321595752
  mersPerBatch       = 3951034368
  basesPerBatch      = 2674092031
  numBuckets         = 134217728 (27 bits)
  bucketPointerWidth = 32
  merDataWidth       = 5
Computing segment 1 of 2.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
Computing segment 2 of 2.
 Allocating 512MB for bucket pointer table (32 bits wide).
 Allocating 512MB for counting the size of each bucket.
 Counting mers in buckets: 2660.75 Mmers -- 16.75 Mmers/second
 Creating bucket pointers.
 Releasing 512MB from counting the size of each bucket.
 Allocating 1593MB for mer storage (5 bits wide).
 Counting mers in buckets: 2660.85 Mmers -- 16.48 Mmers/second
 Creating bucket pointers.
 Releasing 512MB from counting the size of each bucket.
 Allocating 1593MB for mer storage (5 bits wide).
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Can fit 7902068736 mers into table with prefix of 27 bits, using 5734.000MB (0.000MB for positions)
Aborted (core dumped)

Also some errors in correction/0-mercounts/F_vert2.ms16.estMerThresh.err

Failed with 'Segmentation fault'; backtrace (libbacktrace):
AS_UTL/AS_UTL_stackTrace.C::102 in _Z17AS_UTL_catchCrashiP7siginfoPv()
(null)::0 in (null)()
(null)::0 in (null)()
meryl/estimate-mer-threshold.C::113 in _Z13loadHistogramP8_IO_FILERmS1_S1_RjRPj()
meryl/estimate-mer-threshold.C::199 in main()
(null)::0 in (null)()
(null)::0 in (null)()
Segmentation fault (core dumped)
skoren commented 7 years ago

The canu job is running out of memory. Usually h_vmem is not the right parameter: it is not a consumable resource and it is not scaled by the number of threads. Since canu divides the memory it needs by the number of threads requested, it ends up under-requesting memory. You can check your configuration with the qconf command:

#name               shortcut   type        relop requestable consumable default  urgency 
#----------------------------------------------------------------------------------------
h_vmem              h_vmem     MEMORY      <=    YES         NO         0        0
mem_free            mf         MEMORY      <=    YES         YES        0        0

If you have mem_free (or another requestable, consumable) memory option, I would use that for MEMORY instead of h_vmem.
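
For example, the only change to the original command would be the value of gridEngineMemoryOption (a sketch only; mem_free is assumed here, so substitute whatever resource qconf reports as both requestable and consumable on your grid):

# Same invocation as before, but requesting memory through mem_free instead of h_vmem
canu -p F_vert2 -d F_vert2_auto genomeSize=50m \
  -pacbio-raw /shared/filtered_subreads.fasta \
  gridEngineMemoryOption="-l mem_free=MEMORY" \
  gridEngineThreadsOption="-pe make THREADS"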

moudixtc commented 7 years ago

Thank you for the quick reply. It looks like the cluster doesn't have any consumable memory resource, though...

ubuntu@ip-10-30-3-219:/shared/F_vert2_auto$ qconf -sc | grep MEMORY
h_core              h_core     MEMORY    <=    YES         NO         0        0
h_data              h_data     MEMORY    <=    YES         NO         0        0
h_fsize             h_fsize    MEMORY    <=    YES         NO         0        0
h_rss               h_rss      MEMORY    <=    YES         NO         0        0
h_stack             h_stack    MEMORY    <=    YES         NO         0        0
h_vmem              h_vmem     MEMORY    <=    YES         NO         0        0
mem_free            mf         MEMORY    <=    YES         NO         0        0
mem_total           mt         MEMORY    <=    YES         NO         0        0
mem_used            mu         MEMORY    >=    YES         NO         0        0
s_core              s_core     MEMORY    <=    YES         NO         0        0
s_data              s_data     MEMORY    <=    YES         NO         0        0
s_fsize             s_fsize    MEMORY    <=    YES         NO         0        0
s_rss               s_rss      MEMORY    <=    YES         NO         0        0
s_stack             s_stack    MEMORY    <=    YES         NO         0        0
s_vmem              s_vmem     MEMORY    <=    YES         NO         0        0
swap_free           sf         MEMORY    <=    YES         NO         0        0
swap_rate           sr         MEMORY    >=    YES         NO         0        0
swap_rsvd           srsv       MEMORY    >=    YES         NO         0        0
swap_total          st         MEMORY    <=    YES         NO         0        0
swap_used           su         MEMORY    >=    YES         NO         0        0
virtual_free        vf         MEMORY    <=    YES         NO         0        0
virtual_total       vt         MEMORY    <=    YES         NO         0        0
virtual_used        vu         MEMORY    >=    YES         NO         0        0
brianwalenz commented 7 years ago

Looking through the cfncluster docs, it appears they support Slurm. Canu works quite well with Slurm. Can you use that?

I'm not at all familiar with cfncluster. A little searching hints that some people are using 'post_install' to further tune the SGE configuration. It's fairly easy to add memory tracking, but I'd need to dig out my notes to remember how.
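
One possible way to do that from a post_install step (a rough sketch only, assuming the stock SGE tools are available; the 31G value and the hostname are placeholders, not values from this cluster) is to mark mem_free as consumable and then advertise each host's memory:

# Dump the complex definitions, flip mem_free's "consumable" column from NO to YES, reload
qconf -sc > complexes.txt
# ... edit complexes.txt by hand or with sed ...
qconf -Mc complexes.txt
# Set the amount of memory each exec host actually has (repeat per node)
qconf -rattr exechost complex_values mem_free=31G <hostname>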

The final option is to configure canu to raise the minimum memory needed by specific components. The problem seems to be jobs being scheduled on the smaller node, so merylMemory=16g would prevent this. I'd also suggest ovlThreads=4 to keep the overlapper off that node too.
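
As an illustration of what that looks like on the command line (a sketch based on the original invocation; keep whichever grid-engine options you end up using):

# Raise the meryl memory floor and the overlapper thread count so those jobs avoid the 7 GB node
canu -p F_vert2 -d F_vert2_auto genomeSize=50m \
  -pacbio-raw /shared/filtered_subreads.fasta \
  merylMemory=16g ovlThreads=4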

Or, I suspect just getting rid of that smaller node would solve the problem too.

moudixtc commented 7 years ago

So I tried using gridEngineMemoryOption="-l mem_free=MEMORY" instead, and it got through the initial steps but again failed later due to an out-of-memory issue. Then I switched to Slurm, and it worked out of the box. Thank you for the help!