marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

CANU failed to restart at ovlStoreBuild command #945

Closed. prashantshingate closed this issue 6 years ago.

prashantshingate commented 6 years ago

Hello CANU team,

Thank you for creating this great assembler and for the support you provide to users. I am using Canu v1.7 to assemble a genome ~1 Gb in size from ~100X raw PacBio reads (~50X from Sequel and ~50X from RS II).

I am using a supercomputer for this purpose.

It uses a PBSPro scheduler, but there are many restrictions on job submission. For example, memory per node is capped at 96 GB even though the real memory is 126 GB; walltime for multi-node jobs (queue=normal/largemem) is limited to 24 hours, while single-node jobs (queue=long) are limited to 120 hours. The walltime parameter is also mandatory at submission.

Because of these restrictions, I could not start the job using the grid (it failed at the k-mer counting stage itself). I therefore ran without the grid, first on the largemem queue (cores=48, memory=2TB, walltime=24:00:00) and then, for the ovStoreBuild stage, on the long queue (cores=24, memory=96GB, walltime=120:00:00).
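One option I have not tried for the walltime restriction would be Canu's gridOptions parameter, which passes a string to every grid submit command; a minimal sketch, assuming PBSPro's -l walltime syntax and a hypothetical output directory:

    # Sketch only, not from this run: ask canu to append a walltime
    # request to every PBSPro submission it makes.
    ~/scratch/softwares/canu-1.7/Linux-amd64/bin/canu \
     -p es -d es-pacbio-grid \
     genomeSize=1g \
     gridOptions="-l walltime=24:00:00" \
     -pacbio-raw sequel_rs2_all.fq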

Canu command used:

    ~/scratch/softwares/canu-1.7/Linux-amd64/bin/canu \
     useGrid=false \
     -p es -d es-pacbio2 \
     maxMemory=96g maxThreads=24 \
     genomeSize=1g \
     -pacbio-raw sequel_rs2_all.fq

It ran well on that system but got stuck at the ovStoreBuild stage, which could not finish within the 120-hour walltime. When I resubmitted Canu, I got the following error, along with the last few lines of es.ovlStore.err.

Is there any way to configure Canu to resume from the point where the ovStoreBuild stage stopped? Sorry for the long message. Thank you once again for your help and time. Best Regards, Prashant Shingate

-- Canu 1.7
--
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_91' (from '/app/java/default/bin/java').
-- Detected gnuplot version '5.0 patchlevel 3' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 126 gigabytes of memory.
-- Limited to 96g gigabytes from maxMemory option.
-- Limited to 24 CPUs from maxThreads option.
-- Detected PBSPro '14.2.4.20171012010902' with 'pbsnodes' binary in /opt/pbs/bin/pbsnodes.
-- Grid engine disabled per useGrid=false option.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: meryl     96 GB   24 CPUs x   1 job     96 GB   24 CPUs  (k-mer counting)
-- Local: cormhap   32 GB   12 CPUs x   2 jobs    64 GB   24 CPUs  (overlap detection with mhap)
-- Local: obtovl    16 GB   12 CPUs x   2 jobs    32 GB   24 CPUs  (overlap detection)
-- Local: utgovl    16 GB   12 CPUs x   2 jobs    32 GB   24 CPUs  (overlap detection)
-- Local: ovb        4 GB    1 CPU  x  24 jobs    96 GB   24 CPUs  (overlap store bucketizer)
-- Local: ovs       32 GB    1 CPU  x   3 jobs    96 GB    3 CPUs  (overlap store sorting)
-- Local: red        8 GB    4 CPUs x   6 jobs    48 GB   24 CPUs  (read error detection)
-- Local: oea        4 GB    1 CPU  x  24 jobs    96 GB   24 CPUs  (overlap error adjustment)
-- Local: bat       96 GB   16 CPUs x   1 job     96 GB   16 CPUs  (contig construction)
-- Local: gfa       16 GB   16 CPUs x   1 job     16 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'es.gkpStore', found PacBio reads:
--   Raw:        27336043
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'es' in '/scratch/users/astar/imcb/prashan1/es_pacbio/canu/es-pacbio2'
--
-- Parameters:
--
--  genomeSize        1000000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Fri Jun  8 09:40:28 2018 with 857843.991 GB free disk space

    cd correction
    /home/projects/13000217/softwares/canu-1.7/Linux-amd64/bin/ovStoreBuild \
     -O ./es.ovlStore.BUILDING \
     -G ./es.gkpStore \
     -M 4-32 \
     -L ./1-overlapper/ovljob.files \
     > ./es.ovlStore.err 2>&1

-- Finished on Fri Jun  8 09:42:27 2018 (119 seconds) with 857835.791 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 1.  (rc=256)
ERROR:

ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to create the overlap store.
ABORT:
ABORT: Disk space available:  857835.791 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/es.ovlStore.err):
ABORT:
ABORT:

Last few lines of correction/es.ovlStore.err:

  bucket 8090 has 126033986 olaps.
  bucket 8091 has 126261101 olaps.
  bucket 8092 has 126207920 olaps.
  bucket 8093 has 126039083 olaps.
  bucket 8094 has 126097048 olaps.
  bucket 8095 has 126204069 olaps.
  bucket 8096 has 126314633 olaps.
  bucket 8097 has 126216289 olaps.
  bucket 8098 has 126169914 olaps.
  bucket 8099 has 126175448 olaps.
  bucket 8100 has 126396659 olaps.
  bucket 8101 has 126199898 olaps.
  bucket 8102 has 126272586 olaps.
  bucket 8103 has 126038198 olaps.
  bucket 8104 has 126264463 olaps.
  bucket 8105 has 126113782 olaps.
  bucket 8106 has 126040218 olaps.
  bucket 8107 has 126353020 olaps.
  bucket 8108 has 30732616 olaps.
Will sort 126.030 million overlaps per bucket, using 8108 buckets 4.01 GB per bucket.

-- BUCKETIZING --

ERROR:  './es.ovlStore.BUILDING' is a valid ovStore; cannot create a new one.
skoren commented 6 years ago

You can restart if you first remove the es.ovlStore.BUILDING folder. On a genome of your size we'd typically use parallel store construction, but that is disabled off-grid to avoid overloading filesystems. You could add ovsMethod=forceparallel to turn it back on, since I expect your filesystem can handle the added I/O.

You're running completely off-grid, which slows down the compute, but you can typically restart the run with the same command if it fails again.
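Put together, a restart along those lines might look like the following; this is a minimal sketch based on the command above, assuming the same output directory and input reads:

    # Sketch: remove the partially built overlap store, then rerun the
    # original canu command with parallel store construction forced on.
    rm -rf es-pacbio2/correction/es.ovlStore.BUILDING
    ~/scratch/softwares/canu-1.7/Linux-amd64/bin/canu \
     useGrid=false \
     -p es -d es-pacbio2 \
     maxMemory=96g maxThreads=24 \
     genomeSize=1g \
     ovsMethod=forceparallel \
     -pacbio-raw sequel_rs2_all.fq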

prashantshingate commented 6 years ago

Thank you so much for your reply. I will try the ovsMethod=forceparallel option and update you on the outcome. Thank you once again for your help.

brianwalenz commented 6 years ago

Reopen if needed.