marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

best setting to increase walltime for unitigging - Overlap jobs failed #1207

Closed thsyd closed 5 years ago

thsyd commented 5 years ago

Hi, I hope it's OK for me to ask for some guidance. I'm assembling an approx. 5 Mb bacterial genome. Input is Nanopore rapid barcoding kit data run on a MinION, processed with Deepbinner -> Albacore -> Porechop (trim only). Approximately 50x coverage. NanoStat summary:

General summary:
Mean read length:     9446.6
Mean read quality:    8.9
Median read length:   6173.0
Median read quality:  9.1
Number of reads:      28081
Read length N50:      13751
Total bases:          265269746
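(That ~50x estimate is consistent with total bases over the genome size used below: 265,269,746 bp / 5,000,000 bp ≈ 53x.)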

I'm using Canu 1.8, installed last week via conda from the Bioconda repository:

conda install -c bioconda canu=1.8

canu -version gives

> Canu snapshot (snapshot)

I'm submitting to our HPC (a Unix-based system); a script excerpt is below. I initially ran with 12 threads and 20 GB RAM; the second time I increased to 28 threads and 120 GB RAM to see if the error was due to running out of memory.

### Number of nodes
#PBS -l nodes=1:ppn=28
### Memory
#PBS -l mem=120gb
### Requesting walltime - format is <hours>:<minutes>:<seconds> (here, 18 hours)
#PBS -l walltime=18:00:00

canu \
    -p $ISOLATE_NAME.canu.ont \
    -d canu_ont \
    genomeSize=5m \
    -nanopore-raw $ONT_FILT
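
(An alternative sketch, not tried here: setting useGrid=false should make Canu run every stage inside this single 28-core allocation instead of submitting its own array jobs, which would sidestep per-job walltime limits at the cost of node-level parallelism:)

canu \
    -p $ISOLATE_NAME.canu.ont \
    -d canu_ont \
    genomeSize=5m \
    useGrid=false \
    -nanopore-raw $ONT_FILT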

Canu.out:

Found perl:
   /bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /bin/java
   openjdk version "1.8.0_171"

Found canu:
   /home/people/thosyd/.conda/envs/t_conda18/libexec/bin/canu
   Canu snapshot (snapshot)

-- Canu snapshot (snapshot)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https://doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošić M, Šikić M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_171' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 28 CPUs and 126 gigabytes of memory.
-- Detected PBS/Torque '6.1.0' with 'pbsnodes' binary in /usr/local/bin/pbsnodes.
-- Detecting PBS/Torque resources.
-- 
-- Found  16 hosts with  32 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  64 cores and  944 GB memory under PBS/Torque control.
-- Found 360 hosts with  28 cores and  125 GB memory under PBS/Torque control.
-- Found   1 host  with  64 cores and 1007 GB memory under PBS/Torque control.
-- Found   1 host  with  48 cores and  503 GB memory under PBS/Torque control.
-- Found   4 hosts with  56 cores and  125 GB memory under PBS/Torque control.
-- Found   2 hosts with  28 cores and  110 GB memory under PBS/Torque control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl      8 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    6 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl     4 GB    4 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       16 GB    4 CPUs  (contig construction with bogart)
-- Grid:  gfa        8 GB    4 CPUs  (GFA alignment and processing)
--
-- In 'ATCC25285.canu.ont.seqStore', found Nanopore reads:
--   Raw:        28081
--   Corrected:  12858
--   Trimmed:    12858
--
-- Generating assembly 'ATCC25285.canu.ont' in '/home/projects/cu_10128/data/gen/assemblyvalidation/ATCC/canu_ont'
--
-- Parameters:
--
--  genomeSize        5000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Overlap jobs failed, tried 2 times, giving up.
--   job unitigging/1-overlapper/001/000002.ovb FAILED.
--

ABORT:
ABORT: Canu snapshot (snapshot)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

I restarted canu by submitting the same job as before. Same error.

The final line of utgovl_ATCC25285.canu.ont.o22434458-2 (below) indicates the problem, yes? "PBS: job killed: walltime 3631 exceeded limit 3600"

I don't suppose it's the total walltime that was exceeded; the job ends quite quickly. What is the best way to request more walltime for the jobs submitted as part of the unitigging step? Do I set it through gridOptions or gridOptionsCNS?

Found perl:
   /bin/perl
   This is perl 5, version 16, subversion 3 (v5.16.3) built for x86_64-linux-thread-multi

Found java:
   /bin/java
   openjdk version "1.8.0_171"

Found canu:
   /home/people/thosyd/.conda/envs/t_conda18/libexec/bin/canu
   Canu snapshot (snapshot)

Running job 2 based on PBS_ARRAYID=2 and offset=0.
mkdir: cannot create directory './001': File exists

STRING_NUM_BITS          31
OFFSET_BITS              31
STRING_NUM_MASK          2147483647
OFFSET_MASK              2147483647
MAX_STRING_NUM           2147483647

Hash_Mask_Bits           22
bgnHashID                11550
bgnHashID                22864
Max_Hash_Data_Len        80031300
Max_Hash_Load            0.800000
Kmer Length              22
Min Overlap Length       500
Max Error Rate           0.120000
Min Kmer Matches         0

Num_PThreads             4

sizeof(Hash_Bucket_t)    216
sizeof(Check_Vector_t)   4
sizeof(Hash_Frag_Info_t) 4

HASH_TABLE_SIZE          4194304

hash table size:         864 MB
hash check array         16 MB
string info              0 MB
string start             0 MB

Initializing 4 work areas.
Build_Hash_Index from 11550 to 22864

Found 5158 reads with length 80031300 to load; 6157 skipped by being too short; 0 skipped per library restriction
HASH LOADING STOPPED: curID           22864 out of        22864
HASH LOADING STOPPED: length       80031300 out of     80031300 max.
HASH LOADING STOPPED: entries      16967991 out of     70464307 max (load 19.26).

Read 877 kmers to mark to skip

Range: 1-22864.  Store has 28081 reads.
Chunk: 715 reads/thread -- (G.endRefID=22864 - G.bgnRefID=1) / G.Num_PThreads=4 / 8

Starting 1-22864 with 715 per thread

Thread 00 processes reads 1-715
Thread 03 processes reads 2146-2860
Thread 01 processes reads 716-1430
Thread 02 processes reads 1431-2145
Thread 01 writes    reads 716-1430 (9563 overlaps 9563/20310/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 2861-3575
Thread 00 writes    reads 1-715 (11354 overlaps 11354/21033/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 3576-4290
Thread 03 writes    reads 2146-2860 (10422 overlaps 10422/17672/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 4291-5005
Thread 02 writes    reads 1431-2145 (11205 overlaps 11205/21363/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 5006-5720
Thread 00 writes    reads 3576-4290 (10724 overlaps 10724/19870/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 5721-6435
Thread 01 writes    reads 2861-3575 (9723 overlaps 9723/18035/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 6436-7150
Thread 03 writes    reads 4291-5005 (9986 overlaps 9986/17574/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 7151-7865
Thread 02 writes    reads 5006-5720 (10902 overlaps 10902/19996/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 7866-8580
Thread 03 writes    reads 7151-7865 (10865 overlaps 10865/18016/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 8581-9295
Thread 00 writes    reads 5721-6435 (11075 overlaps 11075/19614/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 9296-10010
Thread 01 writes    reads 6436-7150 (10914 overlaps 10914/21688/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 10011-10725
Thread 02 writes    reads 7866-8580 (10811 overlaps 10811/20774/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 10726-11440
Thread 00 writes    reads 9296-10010 (10722 overlaps 10722/19132/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 11441-12155
Thread 01 writes    reads 10011-10725 (10416 overlaps 10416/21007/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 12156-12870
Thread 03 writes    reads 8581-9295 (11023 overlaps 11023/17959/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 12871-13585
Thread 02 writes    reads 10726-11440 (9493 overlaps 9493/16944/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 13586-14300
Thread 00 writes    reads 11441-12155 (10119 overlaps 10119/20201/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 14301-15015
Thread 03 writes    reads 12871-13585 (9154 overlaps 9154/17466/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 15016-15730
Thread 01 writes    reads 12156-12870 (10177 overlaps 10177/19461/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 15731-16445
Thread 02 writes    reads 13586-14300 (8597 overlaps 8597/17630/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 16446-17160
Thread 00 writes    reads 14301-15015 (7761 overlaps 7761/13297/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 17161-17875
Thread 03 writes    reads 15016-15730 (7098 overlaps 7098/13499/0 kmer hits with/without overlap/skipped)
Thread 03 processes reads 17876-18590
Thread 01 writes    reads 15731-16445 (6488 overlaps 6488/10004/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 18591-19305
Thread 02 writes    reads 16446-17160 (5501 overlaps 5501/9977/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 19306-20020
Thread 00 writes    reads 17161-17875 (5163 overlaps 5163/9609/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 20021-20735
Thread 01 writes    reads 18591-19305 (3435 overlaps 3435/5519/0 kmer hits with/without overlap/skipped)
Thread 01 processes reads 20736-21450
Thread 02 writes    reads 19306-20020 (3230 overlaps 3230/6307/0 kmer hits with/without overlap/skipped)
Thread 02 processes reads 21451-22165
Thread 00 writes    reads 20021-20735 (2576 overlaps 2576/4820/0 kmer hits with/without overlap/skipped)
Thread 00 processes reads 22166-22864
Thread 00 writes    reads 22166-22864 (358 overlaps 358/647/0 kmer hits with/without overlap/skipped)
Thread 02 writes    reads 21451-22165 (1067 overlaps 1067/1846/0 kmer hits with/without overlap/skipped)
Thread 03 writes    reads 17876-18590 (4751 overlaps 4751/8636/0 kmer hits with/without overlap/skipped)
=>> PBS: job killed: walltime 3631 exceeded limit 3600


skoren commented 5 years ago

Yes, the job exceeded the default walltime, which is 1 hour. See the FAQ: https://canu.readthedocs.io/en/latest/faq.html#how-do-i-run-canu-on-my-slurm-sge-pbs-lsf-torque-system. For example, you can add gridOptions="-l walltime=24:00:00" to request 24 hours.
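
For example, re-running the same command from the existing -d directory with that option added lets Canu resume where it stopped and pass the walltime request to every job it submits (the 24-hour value is just an illustration; use whatever your queue allows):

canu \
    -p $ISOLATE_NAME.canu.ont \
    -d canu_ont \
    genomeSize=5m \
    gridOptions="-l walltime=24:00:00" \
    -nanopore-raw $ONT_FILT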