marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Job stops after correction with no obvious message explaining why #1442

Closed · guibar closed this issue 5 years ago

guibar commented 5 years ago

Hello, I am trying to perform an assembly with a mix of PacBio and Nanopore reads, and my job stops quite early on without much explanation.

The command I used is:

canu \
 -p doublon-mix -d doublon-mix \
  stopOnReadQuality=false \
  genomeSize=380m gridEngineThreadsOption='--cpus-per-task THREADS' \
 -pacbio-raw /home/gbarreau/work/melon/doublon/raw/pacbio.1.fasta \
 -pacbio-raw /home/gbarreau/work/melon/doublon/raw/pacbio.2.fasta \
 -pacbio-raw /home/gbarreau/work/melon/doublon/raw/pacbio.3.fasta \
 -pacbio-raw /home/gbarreau/work/melon/doublon/raw/pacbio.4.fasta \
 -nanopore-raw /home/gbarreau/work/melon/doublon/raw/minion.fastq.gz

I am running this on a cluster via Slurm. The slurm file looks like this:

#!/bin/bash
#SBATCH --job-name=canu.doublon-mix
#SBATCH --output=canu.doublon-mix.out
#SBATCH --error=canu.doublon-mix.err
#SBATCH --time=48:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G

The canu.doublon-mix.err is pasted below:

-- Canu 1.8
--
-- CITATIONS
--  ...
--  
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_102' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 64 CPUs and 252 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 2500 jobs.
-- 
-- Found  60 hosts with  40 cores and  250 GB memory under Slurm control.
-- Found   1 host  with  64 cores and  125 GB memory under Slurm control.
-- Found   1 host  with  96 cores and 1500 GB memory under Slurm control.
-- Found  48 hosts with  64 cores and  250 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     15 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       12 GB    8 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   13 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8 GB    8 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16 GB    1 CPU   (overlap store sorting)
-- Grid:  red        7 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64 GB    8 CPUs  (contig construction with bogart)
-- Grid:  gfa        8 GB    8 CPUs  (GFA alignment and processing)
--
-- Found both PacBio and Nanopore uncorrected reads in the input files.
--
-- Generating assembly 'doublon-mix' in '/work/gbarreau/melon/doublon/asm/a2/doublon-mix'
--
-- Parameters:
--
--  genomeSize        380000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Tue Aug 13 16:19:26 2019 with 363051.166 GB free disk space

    cd .
    /usr/local/bioinfo/src/canu/canu-1.8/Linux-amd64/bin/sqStoreCreate \
      -o ./doublon-mix.seqStore.BUILDING \
      -minlength 1000 \
      ./doublon-mix.seqStore.ssi \
    > ./doublon-mix.seqStore.BUILDING.err 2>&1

-- Finished on Tue Aug 13 16:31:59 2019 (753 seconds) with 362985.995 GB free disk space
----------------------------------------
--
-- In sequence store './doublon-mix.seqStore':
--   Found 2756782 reads.
--   Found 42994899934 bases (113.14 times coverage).
--
--   Read length histogram (one '*' equals 9724.62 reads):
--        0   4999 680724 **********************************************************************
--     5000   9999 577074 ***********************************************************
--    10000  14999 397633 ****************************************
--    15000  19999 294846 ******************************
--    20000  24999 238454 ************************
--    25000  29999 178198 ******************
--    30000  34999 124588 ************
--    35000  39999  86033 ********
--    40000  44999  58979 ******
--    45000  49999  40841 ****
--    50000  54999  27757 **
--    55000  59999  18609 *
--    60000  64999  12024 *
--    65000  69999   7927 
--    70000  74999   5037 
--    75000  79999   3166 
--    80000  84999   1963 
--    85000  89999   1154 
--    90000  94999    690 
--    95000  99999    384 
--   100000 104999    229 
--   105000 109999    154 
--   110000 114999    103 
--   115000 119999     52 
--   120000 124999     41 
--   125000 129999     26 
--   130000 134999     15 
--   135000 139999     15 
--   140000 144999     14 
--   145000 149999      8 
--   150000 154999      8 
--   155000 159999      8 
--   160000 164999     11 
--   165000 169999      3 
--   170000 174999      2 
--   175000 179999      3 
--   180000 184999      0 
--   185000 189999      0 
--   190000 194999      0 
--   195000 199999      1 
--   200000 204999      0 
--   205000 209999      0 
--   210000 214999      3 
--   215000 219999      0 
--   220000 224999      1 
--   225000 229999      2 
--   230000 234999      0 
--   235000 239999      0 
--   240000 244999      0 
--   245000 249999      0 
--   250000 254999      1 
--   255000 259999      1 
----------------------------------------
-- Starting command on Tue Aug 13 16:32:20 2019 with 362985.981 GB free disk space

    cd correction/0-mercounts
    ./meryl-configure.sh \
    > ./meryl-configure.err 2>&1

-- Finished on Tue Aug 13 16:32:24 2019 (4 seconds) with 362985.981 GB free disk space
----------------------------------------
--  segments   memory batches
--  -------- -------- -------
--        01 14.50 GB       1
--        02 14.00 GB       1
--        04 13.50 GB       1
--        06 13.00 GB       1
--        08 12.00 GB       1
--        12  8.00 GB       1
--        16  6.00 GB       1
--        20  5.00 GB       1
--        24  4.00 GB       1
--        32  3.00 GB       1
--        40  2.50 GB       1
--        48  2.00 GB       1
--        56  2.00 GB       1
--        64  1.50 GB       1
--        96  1.19 GB       1
--
--  For 2756782 reads with 42994899934 bases, limit to 429 batches.
--  Will count kmers using 01 jobs, each using 16 GB and 8 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 7225836 task 1.
--
----------------------------------------
-- Starting command on Tue Aug 13 16:32:24 2019 with 362985.981 GB free disk space

    cd /work/gbarreau/melon/doublon/asm/a2/doublon-mix
    sbatch \
      --depend=afterany:7225836 \
      --mem-per-cpu=4g \
      --cpus-per-task 1   \
      -D `pwd` \
      -J 'canu_doublon-mix' \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh

-- Finished on Tue Aug 13 16:32:24 2019 (in the blink of an eye) with 362985.981 GB free disk space
----------------------------------------

Thanks for your help!

skoren commented 5 years ago

This is the expected behavior; see the docs: https://canu.readthedocs.io/en/latest/tutorial.html#execution-configuration. Canu submits itself to the grid to continue running, so if you check your running jobs you'll see the canu processes still going. The canu.out file in the assembly folder (doublon-mix in your case) will have the output of the latest step.
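For example, something like the following (a minimal sketch, assuming the default canu_ job-name prefix and the -d directory used above) will show the resubmitted job and let you follow the current step:

# list your queued/running jobs; the resubmitted Canu job is named canu_doublon-mix here
squeue -u $USER -o "%.12i %.30j %.10T %.12M"

# follow the output of the step Canu is currently running
tail -f doublon-mix/canu.out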

matryoskina commented 5 years ago

Hi Sergey, actually I have the same problem. I submitted canu on a SLURM cluster. The slurm.log file looks exactly like guibar's (except that I have "3 hosts with 128 cores and 3023 GB memory under Slurm control"), but canu.out reports an error, so canu does not actually complete the correction step, and there is no clear message about the problem.

cat canu.out

Found perl:
   /usr/bin/perl
   This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_191"

Found canu:
   /net/cephfs/home/arusso/softwares/canu-1.8/Linux-amd64/bin/canu
   Canu 1.8

-- Canu 1.8
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https://doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_191' (from 'java') with -d64 support.
-- Detected gnuplot version '5.0 patchlevel 3   ' (from 'gnuplot') and image format 'png'.
-- Detected 128 CPUs and 3024 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 9999 jobs.
-- 
-- Found   3 hosts with 128 cores and 3023 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     64 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16 GB   64 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   64 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    24 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    24 GB   16 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       20 GB    8 CPUs  (read error detection)
-- Grid:  oea        8 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat     1024 GB   64 CPUs  (contig construction with bogart)
-- Grid:  gfa       64 GB   64 CPUs  (GFA alignment and processing)
--
-- In 'canu_assembly.seqStore', found Nanopore reads:
--   Raw:        19078404
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'canu_assembly' in '/net/cephfs/scratch/arusso/canu/results'
--
-- Parameters:
--
--  genomeSize        5000000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
--
-- Kmer counting (meryl-count) jobs failed, tried 2 times, giving up.
--   job canu_assembly.01.meryl FAILED.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

I tried the correction step with a very small subset of my reads, and it completed successfully. Any idea what the problem could be? Thanks

skoren commented 5 years ago

This is not the same issue, but most likely you need to include additional grid options for your system to run canu (see the FAQ regarding gridOptions). Post the output from any out files in the correction/0-mercounts folder (something like meryl-count*.out); they have more information on the error.
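For example (assuming the -d directory from this run, /net/cephfs/scratch/arusso/canu/results), the relevant logs can be pulled out with something like:

cd /net/cephfs/scratch/arusso/canu/results/correction/0-mercounts
ls meryl-count*.out
tail -n 30 meryl-count*.out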

matryoskina commented 5 years ago

cat meryl-count.2318411_1.out

Found perl:
   /usr/bin/perl
   This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_191"

Found canu:
   /net/cephfs/home/arusso/softwares/canu-1.8/Linux-amd64/bin/canu
   Canu 1.8

Running job 1 based on SLURM_ARRAY_TASK_ID=1 and offset=0.

Counting 186 billion canonical 16-mers from 1 input file:
    canu-seqStore: ../../canu_assembly.seqStore

SIMPLE MODE

16-mers -> 4294967296 entries for counts up to 65535. -> 64 Gbits memory used

200215868167 input bases -> expected max count of 800863472, needing 15 extra bits. -> 60 Gbits memory used

15 GB memory needed

COMPLEX MODE

 prefix     # of   struct   kmers/    segs/      data    total
   bits   prefix   memory   prefix   prefix    memory   memory
 ------  -------  -------  -------  -------  --------  -------
 1     2  P    90 MB    93 GM  5783 kS   722 GB   722 GB
 2     4  P    87 MB    46 GM  2798 kS   699 GB   699 GB
 3     8  P    84 MB    23 GM  1352 kS   675 GB   676 GB
 4    16  P    81 MB    11 GM   652 kS   652 GB   652 GB
 5    32  P    78 MB  5966 MM   314 kS   629 GB   629 GB
 6    64  P    75 MB  2983 MM   151 kS   606 GB   606 GB
 7   128  P    72 MB  1491 MM    72 kS   582 GB   582 GB
 8   256  P    69 MB   745 MM    34 kS   559 GB   559 GB
 9   512  P    67 MB   372 MM    16 kS   536 GB   536 GB
10  1024  P    64 MB   186 MM  8209  S   512 GB   512 GB
11  2048  P    61 MB    93 MM  3918  S   489 GB   489 GB
12  4096  P    58 MB    46 MM  1866  S   466 GB   466 GB
13  8192  P    56 MB    23 MM   887  S   443 GB   443 GB
14    16 kP    54 MB    11 MM   420  S   419 GB   419 GB
15    32 kP    52 MB  5966 kM   199  S   397 GB   397 GB
16    64 kP    53 MB  2983 kM    94  S   375 GB   375 GB
17   128 kP    56 MB  1491 kM    44  S   351 GB   351 GB
18   256 kP    66 MB   745 kM    21  S   335 GB   335 GB
19   512 kP    88 MB   372 kM    10  S   319 GB   319 GB
20  1024 kP   136 MB   186 kM     5  S   319 GB   319 GB
21  2048 kP   240 MB    93 kM     3  S   383 GB   384 GB
22  4096 kP   416 MB    46 kM     1  S   255 GB   256 GB  Best Value!
23  8192 kP   832 MB    23 kM     1  S   511 GB   512 GB
24    16 MP  1664 MB    11 kM     1  S  1023 GB  1025 GB
25    32 MP  3328 MB  5967  M     1  S  2047 GB  2050 GB

FINAL CONFIGURATION

Configured simple mode for 15.500 GB memory per batch, and up to 1 batch.

Loading kmers from '../../canu_assembly.seqStore' into buckets.
Increasing to 17-bit storage (for kmer 0x0000000022222222).
Increasing to 18-bit storage (for kmer 0x0000000022222222).
Increasing to 19-bit storage (for kmer 0x0000000022222222).
slurmstepd: JOB 2318411 ON hydra2 CANCELLED AT 2019-10-14T17:30:34 DUE TO TIME LIMIT

It seems I have to increase the time limit for each batch job, is that right? Should I do it via Canu or via the SLURM system? Thanks a lot

skoren commented 5 years ago

You need to tell canu how to request more time (see: https://canu.readthedocs.io/en/latest/faq.html#how-do-i-run-canu-on-my-slurm-sge-pbs-lsf-torque-system). I'm not sure what your default time is, but you could request something like 3 days. You need to add gridOptions="--time 72:00:00", assuming your partition allows that time limit.
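As a sketch only (the input file path below is a placeholder; reuse the exact options from your original command), restarting in the same -p/-d directory with the extra option lets Canu resume from the failed meryl step:

# /path/to/nanopore_reads.fastq.gz is a placeholder for the original input file(s)
canu \
 -p canu_assembly -d /net/cephfs/scratch/arusso/canu/results \
 genomeSize=5g \
 gridOptions="--time 72:00:00" \
 -nanopore-raw /path/to/nanopore_reads.fastq.gz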