marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

LSF submission issue halting assembly #321

Closed photocyte closed 7 years ago

photocyte commented 7 years ago

Dear Canu developers,

I've run into a canu issue with LSF job submissions that seems to be halting the assembly process. I'm using the most recent version of canu (77d84666cefcadf7f8d6ca5c18fbe1ae60f4c233).

The overall canu job was submitted as follows:

CANU_CMD=/lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu
READS=./reads/*.fastq.gz
$CANU_CMD -p canu_troubleshooting_r1 -d canu_troubleshooting genomeSize=600m -pacbio-raw $READS "gridOptions=-q 14"

canu seemed to get through canu.01.sh, canu.02.sh, and canu.03.sh, but halted on canu.04.sh. I don't think there is anything malformed in the reads. The log ("canu.out") is below:

-- Canu v1.4 (+26 commits) r8021 77d84666cefcadf7f8d6ca5c18fbe1ae60f4c233.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
-- 
-- Found   1 host  with  24 cores and  251 GB memory under LSF control.
-- Found   1 host  with  32 cores and 1007 GB memory under LSF control.
-- Found   1 host  with   4 cores and    7 GB memory under LSF control.
-- Found   4 hosts with  24 cores and  283 GB memory under LSF control.
-- Found   1 host  with  40 cores and  251 GB memory under LSF control.
-- Found   6 hosts with  32 cores and  251 GB memory under LSF control.
-- Found   1 host  with  32 cores and  125 GB memory under LSF control.
-- Found   5 hosts with  24 cores and  141 GB memory under LSF control.
-- Found   1 host  with  32 cores and  188 GB memory under LSF control.
-- Found   2 hosts with   2 cores and    1 GB memory under LSF control.
-- Found   1 host  with 120 cores and 1007 GB memory under LSF control.
-- Found  17 hosts with  24 cores and  188 GB memory under LSF control.
-- Found   3 hosts with  24 cores and  377 GB memory under LSF control.
-- 
-- On LSF detected memory is requested in KB
-- 
--
-- Allowed to run under grid control, and use up to   8 compute threads and   41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 600000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
-- All 32 mhap precompute jobs finished successfully.
-- mhap attempt 1 begins with 0 finished, and 77 to compute.
----------------------------------------
-- Starting command on Fri Dec 30 12:44:46 2016 with 995.32 GB free disk space

      bsub \
        -q 14  -M 32505856 -n 8 \
        -J "cormhap_canu_troubleshooting_r1[1-77]" \
         \
        -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.%I.out \
        /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.sh 0

Job <769810> is submitted to queue <14>.

-- Finished on Fri Dec 30 12:44:46 2016 (lickety-split) with 995.32 GB free disk space
----------------------------------------
----------------------------------------
-- Starting command on Fri Dec 30 12:44:46 2016 with 995.32 GB free disk space

    bsub \
      -M 4194304 \
      -n 1 \
      -q 14  \
      -w "ended("cormhap_canu_troubleshooting_r1[1-77]")" \
      -J "canu_canu_troubleshooting_r1" \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.sh
769804[33]: No matching job found. Job not submitted.

-- Finished on Fri Dec 30 12:44:46 2016 (lickety-split) with 995.32 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 255.  (rc=65280)
ERROR:
================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

 at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1308.
    canu::Execution::caFailure('Failed to submit script', undef) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 667
    canu::Execution::submitScript('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Ppyra...', 'canu_troubleshooting_r1', 'cormhap_canu_troubleshooting_r1[1-77]') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1016
    canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Ppyra...', 'canu_troubleshooting_r1', 'cormhap', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Ppyra...', 'mhap', 1, 2, 3, 4, ...) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/OverlapMhap.pm line 713
    canu::OverlapMhap::mhapCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Ppyra...', 'canu_troubleshooting_r1', 'cor', 'partial') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 429
    main::overlap('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Ppyra...', 'canu_troubleshooting_r1', 'cor') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 473

canu failed with 'Failed to submit script'.

------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b13>
Subject: Job 769805: <canu_canu_troubleshooting_r1> Exited

Job <canu_canu_troubleshooting_r1> was submitted from host <it-c01b11> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b13>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting> was used as the working directory.
Started at Fri Dec 30 12:44:40 2016
Results reported at Fri Dec 30 12:44:46 2016

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.03.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :      0.68 sec.
    Max Memory :         3 MB
    Max Swap   :        37 MB

    Max Processes  :         1
    Max Threads    :         1

The output (if any) is above this job summary.

The issue seems to be with the second bsub submission, specifically its -w "ended("cormhap_canu_troubleshooting_r1[1-77]")" parameter. Perhaps the formatting of the parameter is incorrect, or perhaps it needs to be submitted while the array jobs are known to the queue; that is, it may be going in too early (before LSF registers the jobs) or too late (after LSF has forgotten about them).
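
As a quick diagnostic, right after canu submits the array I could check whether LSF has registered it by name (just a sketch on my part, assuming bjobs -J behaves the way I expect):

    bjobs -J "cormhap_canu_troubleshooting_r1"    # should list the 77 array tasks once LSF knows about them
    bjobs -J "canu_canu_troubleshooting_r1"       # the dependent hold job, if it was accepted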

Oddly enough, despite the parent canu script exiting and producing the log shown above, something was submitted, as I'm still seeing canu jobs on the cluster:

769810  tfallon RUN   14         it-c01b13   it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13 cormhap_canu_troubleshooting_r1[1] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13 cormhap_canu_troubleshooting_r1[24] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13:it-c01b13 cormhap_canu_troubleshooting_r1[47] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02 cormhap_canu_troubleshooting_r1[2] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02 cormhap_canu_troubleshooting_r1[25] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02:it-c01b02 cormhap_canu_troubleshooting_r1[48] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04 cormhap_canu_troubleshooting_r1[3] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04 cormhap_canu_troubleshooting_r1[26] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04:it-c01b04 cormhap_canu_troubleshooting_r1[49] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11 cormhap_canu_troubleshooting_r1[4] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11 cormhap_canu_troubleshooting_r1[27] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11:it-c01b11 cormhap_canu_troubleshooting_r1[50] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07 cormhap_canu_troubleshooting_r1[5] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07 cormhap_canu_troubleshooting_r1[31] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07:it-c05b07 cormhap_canu_troubleshooting_r1[54] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15 cormhap_canu_troubleshooting_r1[6] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15 cormhap_canu_troubleshooting_r1[28] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15:it-c05b15 cormhap_canu_troubleshooting_r1[51] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04 cormhap_canu_troubleshooting_r1[7] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04 cormhap_canu_troubleshooting_r1[29] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04:it-c05b04 cormhap_canu_troubleshooting_r1[52] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13 cormhap_canu_troubleshooting_r1[8] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13 cormhap_canu_troubleshooting_r1[30] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13:it-c05b13 cormhap_canu_troubleshooting_r1[53] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11 cormhap_canu_troubleshooting_r1[9] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11 cormhap_canu_troubleshooting_r1[32] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11:it-c05b11 cormhap_canu_troubleshooting_r1[55] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04 cormhap_canu_troubleshooting_r1[10] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04:it-c06b04 cormhap_canu_troubleshooting_r1[33] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05 cormhap_canu_troubleshooting_r1[11] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05:it-c06b05 cormhap_canu_troubleshooting_r1[34] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06 cormhap_canu_troubleshooting_r1[12] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06:it-c06b06 cormhap_canu_troubleshooting_r1[35] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07 cormhap_canu_troubleshooting_r1[13] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07:it-c06b07 cormhap_canu_troubleshooting_r1[36] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08 cormhap_canu_troubleshooting_r1[14] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08:it-c06b08 cormhap_canu_troubleshooting_r1[37] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01 cormhap_canu_troubleshooting_r1[15] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01:it-c03b01 cormhap_canu_troubleshooting_r1[38] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03 cormhap_canu_troubleshooting_r1[16] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03:it-c03b03 cormhap_canu_troubleshooting_r1[39] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08 cormhap_canu_troubleshooting_r1[17] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08:it-c03b08 cormhap_canu_troubleshooting_r1[40] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06 cormhap_canu_troubleshooting_r1[18] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06:it-c01b06 cormhap_canu_troubleshooting_r1[41] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09 cormhap_canu_troubleshooting_r1[19] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09:it-c01b09 cormhap_canu_troubleshooting_r1[42] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30 cormhap_canu_troubleshooting_r1[20] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30:it-r16u30 cormhap_canu_troubleshooting_r1[43] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29 cormhap_canu_troubleshooting_r1[21] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29:it-r16u29 cormhap_canu_troubleshooting_r1[45] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25 cormhap_canu_troubleshooting_r1[22] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25:it-r16u25 cormhap_canu_troubleshooting_r1[44] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   airstream:airstream:airstream:airstream:airstream:airstream:airstream:airstream cormhap_canu_troubleshooting_r1[23] Dec 30 12:44
769810  tfallon RUN   14         it-c01b13   airstream:airstream:airstream:airstream:airstream:airstream:airstream:airstream cormhap_canu_troubleshooting_r1[46] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[56] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[57] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[58] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[59] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[60] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[61] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[62] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[63] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[64] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[65] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[66] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[67] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[68] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[69] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[70] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[71] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[72] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[73] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[74] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[75] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[76] Dec 30 12:44
769810  tfallon PEND  14         it-c01b13      -        cormhap_canu_troubleshooting_r1[77] Dec 30 12:44

Thoughts? Unfortunately I'm not too familiar with array jobs on LSF, but I would be happy to run any troubleshooting needed.

All the best, -Tim

photocyte commented 7 years ago

This issue and pull request may be relevant: https://github.com/marbl/canu/issues/284 https://github.com/marbl/canu/pull/39

skoren commented 7 years ago

It's strange that the run proceeded fine up to this point since the previous steps do the same array job submission followed by a hold on the array. My best guess is that here the array job wasn't yet registered when the holding job was submitted, causing an error. What happens if you run the submit command now:

    bsub \
      -M 4194304 \
      -n 1 \
      -q 14  \
      -w "ended("cormhap_canu_troubleshooting_r1[1-77]")" \
      -J "canu_canu_troubleshooting_r1" \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.sh

If it works (since there are clearly jobs named "cormhap_canu_troubleshooting_r1" running), I think the likeliest explanation is the race condition. In that case we could add a delay between the two submission commands, to give the grid engine enough time to register the initial job.
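
For reference, a minimal sketch of that delay-and-retry idea (hypothetical, not what canu currently does; paths are abbreviated from the log above):

    # retry the hold submission a few times, pausing so LSF has time to register the array
    for attempt in 1 2 3; do
        bsub -q 14 -M 4194304 -n 1 \
             -w "ended(cormhap_canu_troubleshooting_r1[1-77])" \
             -J "canu_canu_troubleshooting_r1" \
             -o canu-scripts/canu.04.out canu-scripts/canu.04.sh && break
        sleep 10
    done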

photocyte commented 7 years ago

When executing:

bsub \
  -q 14  -M 32505856 -n 8 \
  -J "cormhap_canu_troubleshooting_r1[1-77]" \
   \
  -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.%I.out \
  /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.sh 0

followed, after about 15 seconds, by:

bsub \
  -M 4194304 \
  -n 1 \
  -q 14  \
  -w "ended("cormhap_canu_troubleshooting_r1[1-77]")" \
  -J "canu_canu_troubleshooting_r1" \
  -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.sh

it does successfully submit, which suggests a race condition where the second bsub command is sometimes issued too early. That said, the majority of the array jobs from the first bsub finished very quickly (after 5 minutes of execution only one was still running), so too long a delay could be problematic as well.
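
Rather than a fixed delay, perhaps the submission could poll until the array is visible by name (again just a sketch of the idea, not canu code; it assumes bjobs exits non-zero when no matching job is found):

    # wait up to ~60 seconds for LSF to report the array by name
    for i in $(seq 1 30); do
        bjobs -J "cormhap_canu_troubleshooting_r1" >/dev/null 2>&1 && break
        sleep 2
    done
    # ...then submit the dependent canu_* hold job as before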

brianwalenz commented 7 years ago

What are the details of job 769804? I'm guessing it has the same name ("cormhap_canu_troubleshooting_r1") and only has 32 tasks. I suspect LSF is picking up this earlier job instead of the later one (with 77 tasks). If so, changing line 591 in src/pipelines/canu/OverlapMhap.pm from "${tag}mhap" to "${tag}mpre" should resolve the problem. This just changes the name of the earlier job to something different.

photocyte commented 7 years ago

Hi Brian,

The earlier submission with 32 tasks does have the "cormhap_canu_troubleshooting_r1" name.

I made the change suggested on line 591 of OverlapMhap.pm:

    submitOrRunParallelJob($WRK, $asm, "${tag}mpre", $path, "precompute", @failedJobs);

then recompiled and reran, but it now crashes at the canu.02.sh stage with a different error:

-- Canu v1.4 (+26 commits) r8021 77d84666cefcadf7f8d6ca5c18fbe1ae60f4c233.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
-- 
-- Found   1 host  with  40 cores and  251 GB memory under LSF control.
-- Found   1 host  with  32 cores and  188 GB memory under LSF control.
-- Found   5 hosts with  24 cores and  141 GB memory under LSF control.
-- Found   1 host  with 120 cores and 1007 GB memory under LSF control.
-- Found   3 hosts with  24 cores and  377 GB memory under LSF control.
-- Found   2 hosts with   2 cores and    1 GB memory under LSF control.
-- Found  15 hosts with  24 cores and  188 GB memory under LSF control.
-- Found   4 hosts with  24 cores and  283 GB memory under LSF control.
-- Found   1 host  with  32 cores and  125 GB memory under LSF control.
-- Found   1 host  with   4 cores and    7 GB memory under LSF control.
-- Found   1 host  with  32 cores and 1007 GB memory under LSF control.
-- Found   6 hosts with  32 cores and  251 GB memory under LSF control.
-- Found   1 host  with  24 cores and  251 GB memory under LSF control.
-- 
-- On LSF detected memory is requested in KB
-- 
--
-- Allowed to run under grid control, and use up to   8 compute threads and   41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 600000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
----------------------------------------
-- Starting command on Tue Jan  3 11:00:58 2017 with 1005.063 GB free disk space

    /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/meryl \
      -Dh \
      -s /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16 \
    > /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16.histogram \
    2> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16.histogram.info

-- Finished on Tue Jan  3 11:00:59 2017 (1 second) with 1005.061 GB free disk space
----------------------------------------
-- For mhap overlapping, set repeat k-mer threshold to 210846.
--
-- Found 21084654556 16-mers; 1964948287 distinct and 284579777 unique.  Largest count 9217106.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=normal based on read coverage of 35.
--
-- PARAMETERS: hashes=512, minMatches=3, threshold=0.78
--
-- Given 31 GB, can fit 93000 reads per block.
-- For 33 blocks, set stride to 8 blocks.
-- Logging partitioning to '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/1-overlapper/partitioning.log'.
-- Configured 32 mhap precompute jobs.
-- Configured 77 mhap overlap jobs.
-- mhap precompute attempt 1 begins with 0 finished, and 32 to compute.
================================================================================
Unknown parameter 'cormprememory' accessed.  Stack trace:
 at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/Defaults.pm line 72.
    canu::Defaults::getGlobal('cormpreMemory') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/Execution.pm line 957
    canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cormpre', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'precompute', 1, 2, 3, 4, ...) called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/OverlapMhap.pm line 591
    canu::OverlapMhap::mhapPrecomputeCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cor', 'partial') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/canu line 425
    main::overlap('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cor') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/canu line 473

------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b13>
Subject: Job 802704: <canu_canu_troubleshooting_mpre> Exited

Job <canu_canu_troubleshooting_mpre> was submitted from host <it-c01b02> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b13>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre> was used as the working directory.
Started at Tue Jan  3 11:00:56 2017
Results reported at Tue Jan  3 11:03:15 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/canu-scripts/canu.02.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :    131.70 sec.
    Max Memory :        92 MB
    Max Swap   :       170 MB

    Max Processes  :         7
    Max Threads    :         8

The output (if any) is above this job summary.
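
For what it's worth, my reading of the stack trace is that the job tag is also used to build the grid option names, so the renamed "cormpre" tag now looks up a "cormpreMemory" parameter that Defaults.pm never defines (whereas "cormhapMemory" exists). If it helps, I believe the defined options can be listed with:

    canu -options | grep -i cormhap
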
brianwalenz commented 7 years ago

Sorry for the false lead.

I reworked how job dependencies are created. Can you give the latest a try? I'd appreciate it if you could exercise it a bit too:

  1. update the source code and rebuild (remove the 'mpre' baloney).
  2. restart canu with the same command.
  3. it should very briefly run the 'canu' job, then submit a job array with 77 tasks for 'mhap'.
  4. delete any two (or more) non-consecutive tasks (e.g., tasks 3 and 5; see the sketch after this list).
  5. wait for the mhap jobs to finish.
  6. check canu.*.out -- I wouldn't mind seeing the one that resubmits the deleted tasks.
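
For step 4, something like this should drop individual tasks from the array (the job id below is a placeholder for whatever id the new run gets):

    bkill "123456[3]"    # hypothetical job id; kills only task 3 of the array
    bkill "123456[5]"    # likewise task 5, leaving the other tasks running
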
photocyte commented 7 years ago

Hi there,

Thanks for the help! I tried the new version, and I think there is a new error:

-- Canu v1.4 (+42 commits) r8037 4ece307bc793c3bc61628526429c224c477c2224.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
-- 
-- Found   1 host  with  24 cores and  251 GB memory under LSF control.
-- Found   1 host  with  32 cores and 1007 GB memory under LSF control.
-- Found   1 host  with 120 cores and 1007 GB memory under LSF control.
-- Found   2 hosts with   2 cores and    1 GB memory under LSF control.
-- Found   1 host  with  40 cores and  251 GB memory under LSF control.
-- Found   4 hosts with  24 cores and  283 GB memory under LSF control.
-- Found   1 host  with  32 cores and  125 GB memory under LSF control.
-- Found   6 hosts with  32 cores and  251 GB memory under LSF control.
-- Found   1 host  with   4 cores and    7 GB memory under LSF control.
-- Found   5 hosts with  24 cores and  141 GB memory under LSF control.
-- Found  17 hosts with  24 cores and  188 GB memory under LSF control.
-- Found   1 host  with  32 cores and  188 GB memory under LSF control.
-- Found   3 hosts with  24 cores and  377 GB memory under LSF control.
-- 
-- On LSF detected memory is requested in KB
-- 
--
-- Allowed to run under grid control, and use up to   8 compute threads and   41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 600000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Tue Jan 10 15:43:33 2017 with 995.143 GB free disk space

    /lab/solexa_weng/testtube/canu/Linux-amd64/bin/gatekeeperCreate \
      -minlength 1000 \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.BUILDING \
      /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.gkp \
    > /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.BUILDING.err 2>&1

-- Finished on Tue Jan 10 15:50:46 2017 (433 seconds) with 987.262 GB free disk space
----------------------------------------
--
-- In gatekeeper store '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore':
--   Found 2970660 reads.
--   Found 21129214456 bases (35.21 times coverage).
--
--   Read length histogram (one '*' equals 5485.02 reads):
--        0    999      0 
--     1000   1999 383952 **********************************************************************
--     2000   2999 364900 ******************************************************************
--     3000   3999 315020 *********************************************************
--     4000   4999 267712 ************************************************
--     5000   5999 230442 ******************************************
--     6000   6999 204624 *************************************
--     7000   7999 180294 ********************************
--     8000   8999 154326 ****************************
--     9000   9999 131065 ***********************
--    10000  10999 111716 ********************
--    11000  11999  96941 *****************
--    12000  12999  84075 ***************
--    13000  13999  74004 *************
--    14000  14999  69133 ************
--    15000  15999  68816 ************
--    16000  16999  62979 ***********
--    17000  17999  49003 ********
--    18000  18999  34850 ******
--    19000  19999  24972 ****
--    20000  20999  17421 ***
--    21000  21999  12696 **
--    22000  22999   8915 *
--    23000  23999   6353 *
--    24000  24999   4649 
--    25000  25999   3329 
--    26000  26999   2256 
--    27000  27999   1670 
--    28000  28999   1181 
--    29000  29999    918 
--    30000  30999    696 
--    31000  31999    493 
--    32000  32999    342 
--    33000  33999    258 
--    34000  34999    172 
--    35000  35999    145 
--    36000  36999    116 
--    37000  37999     66 
--    38000  38999     55 
--    39000  39999     24 
--    40000  40999     33 
--    41000  41999     17 
--    42000  42999     11 
--    43000  43999      9 
--    44000  44999      2 
--    45000  45999      4 
--    46000  46999      3 
--    47000  47999      1 
--    48000  48999      0 
--    49000  49999      0 
--    50000  50999      0 
--    51000  51999      0 
--    52000  52999      0 
--    53000  53999      0 
--    54000  54999      0 
--    55000  55999      0 
--    56000  56999      0 
--    57000  57999      0 
--    58000  58999      0 
--    59000  59999      0 
--    60000  60999      0 
--    61000  61999      1 
-- Meryl attempt 1 begins.
--
-- '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/0-mercounts/meryl.jobSubmit.sh' -> job ended() tasks 1.
--
----------------------------------------
-- Starting command on Tue Jan 10 15:51:11 2017 with 987.024 GB free disk space

    bsub \
      -w ended() \
      -M 4194304 \
      -n 1 \
      -q 14  \
      -J 'canu_canu_troubleshooting_r6' \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.02.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.02.sh
sh: 1: Syntax error: "(" unexpected

-- Finished on Tue Jan 10 15:51:11 2017 (lickety-split) with 987.024 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 2.  (rc=512)
ERROR:
================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

 at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1361.
    canu::Execution::caFailure('Failed to submit script', undef) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 626
    canu::Execution::submitScript('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', '-w ended()') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1068
    canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', 'meryl', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'meryl', 1) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Meryl.pm line 373
    canu::Meryl::merylCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', 'cor') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 470

canu failed with 'Failed to submit script'.

------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b11>
Subject: Job 823355: <canu_canu_troubleshooting_r6> Exited

Job <canu_canu_troubleshooting_r6> was submitted from host <tak4> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b11>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6> was used as the working directory.
Started at Tue Jan 10 15:43:31 2017
Results reported at Tue Jan 10 15:51:11 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.01.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :    712.10 sec.
    Max Memory :       509 MB
    Max Swap   :       836 MB

    Max Processes  :         8
    Max Threads    :         9

The output (if any) is above this job summary.

The problem seems to be the "sh: 1: Syntax error: "(" unexpected" message.

The relevant canu.02.sh script doesn't contain any "(" or ")" itself, so I believe the "-w ended()" argument may be the issue?
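
As a small illustration (a hypothetical reproduction on my side, not taken from the canu logs): when the unquoted ended() reaches the shell, the bare parentheses are parsed as shell syntax rather than as part of the -w argument, which produces exactly the error above.

    sh -c 'bsub -w ended() -J test sleep 1'
    # sh: 1: Syntax error: "(" unexpected

So presumably the fix is to either pass a non-empty, quoted dependency expression or omit -w entirely when there is nothing to wait on.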

brianwalenz commented 7 years ago

Yup, ended() was the problem. Unfortunately, I have no way to test LSF support.

If the commit above doesn't fix it, can you also include 'correction/0-mercounts/meryl.jobSubmit.out'?

photocyte commented 7 years ago

Hi Brian,

Almost there I think : ) That -w parameter is still being tricky:

./canu_troubleshooting_r10/canu-scripts/canu.01.out

-- Canu v1.4 (+54 commits) r8049 7a68912097acc49ae15ebd569d316ee62658c1a3.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
-- 
-- Found   1 host  with  32 cores and  125 GB memory under LSF control.
-- Found   5 hosts with  24 cores and  141 GB memory under LSF control.
-- Found   6 hosts with  32 cores and  251 GB memory under LSF control.
-- Found   1 host  with  32 cores and 1007 GB memory under LSF control.
-- Found   1 host  with  40 cores and  251 GB memory under LSF control.
-- Found   3 hosts with  24 cores and  377 GB memory under LSF control.
-- Found   1 host  with  24 cores and  251 GB memory under LSF control.
-- Found   1 host  with   4 cores and    7 GB memory under LSF control.
-- Found   1 host  with 120 cores and 1007 GB memory under LSF control.
-- Found   4 hosts with  24 cores and  283 GB memory under LSF control.
-- Found   1 host  with  32 cores and  188 GB memory under LSF control.
-- Found   2 hosts with   2 cores and    1 GB memory under LSF control.
-- Found  17 hosts with  24 cores and  188 GB memory under LSF control.
-- 
-- On LSF detected memory is requested in KB
-- 
--
-- Allowed to run under grid control, and use up to   8 compute threads and   41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to   1 compute thread  and   16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to   1 compute thread  and    2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to   4 compute threads and    7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to   8 compute threads and   31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 600000000
--   errorRate           -- 0.015
--   
--   corOvlErrorRate     -- 0.045
--   obtOvlErrorRate     -- 0.045
--   utgOvlErrorRate     -- 0.045
--   
--   obtErrorRate        -- 0.045
--   
--   cnsErrorRate        -- 0.045
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Fri Jan 20 09:39:38 2017 with 1371.253 GB free disk space

    /lab/solexa_weng/testtube/canu/Linux-amd64/bin/gatekeeperCreate \
      -minlength 1000 \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.BUILDING \
      /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.gkp \
    > /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.BUILDING.err 2>&1

-- Finished on Fri Jan 20 09:48:03 2017 (505 seconds) with 1364.575 GB free disk space
----------------------------------------
--
-- In gatekeeper store '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore':
--   Found 3577683 reads.
--   Found 24690497462 bases (41.15 times coverage).
--
--   Read length histogram (one '*' equals 7095.75 reads):
--        0    999      0 
--     1000   1999 496703 **********************************************************************
--     2000   2999 463424 *****************************************************************
--     3000   3999 391585 *******************************************************
--     4000   4999 325053 *********************************************
--     5000   5999 275250 **************************************
--     6000   6999 241893 **********************************
--     7000   7999 210930 *****************************
--     8000   8999 179574 *************************
--     9000   9999 151367 *********************
--    10000  10999 128845 ******************
--    11000  11999 110882 ***************
--    12000  12999  95748 *************
--    13000  13999  84241 ***********
--    14000  14999  78851 ***********
--    15000  15999  78425 ***********
--    16000  16999  71847 **********
--    17000  17999  55859 *******
--    18000  18999  39660 *****
--    19000  19999  28196 ***
--    20000  20999  19683 **
--    21000  21999  14294 **
--    22000  22999   9982 *
--    23000  23999   7128 *
--    24000  24999   5192 
--    25000  25999   3717 
--    26000  26999   2507 
--    27000  27999   1852 
--    28000  28999   1308 
--    29000  29999   1015 
--    30000  30999    764 
--    31000  31999    537 
--    32000  32999    374 
--    33000  33999    287 
--    34000  34999    183 
--    35000  35999    158 
--    36000  36999    126 
--    37000  37999     70 
--    38000  38999     58 
--    39000  39999     26 
--    40000  40999     36 
--    41000  41999     19 
--    42000  42999     11 
--    43000  43999     10 
--    44000  44999      2 
--    45000  45999      4 
--    46000  46999      3 
--    47000  47999      1 
--    48000  48999      0 
--    49000  49999      1 
--    50000  50999      0 
--    51000  51999      1 
--    52000  52999      0 
--    53000  53999      0 
--    54000  54999      0 
--    55000  55999      0 
--    56000  56999      0 
--    57000  57999      0 
--    58000  58999      0 
--    59000  59999      0 
--    60000  60999      0 
--    61000  61999      1 
-- Meryl attempt 1 begins.
--
-- '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/0-mercounts/meryl.jobSubmit.sh' -> job ended(849749) tasks 1.
--
----------------------------------------
-- Starting command on Fri Jan 20 09:48:33 2017 with 1364.462 GB free disk space

    bsub \
      -w "1" \
      -M 4194304 \
      -n 1 \
      -q 14  \
      -J 'canu_canu_troubleshooting_r10' \
      -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.02.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.02.sh
1: No matching job found. Job not submitted.

-- Finished on Fri Jan 20 09:48:33 2017 (lickety-split) with 1364.462 GB free disk space
----------------------------------------
ERROR:
ERROR:  Failed with exit code 255.  (rc=65280)
ERROR:
================================================================================
Please panic.  canu failed, and it shouldn't have.

Stack trace:

 at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1361.
    canu::Execution::caFailure('Failed to submit script', undef) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 626
    canu::Execution::submitScript('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', '-w "1"') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1068
    canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', 'meryl', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'meryl', 1) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Meryl.pm line 373
    canu::Meryl::merylCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', 'cor') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 472

canu failed with 'Failed to submit script'.

------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b10>
Subject: Job 849747: <canu_canu_troubleshooting_r10> Exited

Job <canu_canu_troubleshooting_r10> was submitted from host <tak4> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b10>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10> was used as the working directory.
Started at Fri Jan 20 09:39:36 2017
Results reported at Fri Jan 20 09:48:33 2017

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.01.sh
------------------------------------------------------------

Exited with exit code 1.

Resource usage summary:

    CPU time   :    831.38 sec.
    Max Memory :        87 MB
    Max Swap   :       176 MB

    Max Processes  :         8
    Max Threads    :         9

The output (if any) is above this job summary.

./canu_troubleshooting_r10/correction/0-mercounts/meryl.jobSubmit.out:

    Job <849749> is submitted to queue <14>.
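
Comparing the two, the hold was submitted with -w "1" while the meryl job it should wait on is 849749 (the jobSubmit.sh line above even says ended(849749)), so LSF goes looking for a literal job 1 and refuses. My guess is that the intended submission would look more like this (paths abbreviated; this is just my reconstruction):

    bsub -w "ended(849749)" -M 4194304 -n 1 -q 14 \
      -J 'canu_canu_troubleshooting_r10' \
      -o canu-scripts/canu.02.out canu-scripts/canu.02.sh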

brianwalenz commented 7 years ago

Thanks for the patience. Give it another try.

photocyte commented 7 years ago

Hi Brian,

It seems to be working! The assembler has been running for a few hours and has progressed to canu.08.sh without any errors. Thank you for your patience in figuring out the issue!

Best, -Tim

photocyte commented 7 years ago

One last update: the original issue seems to be entirely solved. Canu went through the whole pipeline and produced contigs that are reasonably correct based on other data, so I'm closing the issue.

I did notice some duplicate genes in the contigs, presumably due to heterozygosity, as this is a pooled sample. Could you point me towards the most current discussion on tweaking parameters for heterozygous assembly?