Closed: photocyte closed this issue 7 years ago
These pull requests and issues may be relevant: https://github.com/marbl/canu/issues/284 https://github.com/marbl/canu/pull/39
It's strange that the run proceeded fine up to this point since the previous steps do the same array job submission followed by a hold on the array. My best guess is that here the array job wasn't yet registered when the holding job was submitted, causing an error. What happens if you run the submit command now:
bsub \
-M 4194304 \
-n 1 \
-q 14 \
-w "ended("cormhap_canu_troubleshooting_r1[1-77]")" \
-J "canu_canu_troubleshooting_r1" \
-o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.sh
If it works (and there are clearly jobs named "cormhap_canu_troubleshooting_r1" running), the likeliest explanation is a race condition. In that case we could add a delay between the two submission commands to give the grid engine enough time to register the initial job.
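As a sketch of that workaround, instead of a fixed sleep the wrapper could poll until the array job is actually visible to the scheduler before submitting the dependent job. This assumes LSF's `bjobs` is on PATH; the function name and retry limit are illustrative, not canu code:

```shell
# Illustrative guard between the two submissions: poll until `bjobs -J NAME`
# reports the array job, so the dependent submission never races it.
wait_until_visible() {
    name=$1
    tries=0
    until bjobs -J "$name" >/dev/null 2>&1; do
        tries=$((tries + 1))
        [ "$tries" -ge 10 ] && return 1   # give up after ~10 seconds
        sleep 1
    done
    return 0
}
```

Used between the two commands as, e.g., `wait_until_visible "cormhap_canu_troubleshooting_r1" && bsub ...`, this bounds the wait instead of guessing a fixed delay.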
When executing:
bsub \
> -q 14 -M 32505856 -n 8 \
> -J "cormhap_canu_troubleshooting_r1[1-77]" \
> \
> -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.%I.out \
> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/correction/1-overlapper/mhap.sh 0
followed about 15 seconds later by:
bsub \
> -M 4194304 \
> -n 1 \
> -q 14 \
> -w "ended("cormhap_canu_troubleshooting_r1[1-77]")" \
> -J "canu_canu_troubleshooting_r1" \
> -o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting/canu-scripts/canu.04.sh
It does successfully submit, which supports the race-condition explanation: the second bsub command is likely being issued too early. That said, most of the array jobs from the first bsub seemed to finish very quickly (after 5 minutes of execution only 1 was still running), so too long a delay could be problematic as well.
What are the details of job 769804? I'm guessing it has the same name ("cormhap_canu_troubleshooting_r1") and only has 32 tasks. I suspect LSF is picking up this earlier job instead of the later one (with 77 tasks). If so, changing line 591 in src/pipelines/canu/OverlapMhap.pm from "${tag}mhap" to "${tag}mpre" should resolve the problem. This just changes the name of the earlier job to something different.
Hi Brian,
The earlier submission with 32 tasks does have the "cormhap_canu_troubleshooting_r1" name.
I made the change suggested on line 591 of OverlapMhap.pm:
submitOrRunParallelJob($WRK, $asm, "${tag}mpre", $path, "precompute", @failedJobs);
then recompiled and reran, but it crashes at the "canu.02.sh" stage with a different error:
-- Canu v1.4 (+26 commits) r8021 77d84666cefcadf7f8d6ca5c18fbe1ae60f4c233.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
--
-- Found 1 host with 40 cores and 251 GB memory under LSF control.
-- Found 1 host with 32 cores and 188 GB memory under LSF control.
-- Found 5 hosts with 24 cores and 141 GB memory under LSF control.
-- Found 1 host with 120 cores and 1007 GB memory under LSF control.
-- Found 3 hosts with 24 cores and 377 GB memory under LSF control.
-- Found 2 hosts with 2 cores and 1 GB memory under LSF control.
-- Found 15 hosts with 24 cores and 188 GB memory under LSF control.
-- Found 4 hosts with 24 cores and 283 GB memory under LSF control.
-- Found 1 host with 32 cores and 125 GB memory under LSF control.
-- Found 1 host with 4 cores and 7 GB memory under LSF control.
-- Found 1 host with 32 cores and 1007 GB memory under LSF control.
-- Found 6 hosts with 32 cores and 251 GB memory under LSF control.
-- Found 1 host with 24 cores and 251 GB memory under LSF control.
--
-- On LSF detected memory is requested in KB
--
--
-- Allowed to run under grid control, and use up to 8 compute threads and 41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--
-- genomeSize -- 600000000
-- errorRate -- 0.015
--
-- corOvlErrorRate -- 0.045
-- obtOvlErrorRate -- 0.045
-- utgOvlErrorRate -- 0.045
--
-- obtErrorRate -- 0.045
--
-- cnsErrorRate -- 0.045
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
----------------------------------------
-- Starting command on Tue Jan 3 11:00:58 2017 with 1005.063 GB free disk space
/lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/meryl \
-Dh \
-s /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16 \
> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16.histogram \
2> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/0-mercounts/canu_troubleshooting_mpre.ms16.histogram.info
-- Finished on Tue Jan 3 11:00:59 2017 (1 second) with 1005.061 GB free disk space
----------------------------------------
-- For mhap overlapping, set repeat k-mer threshold to 210846.
--
-- Found 21084654556 16-mers; 1964948287 distinct and 284579777 unique. Largest count 9217106.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=normal based on read coverage of 35.
--
-- PARAMETERS: hashes=512, minMatches=3, threshold=0.78
--
-- Given 31 GB, can fit 93000 reads per block.
-- For 33 blocks, set stride to 8 blocks.
-- Logging partitioning to '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/correction/1-overlapper/partitioning.log'.
-- Configured 32 mhap precompute jobs.
-- Configured 77 mhap overlap jobs.
-- mhap precompute attempt 1 begins with 0 finished, and 32 to compute.
================================================================================
Unknown parameter 'cormprememory' accessed. Stack trace:
at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/Defaults.pm line 72.
canu::Defaults::getGlobal('cormpreMemory') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/Execution.pm line 957
canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cormpre', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'precompute', 1, 2, 3, 4, ...) called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/lib/canu/OverlapMhap.pm line 591
canu::OverlapMhap::mhapPrecomputeCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cor', 'partial') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/canu line 425
main::overlap('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_...', 'canu_troubleshooting_mpre', 'cor') called at /lab/solexa_weng/testtube/canu_changed/Linux-amd64/bin/canu line 473
------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b13>
Subject: Job 802704: <canu_canu_troubleshooting_mpre> Exited
Job <canu_canu_troubleshooting_mpre> was submitted from host <it-c01b02> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b13>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre> was used as the working directory.
Started at Tue Jan 3 11:00:56 2017
Results reported at Tue Jan 3 11:03:15 2017
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad_Pacbio1/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_mpre/canu-scripts/canu.02.sh
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 131.70 sec.
Max Memory : 92 MB
Max Swap : 170 MB
Max Processes : 7
Max Threads : 8
The output (if any) is above this job summary.
Sorry for the false lead.
I reworked how job dependencies are created. Can you give the latest a try? I'd appreciate it if you could exercise it a bit too:
Hi there,
Thanks for the help! I tried the new version; I think there is a new error:
-- Canu v1.4 (+42 commits) r8037 4ece307bc793c3bc61628526429c224c477c2224.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
--
-- Found 1 host with 24 cores and 251 GB memory under LSF control.
-- Found 1 host with 32 cores and 1007 GB memory under LSF control.
-- Found 1 host with 120 cores and 1007 GB memory under LSF control.
-- Found 2 hosts with 2 cores and 1 GB memory under LSF control.
-- Found 1 host with 40 cores and 251 GB memory under LSF control.
-- Found 4 hosts with 24 cores and 283 GB memory under LSF control.
-- Found 1 host with 32 cores and 125 GB memory under LSF control.
-- Found 6 hosts with 32 cores and 251 GB memory under LSF control.
-- Found 1 host with 4 cores and 7 GB memory under LSF control.
-- Found 5 hosts with 24 cores and 141 GB memory under LSF control.
-- Found 17 hosts with 24 cores and 188 GB memory under LSF control.
-- Found 1 host with 32 cores and 188 GB memory under LSF control.
-- Found 3 hosts with 24 cores and 377 GB memory under LSF control.
--
-- On LSF detected memory is requested in KB
--
--
-- Allowed to run under grid control, and use up to 8 compute threads and 41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--
-- genomeSize -- 600000000
-- errorRate -- 0.015
--
-- corOvlErrorRate -- 0.045
-- obtOvlErrorRate -- 0.045
-- utgOvlErrorRate -- 0.045
--
-- obtErrorRate -- 0.045
--
-- cnsErrorRate -- 0.045
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Tue Jan 10 15:43:33 2017 with 995.143 GB free disk space
/lab/solexa_weng/testtube/canu/Linux-amd64/bin/gatekeeperCreate \
-minlength 1000 \
-o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.BUILDING \
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.gkp \
> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore.BUILDING.err 2>&1
-- Finished on Tue Jan 10 15:50:46 2017 (433 seconds) with 987.262 GB free disk space
----------------------------------------
--
-- In gatekeeper store '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/canu_troubleshooting_r6.gkpStore':
-- Found 2970660 reads.
-- Found 21129214456 bases (35.21 times coverage).
--
-- Read length histogram (one '*' equals 5485.02 reads):
-- 0 999 0
-- 1000 1999 383952 **********************************************************************
-- 2000 2999 364900 ******************************************************************
-- 3000 3999 315020 *********************************************************
-- 4000 4999 267712 ************************************************
-- 5000 5999 230442 ******************************************
-- 6000 6999 204624 *************************************
-- 7000 7999 180294 ********************************
-- 8000 8999 154326 ****************************
-- 9000 9999 131065 ***********************
-- 10000 10999 111716 ********************
-- 11000 11999 96941 *****************
-- 12000 12999 84075 ***************
-- 13000 13999 74004 *************
-- 14000 14999 69133 ************
-- 15000 15999 68816 ************
-- 16000 16999 62979 ***********
-- 17000 17999 49003 ********
-- 18000 18999 34850 ******
-- 19000 19999 24972 ****
-- 20000 20999 17421 ***
-- 21000 21999 12696 **
-- 22000 22999 8915 *
-- 23000 23999 6353 *
-- 24000 24999 4649
-- 25000 25999 3329
-- 26000 26999 2256
-- 27000 27999 1670
-- 28000 28999 1181
-- 29000 29999 918
-- 30000 30999 696
-- 31000 31999 493
-- 32000 32999 342
-- 33000 33999 258
-- 34000 34999 172
-- 35000 35999 145
-- 36000 36999 116
-- 37000 37999 66
-- 38000 38999 55
-- 39000 39999 24
-- 40000 40999 33
-- 41000 41999 17
-- 42000 42999 11
-- 43000 43999 9
-- 44000 44999 2
-- 45000 45999 4
-- 46000 46999 3
-- 47000 47999 1
-- 48000 48999 0
-- 49000 49999 0
-- 50000 50999 0
-- 51000 51999 0
-- 52000 52999 0
-- 53000 53999 0
-- 54000 54999 0
-- 55000 55999 0
-- 56000 56999 0
-- 57000 57999 0
-- 58000 58999 0
-- 59000 59999 0
-- 60000 60999 0
-- 61000 61999 1
-- Meryl attempt 1 begins.
--
-- '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/correction/0-mercounts/meryl.jobSubmit.sh' -> job ended() tasks 1.
--
----------------------------------------
-- Starting command on Tue Jan 10 15:51:11 2017 with 987.024 GB free disk space
bsub \
-w ended() \
-M 4194304 \
-n 1 \
-q 14 \
-J 'canu_canu_troubleshooting_r6' \
-o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.02.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.02.sh
sh: 1: Syntax error: "(" unexpected
-- Finished on Tue Jan 10 15:51:11 2017 (lickety-split) with 987.024 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 2. (rc=512)
ERROR:
================================================================================
Please panic. canu failed, and it shouldn't have.
Stack trace:
at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1361.
canu::Execution::caFailure('Failed to submit script', undef) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 626
canu::Execution::submitScript('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', '-w ended()') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1068
canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', 'meryl', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'meryl', 1) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Meryl.pm line 373
canu::Meryl::merylCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r6', 'cor') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 470
canu failed with 'Failed to submit script'.
------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b11>
Subject: Job 823355: <canu_canu_troubleshooting_r6> Exited
Job <canu_canu_troubleshooting_r6> was submitted from host <tak4> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b11>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6> was used as the working directory.
Started at Tue Jan 10 15:43:31 2017
Results reported at Tue Jan 10 15:51:11 2017
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r6/canu-scripts/canu.01.sh
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 712.10 sec.
Max Memory : 509 MB
Max Swap : 836 MB
Max Processes : 8
Max Threads : 9
The output (if any) is above this job summary.
The problem seems to be the "sh: 1: Syntax error: "(" unexpected" line.
The relevant canu.02.sh script doesn't contain any "(" or ")",
so I believe the "-w ended()" parameter may be the issue?
Yup, ended() was the problem. Unfortunately, I have no way to test LSF support.
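A minimal sketch of the guard (not canu's actual Perl, which lives in Execution.pm): build the `-w` option only when at least one dependency exists, so an empty job list can never produce the invalid `ended()`. The function name is hypothetical:

```shell
# Hypothetical helper: emit a -w option only when there are dependencies,
# joining multiple job IDs with &&; with no arguments it emits nothing,
# so the bare, invalid "ended()" can never reach bsub.
make_wait_opt() {
    if [ "$#" -eq 0 ]; then
        printf '%s' ''                     # no dependencies: no -w at all
    else
        deps=""
        for id in "$@"; do
            deps="${deps:+$deps && }ended($id)"
        done
        printf '%s' "-w \"$deps\""
    fi
}
```

For example, `make_wait_opt 849749` yields `-w "ended(849749)"`, while `make_wait_opt` with no jobs yields an empty string.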
If the commit above doesn't fix it, can you also include 'correction/0-mercounts/meryl.jobSubmit.out'?
Hi Brian,
Almost there I think : ) That -w parameter is still being tricky:
./canu_troubleshooting_r10/canu-scripts/Canu.01.out
-- Canu v1.4 (+54 commits) r8049 7a68912097acc49ae15ebd569d316ee62658c1a3.
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java').
-- Detected gnuplot version '4.6 patchlevel 4' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 189 gigabytes of memory.
-- Detected LSF with 'bsub' binary in /nfs/lsf8/8.0/linux2.6-glibc2.3-x86_64/bin/bsub.
-- Warning: unknown memory units for grid engine LSF assuming KB
--
-- Found 1 host with 32 cores and 125 GB memory under LSF control.
-- Found 5 hosts with 24 cores and 141 GB memory under LSF control.
-- Found 6 hosts with 32 cores and 251 GB memory under LSF control.
-- Found 1 host with 32 cores and 1007 GB memory under LSF control.
-- Found 1 host with 40 cores and 251 GB memory under LSF control.
-- Found 3 hosts with 24 cores and 377 GB memory under LSF control.
-- Found 1 host with 24 cores and 251 GB memory under LSF control.
-- Found 1 host with 4 cores and 7 GB memory under LSF control.
-- Found 1 host with 120 cores and 1007 GB memory under LSF control.
-- Found 4 hosts with 24 cores and 283 GB memory under LSF control.
-- Found 1 host with 32 cores and 188 GB memory under LSF control.
-- Found 2 hosts with 2 cores and 1 GB memory under LSF control.
-- Found 17 hosts with 24 cores and 188 GB memory under LSF control.
--
-- On LSF detected memory is requested in KB
--
--
-- Allowed to run under grid control, and use up to 8 compute threads and 41 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 16 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 4 compute threads and 7 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #1, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--
-- genomeSize -- 600000000
-- errorRate -- 0.015
--
-- corOvlErrorRate -- 0.045
-- obtOvlErrorRate -- 0.045
-- utgOvlErrorRate -- 0.045
--
-- obtErrorRate -- 0.045
--
-- cnsErrorRate -- 0.045
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Fri Jan 20 09:39:38 2017 with 1371.253 GB free disk space
/lab/solexa_weng/testtube/canu/Linux-amd64/bin/gatekeeperCreate \
-minlength 1000 \
-o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.BUILDING \
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.gkp \
> /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore.BUILDING.err 2>&1
-- Finished on Fri Jan 20 09:48:03 2017 (505 seconds) with 1364.575 GB free disk space
----------------------------------------
--
-- In gatekeeper store '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/canu_troubleshooting_r10.gkpStore':
-- Found 3577683 reads.
-- Found 24690497462 bases (41.15 times coverage).
--
-- Read length histogram (one '*' equals 7095.75 reads):
-- 0 999 0
-- 1000 1999 496703 **********************************************************************
-- 2000 2999 463424 *****************************************************************
-- 3000 3999 391585 *******************************************************
-- 4000 4999 325053 *********************************************
-- 5000 5999 275250 **************************************
-- 6000 6999 241893 **********************************
-- 7000 7999 210930 *****************************
-- 8000 8999 179574 *************************
-- 9000 9999 151367 *********************
-- 10000 10999 128845 ******************
-- 11000 11999 110882 ***************
-- 12000 12999 95748 *************
-- 13000 13999 84241 ***********
-- 14000 14999 78851 ***********
-- 15000 15999 78425 ***********
-- 16000 16999 71847 **********
-- 17000 17999 55859 *******
-- 18000 18999 39660 *****
-- 19000 19999 28196 ***
-- 20000 20999 19683 **
-- 21000 21999 14294 **
-- 22000 22999 9982 *
-- 23000 23999 7128 *
-- 24000 24999 5192
-- 25000 25999 3717
-- 26000 26999 2507
-- 27000 27999 1852
-- 28000 28999 1308
-- 29000 29999 1015
-- 30000 30999 764
-- 31000 31999 537
-- 32000 32999 374
-- 33000 33999 287
-- 34000 34999 183
-- 35000 35999 158
-- 36000 36999 126
-- 37000 37999 70
-- 38000 38999 58
-- 39000 39999 26
-- 40000 40999 36
-- 41000 41999 19
-- 42000 42999 11
-- 43000 43999 10
-- 44000 44999 2
-- 45000 45999 4
-- 46000 46999 3
-- 47000 47999 1
-- 48000 48999 0
-- 49000 49999 1
-- 50000 50999 0
-- 51000 51999 1
-- 52000 52999 0
-- 53000 53999 0
-- 54000 54999 0
-- 55000 55999 0
-- 56000 56999 0
-- 57000 57999 0
-- 58000 58999 0
-- 59000 59999 0
-- 60000 60999 0
-- 61000 61999 1
-- Meryl attempt 1 begins.
--
-- '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/correction/0-mercounts/meryl.jobSubmit.sh' -> job ended(849749) tasks 1.
--
----------------------------------------
-- Starting command on Fri Jan 20 09:48:33 2017 with 1364.462 GB free disk space
bsub \
-w "1" \
-M 4194304 \
-n 1 \
-q 14 \
-J 'canu_canu_troubleshooting_r10' \
-o /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.02.out /lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.02.sh
1: No matching job found. Job not submitted.
-- Finished on Fri Jan 20 09:48:33 2017 (lickety-split) with 1364.462 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 255. (rc=65280)
ERROR:
================================================================================
Please panic. canu failed, and it shouldn't have.
Stack trace:
at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1361.
canu::Execution::caFailure('Failed to submit script', undef) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 626
canu::Execution::submitScript('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', '-w "1"') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Execution.pm line 1068
canu::Execution::submitOrRunParallelJob('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', 'meryl', '/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'meryl', 1) called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/lib/canu/Meryl.pm line 373
canu::Meryl::merylCheck('/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad...', 'canu_troubleshooting_r10', 'cor') called at /lab/solexa_weng/testtube/canu/Linux-amd64/bin/canu line 472
canu failed with 'Failed to submit script'.
------------------------------------------------------------
Sender: LSF System <lsfadmin@it-c01b10>
Subject: Job 849747: <canu_canu_troubleshooting_r10> Exited
Job <canu_canu_troubleshooting_r10> was submitted from host <tak4> by user <tfallon> in cluster <irken>.
Job was executed on host(s) <it-c01b10>, in queue <14>, as user <tfallon> in cluster <irken>.
</home/tfallon> was used as the home directory.
</lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10> was used as the working directory.
Started at Fri Jan 20 09:39:36 2017
Results reported at Fri Jan 20 09:48:33 2017
Your job looked like:
------------------------------------------------------------
# LSBATCH: User input
/lab/solexa_weng/Seq_data/Projects/Tim_Fallon/1611Broad/download/destination/SN0107037/analyses/CANU_MAIN/canu_troubleshooting_r10/canu-scripts/canu.01.sh
------------------------------------------------------------
Exited with exit code 1.
Resource usage summary:
CPU time : 831.38 sec.
Max Memory : 87 MB
Max Swap : 176 MB
Max Processes : 8
Max Threads : 9
The output (if any) is above this job summary.
./canu_troubleshooting_r10/correction/0-mercounts/meryl.jobSubmit.out
Job <849749> is submitted to queue <14>.
Thanks for the patience. Give it another try.
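For what it's worth, the `meryl.jobSubmit.out` line above shows the form a fix can key on: `bsub` reports the new job's numeric ID, which a wrapper can capture and then wait on with `ended(ID)` rather than a job name. A sketch, with the sample line copied from the log above:

```shell
# bsub prints "Job <ID> is submitted to queue <Q>."; capture the numeric ID
# so the dependent job can be held with -w "ended(ID)".
line='Job <849749> is submitted to queue <14>.'   # sample from the log
jobid=$(printf '%s\n' "$line" | sed -n 's/^Job <\([0-9]*\)>.*/\1/p')
printf '%s\n' "$jobid"                            # prints 849749
```

Waiting on an ID also sidesteps the earlier name-collision problem, since IDs are unique even when job names are reused.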
Hi Brian,
Seems to be working! The assembler has been running for a few hours and has progressed to canu.08.sh without any errors. Thank you for your patience in figuring out the issue!
Best, -Tim
One last update: The original issue seems to be entirely solved. Canu went through the whole pipeline and produced contigs that appear reasonably correct based on other data. Closing the issue on that basis.
I did notice some duplicate genes in the contigs, presumably due to heterozygosity, as this is a pooled sample. Could you point me towards the most current discussions on tweaking parameters for heterozygous assembly?
Dear Canu developers,
I've run into a canu issue with LSF job submissions that seems to be halting the assembly process, using the most recent version of canu (77d84666cefcadf7f8d6ca5c18fbe1ae60f4c233).
The overall canu job was submitted as follows:
canu seemed to get through canu.01.sh, canu.02.sh, and canu.03.sh, and halted on canu.04.sh. I don't think there is anything malformed in the reads. Log ("canu.out") is below:
The issue seems to be with the second bsub submission, which uses the parameter -w "ended("cormhap_canu_troubleshooting_r1[1-77]")". Perhaps the formatting of the parameter is incorrect, or perhaps it needs to be submitted while the jobs are present in the queue (i.e., it is being submitted too early, before LSF knows about the jobs, or too late, after it has forgotten about them?)
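If the formatting is the culprit, one candidate is the quoting: in `-w "ended("name")"` the inner double quotes terminate and reopen the outer quoted string, so the shell strips them before bsub ever sees the expression. Wrapping the whole dependency in single quotes would preserve them; a hedged sketch (it is an assumption that bsub needs the name quoted here, e.g. because of the `[1-77]` array suffix):

```shell
# Single quotes keep the inner double quotes intact for bsub's parser:
#   as logged:    -w "ended("name[1-77]")"   # inner quotes stripped by the shell
#   alternative:  -w 'ended("name[1-77]")'   # quotes reach bsub verbatim
dep='ended("cormhap_canu_troubleshooting_r1[1-77]")'
printf '%s\n' "$dep"
```

With the single-quoted form, `bsub -w "$dep" ...` passes the expression through unchanged.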
Oddly enough, despite the parent canu script exiting & giving the log shown above, something was submitted, as I'm still seeing jobs on the cluster from canu:
Thoughts? Unfortunately I'm not too familiar with array jobs on LSF, but I would be happy to run any troubleshooting needed.
All the best, -Tim