marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

canu failed with 'canu iteration count too high, stopping pipeline (most likely a problem in the grid-based computes)' #418

Closed jjoets closed 7 years ago

jjoets commented 7 years ago

Hi,

I ran this command line:

module load bioinfo/Java8
/work/project/amaizing/bin/canu/Linux-amd64/bin/canu -s /work/project/amaizing/PacBio_FV2/canu.spec \
 -p F2PacBio1 -d F2PacBio1 genomeSize=2.3g \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_01262016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_02252016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_03112016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_02292016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_03042016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_12152015.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_10112016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_09212016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_10192016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_08312016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_05092016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_05202016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_05132016.fastq.gz' \
 -pacbio-raw '/work/project/amaizing/PacBio_FV2/run_05062016.fastq.gz'

with this .spec file

gridEngineArrayOption=-t 1-100
maxMemory=200
maxThreads=15
gridOptions=-V -q workq -l h_vmem=500G -l myarch=intel
gridEngineThreadsOption=-pe parallel_smp THREADS
gridEngineMemoryOption=-l mem=MEMORY

In a first attempt, I ran Canu without the gridEngineArrayOption=-t 1-100 option, but the cluster came close to collapsing under the I/O load. So we killed the job, set the gridEngineArrayOption option, and ran Canu again (same command line).

But now I get:

-- Canu v1.4 (+126 commits) r8121 35a6573af22a9388f357fc392960fe92d95b1665.
-- Detected Java(TM) Runtime Environment '1.8.0_45' (from 'java').
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 40 CPUs and 252 gigabytes of memory.
-- Limited to 200 gigabytes from maxMemory option.
-- Limited to 15 CPUs from maxThreads option.
-- Detected Sun Grid Engine in '/SGE/ogs/default'.
-- User supplied Grid Engine environment '-pe parallel_smp THREADS'.
-- User supplied Grid Engine consumable '-l mem=MEMORY'.
-- 
-- Found   1 host  with  48 cores and  386 GB memory under Sun Grid Engine control.
-- Found   1 host  with  64 cores and 1009 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  252 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  504 GB memory under Sun Grid Engine control.
-- Found   1 host  with 240 cores and 2932 GB memory under Sun Grid Engine control.
-- Found   1 host  with  48 cores and  370 GB memory under Sun Grid Engine control.
-- Found  29 hosts with  48 cores and  378 GB memory under Sun Grid Engine control.
-- Found  68 hosts with  40 cores and  252 GB memory under Sun Grid Engine control.
--
-- Run under grid control using   84 GB and  12 CPUs for stage 'meryl'.
-- Run under grid control using   42 GB and   8 CPUs for stage 'mhap (cor)'.
-- Run under grid control using   16 GB and   8 CPUs for stage 'overlapper (obt)'.
-- Run under grid control using   16 GB and   8 CPUs for stage 'overlapper (utg)'.
-- Run under grid control using   21 GB and   4 CPUs for stage 'falcon_sense'.
-- Run under grid control using    4 GB and   1 CPU  for stage 'ovStore bucketizer'.
-- Run under grid control using   32 GB and   1 CPU  for stage 'ovStore sorting'.
-- Run under grid control using   16 GB and   8 CPUs for stage 'read error detection'.
-- Run under grid control using    4 GB and   1 CPU  for stage 'overlap error adjustment'.
-- Run under grid control using  128 GB and  15 CPUs for stage 'bogart'.
-- Run under grid control using   64 GB and   8 CPUs for stage 'consensus'.
--
-- Generating assembly 'F2PacBio1' in '/work/project/amaizing/PacBio_FV2/F2PacBio1'
--
-- Parameters:
--
--  genomeSize        2300000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.0450 (  4.50%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0450 (  4.50%)
--
--
-- BEGIN CORRECTION
--
--
-- 859 overlap store sorter jobs failed:
--   job correction/F2PacBio1.ovlStore.BUILDING/1070 FAILED.
--   job correction/F2PacBio1.ovlStore.BUILDING/1075 FAILED.
...
--   job correction/F2PacBio1.ovlStore.BUILDING/1997 FAILED.
--   job correction/F2PacBio1.ovlStore.BUILDING/1998 FAILED.
--   job correction/F2PacBio1.ovlStore.BUILDING/1999 FAILED.
--
================================================================================
Don't panic, but a mostly harmless error occurred and canu failed.

canu failed with 'canu iteration count too high, stopping pipeline (most likely a problem in the grid-based computes)'.

Your job has been killed.
This may happen if one of the followings hold : 
- you exceeded one of the queue/job limits (run time, memory, etc)
- you (or admin) killed the job using qdel 
- something bad happened. 
Now, just in case something bad happened, here are the debug information about your job : 
total 0
JOB INFO : 
total 44
-rw-r--r-- 1 sgeadmin sgeadmin    34 22 mars  14:53 pe_hostfile
-rw-r--r-- 1 sgeadmin sgeadmin 11816 22 mars  14:53 environment
-rw-r--r-- 1 sgeadmin sgeadmin  2379 22 mars  14:53 config
-rw-r--r-- 1 sgeadmin sgeadmin     6 22 mars  14:53 pid
-rw-r--r-- 1 sgeadmin sgeadmin     6 22 mars  14:53 job_pid
-rw-r--r-- 1 jjoets   UMRGQE       0 22 mars  14:53 exit_status
-rw-r--r-- 1 jjoets   UMRGQE       0 22 mars  14:53 error
-rw-r--r-- 1 sgeadmin sgeadmin     6 22 mars  14:53 addgrpid
-rw-r--r-- 1 sgeadmin sgeadmin   318 22 mars  14:53 usage
-rw-r--r-- 1 jjoets   UMRGQE    4789 22 mars  14:53 trace
STATUS : 
TRACE : 
03/22/2017 14:53:38 [500:56788]: shepherd called with uid = 0, euid = 500
03/22/2017 14:53:38 [500:56788]: starting up 2011.11p1
03/22/2017 14:53:38 [500:56788]: setpgid(56788, 56788) returned 0
03/22/2017 14:53:38 [500:56788]: do_core_binding: "binding" parameter not found in config file
03/22/2017 14:53:38 [500:56788]: no prolog script to start
03/22/2017 14:53:38 [500:56788]: no pe_start script to start
03/22/2017 14:53:38 [500:56788]: parent: forked "job" with pid 56789
03/22/2017 14:53:38 [500:56788]: parent: job-pid: 56789
03/22/2017 14:53:38 [500:56789]: child: starting son(job, /var/spool/sge/node011/job_scripts/8359737, 0);
03/22/2017 14:53:38 [500:56789]: pid=56789 pgrp=56789 sid=56789 old pgrp=56788 getlogin()=<no login set>
03/22/2017 14:53:38 [500:56789]: reading passwd information for user 'jjoets'
03/22/2017 14:53:38 [500:56789]: setosjobid: uid = 0, euid = 500
03/22/2017 14:53:38 [500:56789]: setting limits
03/22/2017 14:53:38 [500:56789]: RLIMIT_CPU setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
03/22/2017 14:53:38 [500:56789]: RLIMIT_FSIZE setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
03/22/2017 14:53:38 [500:56789]: RLIMIT_DATA setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
03/22/2017 14:53:38 [500:56789]: RLIMIT_STACK setting: (soft 268435456, hard 268435456) resulting: (soft 268435456, hard 268435456)
03/22/2017 14:53:38 [500:56789]: RLIMIT_CORE setting: (soft 0, hard 0) resulting: (soft 0, hard 0)
03/22/2017 14:53:38 [500:56789]: RLIMIT_MEMLOCK setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
03/22/2017 14:53:38 [500:56789]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 536870912000, hard 536870912000) resulting: (soft 536870912000, hard 536870912000)
03/22/2017 14:53:38 [500:56789]: RLIMIT_RSS setting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY)) resulting: (soft 18446744073709551615(INFINITY), hard 18446744073709551615(INFINITY))
03/22/2017 14:53:38 [500:56789]: setting environment
03/22/2017 14:53:38 [500:56789]: Initializing error file
03/22/2017 14:53:38 [500:56789]: switching to intermediate/target user
03/22/2017 14:53:38 [14241:56789]: closing all filedescriptors
03/22/2017 14:53:38 [14241:56789]: further messages are in "error" and "trace"
03/22/2017 14:53:38 [14241:56789]: using stdout as stderr
03/22/2017 14:53:38 [14241:56789]: now running with uid=14241, euid=14241
03/22/2017 14:53:38 [14241:56789]: execvp(/var/spool/sge/node011/job_scripts/8359737, "/var/spool/sge/node011/job_scripts/8359737")
03/22/2017 14:53:40 [500:56788]: wait3 returned 56789 (status: 256; WIFSIGNALED: 0,  WIFEXITED: 1, WEXITSTATUS: 1)
03/22/2017 14:53:40 [500:56788]: job exited with exit status 1
03/22/2017 14:53:40 [500:56788]: reaped "job" with pid 56789
03/22/2017 14:53:40 [500:56788]: job exited not due to signal
03/22/2017 14:53:40 [500:56788]: job exited with status 1
03/22/2017 14:53:40 [500:56788]: now sending signal KILL to pid -56789
03/22/2017 14:53:40 [500:56788]: writing usage file to "usage"
03/22/2017 14:53:40 [500:56788]: no tasker to notify
03/22/2017 14:53:40 [500:56788]: no pe_stop script to start
03/22/2017 14:53:40 [500:56788]: parent: forked "epilog" with pid 57275
03/22/2017 14:53:40 [500:56788]: using signal delivery delay of 120 seconds
03/22/2017 14:53:40 [500:56788]: parent: epilog-pid: 57275
03/22/2017 14:53:40 [500:57275]: child: starting son(epilog, /SGE/ogs/inra/tools/epilog_verbose.sh, 0);
03/22/2017 14:53:40 [500:57275]: pid=57275 pgrp=57275 sid=57275 old pgrp=56788 getlogin()=<no login set>
03/22/2017 14:53:40 [500:57275]: reading passwd information for user 'jjoets'
03/22/2017 14:53:40 [500:57275]: setting limits
03/22/2017 14:53:40 [500:57275]: setting environment
03/22/2017 14:53:40 [500:57275]: Initializing error file
03/22/2017 14:53:40 [500:57275]: switching to intermediate/target user
03/22/2017 14:53:40 [14241:57275]: closing all filedescriptors
03/22/2017 14:53:40 [14241:57275]: further messages are in "error" and "trace"
03/22/2017 14:53:40 [14241:57275]: using "/bin/bash" as shell of user "jjoets"
03/22/2017 14:53:40 [14241:57275]: using stdout as stderr
03/22/2017 14:53:40 [14241:57275]: now running with uid=14241, euid=14241
03/22/2017 14:53:40 [14241:57275]: execvp(/SGE/ogs/inra/tools/epilog_verbose.sh, "/SGE/ogs/inra/tools/epilog_verbose.sh")
ERROR : 
Messages file :
End debugging information.
Epilog : job finished at mer. mars 22 14:53:40 CET 2017

Thanks for your help

Johann

skoren commented 7 years ago

You shouldn't set gridEngineArrayOption=-t 1-100; that forces every submission to have exactly 100 jobs, which is not correct. Canu decides how many jobs to submit based on your coverage, genome size, and available processes. If you want to limit concurrency, use the -tc option instead (gridEngineArrayOption="-t ARRAY_JOBS -tc 100"), which limits every array job to at most 100 concurrent instances. You could also use a parallel environment that restricts the number of CPUs which can be assigned at one time.
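For example, keeping everything else in the original spec unchanged, the array option could be rewritten as below (the cap of 100 concurrent tasks is only an illustration; pick whatever your cluster tolerates):

gridEngineArrayOption=-t ARRAY_JOBS -tc 100
maxMemory=200
maxThreads=15
gridOptions=-V -q workq -l h_vmem=500G -l myarch=intel
gridEngineThreadsOption=-pe parallel_smp THREADS
gridEngineMemoryOption=-l mem=MEMORY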

Now for the error: what's the output of one of the failed jobs (something like correction/asm.ovlStore.BUILDING/logs/2-sort.1070.out)?
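The exact log name depends on your assembly prefix and the failed job number; for this run, the jobs listed in the error message above would correspond to roughly the following, run from the assembly directory (/work/project/amaizing/PacBio_FV2/F2PacBio1):

tail -n 50 correction/F2PacBio1.ovlStore.BUILDING/logs/2-sort.1070.out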

jjoets commented 7 years ago

Thanks a lot for your advice. I updated the .spec file and restarted Canu, which now looks to be running fine. Wait and see.

Unfortunately, the ovlStore.BUILDING directory has since disappeared, so I cannot copy the file contents. However, as far as I remember, the logs looked fine until they were suddenly interrupted by an SGE message similar to the one I reported yesterday.

Johann

skoren commented 7 years ago

A time limit or memory limit is the likeliest reason for the jobs being terminated. You could try specifying a longer time limit using gridOptions, which gets passed through to all jobs.
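On SGE, run time is usually requested with the h_rt resource; assuming your queue enforces such a limit, the gridOptions line from the spec could be extended along these lines (the 96-hour value and the resource name are site-dependent, so check with your admins):

gridOptions=-V -q workq -l h_rt=96:00:00 -l h_vmem=500G -l myarch=intel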

Glad the resume is working for you; if you encounter other errors, please open a new issue.