marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

SGE general settings #40

Closed dcopetti closed 8 years ago

dcopetti commented 8 years ago

Hello, we are having issues running Canu on our cluster. The error says: canu failed with 'can't configure for SGE'.

The command is:

/opt/canu/Linux-amd64/bin/canu -d /home/../250k_assembly -p test250k genomeSize=380m corMinCoverage=2 errorRate=0.18 -pacbio-raw input_subreads.fa

We have a PSSC cluster with 4 nodes, each with 12 cores and 32 GB RAM, so a total of 48 cores and 128 GB RAM. The SGE is GE 6.2u5p3 and the nodes run CentOS release 6.5 (Final).

Following the documentation, we tested all the 6x2 options at Grid Engine Configuration (here is one example):

754 $global{"gridEngineThreadsOption"} = "-pe make THREADS";
755 $global{"gridEngineMemoryOption"}  = "-l h_vmem=MEMORY";

but always got the same error.

How do we set up the assembler to run on our cluster? Will we then be able to limit the resources it uses (cores, CPUs, memory), so that we can have other processes running at the same time? Thanks

brianwalenz commented 8 years ago

Don't change the values in Defaults.pm, those should be passed in on the command line. If you really want to change the code, make the change in Grid_SGE.pm.

We used to allow (in Celera Assembler) a dotfile to hold common settings. It's not enabled in canu at the moment, for reasons I can't remember. It read either a dotfile in your home directory, or one in the binary directory. I'll fix that tomorrow.

skoren commented 8 years ago

This is most likely because Canu can't find the appropriate ways to request memory on your SGE configuration. You can control these by adding the options:

gridEngineMemoryOption="-l h_vmem=MEMORY"
gridEngineThreadsOption="-pe make THREADS"

to your command line. As Brian said, you shouldn't need to modify the code.

That said, I don't think either h_vmem or make are options you should use. At least on systems I've seen, h_vmem is not a consumable resource, which means two jobs requesting 60G at the same time could get scheduled on a single 60G machine and then try to use a total of 120G, bringing the machine down. You want a memory parameter that locks the memory for a process; if you don't have an option like that, you would need to add one. You can check the available memory options using:

% qconf -sc|grep MEMORY
#name                     shortcut        type        relop   requestable consumable default  urgency 
#-----------------------------------------------------------------------------------------------------
h_vmem                    h_vmem          MEMORY      <=      YES         NO         0        0
mem_free                  mf              MEMORY      <=      YES         YES        0        0
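If no memory complex is consumable, a Grid Engine administrator can usually make one so. A sketch of the procedure (requires admin rights; the resource name and the 31G value below are just examples matching the cluster described in this thread, and exact fields vary by Grid Engine version):

```shell
# Sketch only - needs SGE admin rights; names and values are examples.

# 1. Edit the complex list and set the "consumable" column for
#    mem_free to YES, so the scheduler subtracts requests from a pool:
#      mem_free    mf    MEMORY    <=    YES    YES    0    0
qconf -mc

# 2. Advertise how much of the resource each execution host owns,
#    e.g. 31G on the nodes in this thread (repeat per host):
qconf -me n001      # set: complex_values mem_free=31G
```

Jobs submitted with `-l mem_free=12g` would then reserve 12 GB from that per-host pool instead of merely filtering on free memory at schedule time.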

The make parallel environment is also usually not configured for multi-threaded jobs. On our system:

% qconf -sp make
pe_name            make
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary TRUE
qsort_args         NONE

Round robin will assign slots from different machines to the job: $round_robin - select one slot from each host in a round-robin fashion until all job slots are assigned. This setting can result in more than one job slot per host.

You would want pe_slots which ensures all jobs are on the same machine. $pe_slots - place all the job slots on a single machine. Grid Engine will only schedule such a job to a machine that can host the maximum number of slots requested by the job.
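If no parallel environment with `$pe_slots` exists, an administrator can add one. A minimal sketch (the name `thread` and the slot count are arbitrary examples; the fields mirror the `qconf -sp` output shown above, with `allocation_rule` set to `$pe_slots`):

```shell
# Sketch only - needs SGE admin rights; "thread" is an example name.
cat > thread.pe <<'EOF'
pe_name            thread
slots              999
user_lists         NONE
xuser_lists        NONE
start_proc_args    NONE
stop_proc_args     NONE
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE
EOF
qconf -Ap thread.pe                        # add the PE from the file
qconf -aattr queue pe_list thread all.q    # attach it to a queue
```

After this, `qsub -pe thread 8` would place all 8 slots on one machine, which is what a multi-threaded job needs.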

dcopetti commented 8 years ago

Hi, I tried to add your parameters, and it gives a longer error:

With -pe make THREADS:

[smrtanalysis@pac canu_test]$ /opt/canu/Linux-amd64/bin/canu -d /../250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe make THREADS" errorRate=0.18 -pacbio-raw N22_42cells_250k_subreads.fa
-- Detected Java(TM) Runtime Environment '1.8.0_66' (from 'java').
-- Detected 12 CPUs and 31 gigabytes of memory.
-- Detected Sun Grid Engine in '/usr/share/gridengine/default'.
-- User supplied Grid Engine environment '-pe make THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY'.
-- Found 4 hosts with 12 cores and 31 GB memory under Sun Grid Engine control.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 8 GB memory for stage 'overlap store sequential building'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 10 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 12 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'falcon_sense (read correction)'.
-- Starting command on Wed Feb 3 09:11:12 2016 with 339.3 GB free disk space
qsub \
  -l h_vmem=12g \
  -pe make 1 \
  -cwd \
  -N "canu_N22_test250k" \
  -j y \
  -o /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.out \
  /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.sh
Unable to run job: job rejected: the requested parallel environment "make" does not exist.
Exiting.
-- Finished on Wed Feb 3 09:11:12 2016 (lickety-split) with 339.3 GB free disk space

ERROR: Failed with signal HUP (1)

Please panic. canu failed, and it shouldn't have.
Stack trace:
 at /opt/canu/Linux-amd64/bin/lib/canu/Defaults.pm line 230.
 canu::Defaults::caFailure("Failed to submit script", undef) called at /opt/canu/Linux-amd64/bin/lib/canu/Execution.pm line 851
 canu::Execution::submitScript("/home/smrtanalysis/dario_test/canu_test/250k_assembly", "N22_test250k", undef) called at /opt/canu/Linux-amd64/bin/canu line 312

canu failed with 'Failed to submit script'.

We also tried a different option for -pe.

With -pe thread THREADS:

[smrtanalysis@pac canu_test]$ /opt/canu/Linux-amd64/bin/canu -d /../250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe thread THREADS" errorRate=0.18 -pacbio-raw N22_42cells_250k_subreads.fa
-- Detected Java(TM) Runtime Environment '1.8.0_66' (from 'java').
-- Detected 12 CPUs and 31 gigabytes of memory.
-- Detected Sun Grid Engine in '/usr/share/gridengine/default'.
-- User supplied Grid Engine environment '-pe thread THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY'.
-- Found 4 hosts with 12 cores and 31 GB memory under Sun Grid Engine control.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 8 GB memory for stage 'overlap store sequential building'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 10 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 12 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'falcon_sense (read correction)'.
-- Starting command on Wed Feb 3 09:14:40 2016 with 339.3 GB free disk space
qsub \
  -l h_vmem=12g \
  -pe thread 1 \
  -cwd \
  -N "canu_N22_test250k" \
  -j y \
  -o /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.out \
  /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.sh
Unable to run job: job rejected: the requested parallel environment "thread" does not exist.
Exiting.
-- Finished on Wed Feb 3 09:14:40 2016 (lickety-split) with 339.3 GB free disk space

ERROR: Failed with signal HUP (1)

Please panic. canu failed, and it shouldn't have.
Stack trace:
 at /opt/canu/Linux-amd64/bin/lib/canu/Defaults.pm line 230.
 canu::Defaults::caFailure("Failed to submit script", undef) called at /opt/canu/Linux-amd64/bin/lib/canu/Execution.pm line 851
 canu::Execution::submitScript("/home/smrtanalysis/dario_test/canu_test/250k_assembly", "N22_test250k", undef) called at /opt/canu/Linux-amd64/bin/canu line 312

canu failed with 'Failed to submit script'.

Our system does not have mpi or smp. Thanks,

Dario


Dario Copetti, PhD Research Associate | Arizona Genomics Institute University of Arizona | BIO5

1657 E. Helen St. Tucson, AZ 85721, USA www.genome.arizona.edu

skoren commented 8 years ago

The error message is listed in the canu output. Your qsub command did not accept the pe make option, it says that parallel environment does not exist:

Unable to run job: job rejected: the requested parallel environment
"make" does not exist.

Your system must not have make defined. You can check which parallel environments you have with:

% qconf -spl
make
make-dedicated
thread

You can check each one to see if it does pe_slots scheduling with qconf -sp <pe_name>. If none of them uses pe_slots scheduling, you would need to add one; otherwise there is no way for a multi-threaded program to run on a single node of your cluster.

Also, errorRate is an optional parameter and refers to the error rate in the corrected reads, not the raw input data, so 0.18 is too high. I'd leave it at the default, or maybe 0.035 as suggested for low coverage in the documentation.

dcopetti commented 8 years ago

I ran your commands, and we actually have smp:

[smrtanalysis@pac canu_test]$ qconf -spl
smp

[smrtanalysis@pac canu_test]$ qconf -sp smp
pe_name            smp
slots              200
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $pe_slots
control_slaves     FALSE
job_is_first_task  TRUE
urgency_slots      min
accounting_summary FALSE

qconf -sc gives a long list:

[smrtanalysis@pac canu_test]$ qconf -sc|grep MEMORY
h_core          h_core   MEMORY  <=  YES  NO  0  0
h_data          h_data   MEMORY  <=  YES  NO  0  0
h_fsize         h_fsize  MEMORY  <=  YES  NO  0  0
h_rss           h_rss    MEMORY  <=  YES  NO  0  0
h_stack         h_stack  MEMORY  <=  YES  NO  0  0
h_vmem          h_vmem   MEMORY  <=  YES  NO  0  0
mem_free        mf       MEMORY  <=  YES  NO  0  0
mem_total       mt       MEMORY  <=  YES  NO  0  0
mem_used        mu       MEMORY  >=  YES  NO  0  0
s_core          s_core   MEMORY  <=  YES  NO  0  0
s_data          s_data   MEMORY  <=  YES  NO  0  0
s_fsize         s_fsize  MEMORY  <=  YES  NO  0  0
s_rss           s_rss    MEMORY  <=  YES  NO  0  0
s_stack         s_stack  MEMORY  <=  YES  NO  0  0
s_vmem          s_vmem   MEMORY  <=  YES  NO  0  0
swap_free       sf       MEMORY  <=  YES  NO  0  0
swap_rate       sr       MEMORY  >=  YES  NO  0  0
swap_rsvd       srsv     MEMORY  >=  YES  NO  0  0
swap_total      st       MEMORY  <=  YES  NO  0  0
swap_used       su       MEMORY  >=  YES  NO  0  0
virtual_free    vf       MEMORY  <=  YES  NO  0  0
virtual_total   vt       MEMORY  <=  YES  NO  0  0
virtual_used    vu       MEMORY  >=  YES  NO  0  0

So if I run the command with smp, it says the job has been submitted, but there is no activity on the cluster and it actually finishes right away (the -d folder contains canu-logs and canu-scripts subfolders):

[smrtanalysis@pac canu_test]$ /opt/canu/Linux-amd64/bin/canu -d /home/smrtanalysis/dario_test/canu_test/250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe smp THREADS" errorRate=0.18 -pacbio-raw N22_42cells_250k_subreads.fa
-- Detected Java(TM) Runtime Environment '1.8.0_66' (from 'java').
-- Detected 12 CPUs and 31 gigabytes of memory.
-- Detected Sun Grid Engine in '/usr/share/gridengine/default'.
-- User supplied Grid Engine environment '-pe smp THREADS'.
-- User supplied Grid Engine consumable '-l h_vmem=MEMORY'.
-- Found 4 hosts with 12 cores and 31 GB memory under Sun Grid Engine control.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 12 compute threads and 13 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 12 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run under grid control, and use up to 8 compute threads and 31 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run under grid control, and use up to 1 compute thread and 8 GB memory for stage 'overlap store sequential building'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run under grid control, and use up to 1 compute thread and 10 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run under grid control, and use up to 1 compute thread and 2 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 6 compute threads and 8 GB memory for stage 'overlapper'.
-- Allowed to run under grid control, and use up to 12 compute threads and 31 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run under grid control, and use up to 6 compute threads and 15 GB memory for stage 'falcon_sense (read correction)'.
-- Starting command on Wed Feb 3 10:00:04 2016 with 339.3 GB free disk space
qsub \
  -l h_vmem=12g \
  -pe smp 1 \
  -cwd \
  -N "canu_N22_test250k" \
  -j y \
  -o /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.out \
  /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.sh
Your job 65592 ("canu_N22_test250k") has been submitted
-- Finished on Wed Feb 3 10:00:04 2016 (lickety-split) with 339.3 GB free disk space

If it can help, canu-scripts/canu.01.out says:

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
if: Expression Syntax.

Here they explain how to invoke smp: http://bioinformatics.mdc-berlin.de/intro2UnixandSGE/sun_grid_engine_for_beginners/parallel_environments.html

If I run it with -pe smp 4, it gives this error:

[smrtanalysis@pac canu_test]$ /opt/canu/Linux-amd64/bin/canu -d /home/smrtanalysis/dario_test/canu_test/250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe smp 4" errorRate=0.18 -pacbio-raw N22_42cells_250k_subreads.fa
-- Detected Java(TM) Runtime Environment '1.8.0_66' (from 'java').
-- Detected 12 CPUs and 31 gigabytes of memory.
-- Detected Sun Grid Engine in '/usr/share/gridengine/default'.

Please panic. canu failed, and it shouldn't have.
Stack trace:
 at /opt/canu/Linux-amd64/bin/lib/canu/Defaults.pm line 230.
 canu::Defaults::caFailure("Couldn't parse gridEngineThreadsOption='-pe smp 4'", undef) called at /opt/canu/Linux-amd64/bin/lib/canu/Grid_SGE.pm line 126
 canu::Grid_SGE::configureSGE() called at /opt/canu/Linux-amd64/bin/canu line 269

canu failed with 'Couldn't parse gridEngineThreadsOption='-pe smp 4''.

Maybe it is not the right smp.

Any suggestion is welcome. Thanks,

Dario


skoren commented 8 years ago

Your first option (with THREADS, not 4) is correct. The error you're getting:

if: Expression Syntax.

is issue #21. Until it's fixed you need to explicitly tell your SGE scheduler to run the jobs under bash. You can do this by adding:

gridOptions="-S /bin/bash"

or whatever the path to your bash is.

dcopetti commented 8 years ago

We made some progress. First we logged in to a specific node (qlogin -l h=n002); then, after making sure that we have the right java version, we launched the command:

[smrtanalysis@n002 ~]$ /opt/canu/Linux-amd64/bin/canu -d /home/smrtanalysis/dario_test/canu_test/250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe smp THREADS" gridOptions="-S /bin/bash" errorRate=0.18 -pacbio-raw /home/smrtanalysis/dario_test/canu_test/N22_42cells_250k_subreads.fa

which printed these lines:

-- Starting command on Wed Feb 3 13:21:28 2016 with 339.3 GB free disk space
qsub \
  -l h_vmem=12g \
  -pe smp 1 \
  -S /bin/bash \
  -cwd \
  -N "canu_N22_test250k" \
  -j y \
  -o /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.02.out \
  /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.02.sh
Your job 65597 ("canu_N22_test250k") has been submitted
-- Finished on Wed Feb 3 13:21:28 2016 (lickety-split) with 339.3 GB free disk space

The output folder contains canu-logs, canu-scripts, and correction folders, and a correction.html file. The 0-mercounts folder has .mcdat and .mcidx files, and canu-logs has gatekeeper logs.

Looks like we are moving ahead a bit :-)

Dario


skoren commented 8 years ago

Is the java version different in the head node environment versus the qsubbed job? You can add -V to your gridOptions line, which should force the environment to be preserved in your submitted command.

dcopetti commented 8 years ago

With gridOptions="-V -S /bin/bash" in the command, it still gives me the error:

-o /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.02.out /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.02.sh
Your job 65607 ("canu_N22_test250k") has been submitted
-- Finished on Wed Feb 3 17:59:31 2016 (lickety-split) with 339.3 GB free disk space

Dario


skoren commented 8 years ago

I didn't see an error in your last comment. It submitted the job to the grid which is OK, the output will now be in canu.02.out and the job will progress in the background on the grid.

dcopetti commented 8 years ago

You are right: something must be moving, because two nodes have some activity now:

[smrtanalysis@pac canu_test]$ qhost
HOSTNAME  ARCH        NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
global    -           -     -     -       -       -       -
n001      lx26-amd64  12    1.00  31.3G   749.0M  7.5G    148.5M
n002      lx26-amd64  12    2.85  31.3G   1.2G    7.5G    1.1G
n003      lx26-amd64  12    1.00  31.3G   579.8M  7.5G    196.1M
pac       lx26-amd64  12    3.16  31.3G   5.0G    2.0G    2.0G

[smrtanalysis@pac canu_test]$ qstat
job-ID  prior    name        user          state  submit/start at      queue                          slots  ja-task-ID
 65596  0.55500  QLOGIN      smrtanalysis  r      02/03/2016 10:58:38  all.q@n002.genome.arizona.edu  1
 65605  0.55500  canu_N22_t  smrtanalysis  r      02/03/2016 15:50:08  all.q@n002.genome.arizona.edu  1
 65606  0.55500  canu_N22_t  smrtanalysis  r      02/03/2016 15:50:08  all.q@n002.genome.arizona.edu  1
 65602  0.60500  cormhap_N2  smrtanalysis  Eqw    02/03/2016 13:57:40                                 12     1-9:1
 65600  0.50500  canu_N22_t  smrtanalysis  Eqw    02/03/2016 13:57:11                                 1
 65604  0.50500  canu_N22_t  smrtanalysis  Eqw    02/03/2016 13:57:55                                 1
 65603  0.00000  canu_N22_t  smrtanalysis  hqw    02/03/2016 13:57:40                                 1

For the processes in Eqw status, I see:

[smrtanalysis@pac canu_test]$ qstat -j 65604

job_number:           65604
exec_file:            job_scripts/65604
submission_time:      Wed Feb 3 13:57:55 2016
owner:                smrtanalysis
uid:                  601
group:                smrtanalysis
gid:                  601
sge_o_home:           /home/smrtanalysis
sge_o_log_name:       smrtanalysis
sge_o_path:           /usr/share/gridengine/bin/lx26-amd64:/usr/lib64/qt-3.3/bin:/usr/NX/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/pssc/bin:/opt/openmpi/bin:/opt/torque/bin:/opt/torque/sbin:/home/smrtanalysis/bin:/opt/smrtanalysis/install/smrtanalysis-2.3.0.140936/analysis/bin/:/opt/tools/:/opt/tools/amos-3.1.0
sge_o_shell:          /bin/bash
sge_o_workdir:        /home/smrtanalysis/dario_test/canu_test/250k_assembly
sge_o_host:           pac
account:              sge
cwd:                  /home/smrtanalysis/dario_test/canu_test/250k_assembly
merge:                y
hard resource_list:   h_vmem=12g
mail_list:            smrtanalysis@pac.genome.arizona.edu
notify:               FALSE
job_name:             canu_N22_test250k
stdout_path_list:     NONE:NONE:/home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.out
jobshare:             0
shell_list:           NONE:/bin/bash
env_list:             HOSTNAME=pac.genome.arizona.edu,SHELL=/bin/bash,TERM=xterm,HISTSIZE=1000,SSH_CLIENT=128.196.149.30 55610 22,SGE_CELL=default,OLDPWD=/home/smrtanalysis/dario_test/canu_test,QTDIR=/usr/lib64/qt-3.3,QTINC=/usr/lib64/qt-3.3/include,SSH_TTY=/dev/pts/3,USER=smrtanalysis,LS_COLORS= [...] ,CANU_DIRECTORY=/home/smrtanalysis/dario_test/canu_test/250k_assembly,PATH=/usr/share/gridengine/bin/lx26-amd64:/usr/lib64/qt-3.3/bin:/usr/NX/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/opt/pssc/bin:/opt/openmpi/bin:/opt/torque/bin:/opt/torque/sbin:/home/smrtanalysis/bin:/opt/smrtanalysis/install/smrtanalysis-2.3.0.140936/analysis/bin/:/opt/tools/:/opt/tools/amos-3.1.0,MAIL=/var/spool/mail/smrtanalysis,NXDIR=/usr/NX,PWD=/home/smrtanalysis/dario_test/canu_test/250k_assembly,SGE_EXECD_PORT=6445,LANG=en_US.UTF-8,MODULEPATH=/usr/share/Modules/modulefiles:/etc/modulefiles,SGE_QMASTER_PORT=6444,LOADEDMODULES=NONE,SGE_ROOT=/usr/share/gridengine,HISTCONTROL=ignoredups,SSH_ASKPASS=/usr/libexec/openssh/gnome-ssh-askpass,HOME=/home/smrtanalysis,SHLVL=2,LOGNAME=smrtanalysis,CVS_RSH=ssh,QTLIB=/usr/lib64/qt-3.3/lib,SSH_CONNECTION=128.196.149.30 55610 150.135.237.5 22,MODULESHOME=/usr/share/Modules,LESSOPEN=|/usr/bin/lesspipe.sh %s,SGE_CLUSTER_NAME=p6444,G_BROKEN_FILENAMES=1,BASH_FUNC_module()=() { eval /usr/bin/modulecmd bash $*, =/usr/share/gridengine/bin/lx26-amd64/qsub
script_file:          /home/smrtanalysis/dario_test/canu_test/250k_assembly/canu-scripts/canu.01.sh
parallel environment: smp range: 1
error reason 1:       02/03/2016 15:59:22 [601:6438]: error: can't open output file "/home/smrtanalysis/dario_test/canu_te
scheduling info:      (Collecting of scheduler job information is turned off)

Is the last line telling us something? Or are those three maybe just waiting for the three in r mode above?

Now, how does Canu know how many nodes/cores to use? Does it use all of the cluster's resources, or can I tell it to use some and leave some cores for other computation? Thanks,

Dario


skoren commented 8 years ago

The E state jobs might just be your previous failed runs. I would erase them from the queue.

Canu parses the qhost output to detect the machines in your cluster and picks job sizes that would enable it to run across the most machines, given the resources available and the genome size you're assembling. Several steps are submitted as large array jobs, in which case it could potentially consume a large part of your cluster, depending on how your scheduler works. The SGE scheduler might restrict the number of cores that a user/parallel environment can request. Generally, I prefer letting the cluster scheduler manage the jobs rather than trying to manage the scheduling yourself. However, you can customize each step if you want, for example by using the -tc parameter to restrict the number of array tasks that can run in parallel at a time. For example, -tc 10 on a 100-task array would ensure only 10 tasks can be scheduled at a time, limiting the cores used by your job. You can see a list of grid options using:

canu -options |grep gridOptions
gridOptions                             Grid engine options applied to all jobs
gridOptionsExecutive                    Grid engine options applied to the canu executive script
gridOptionsJobName                      Grid jobs job-name suffix
gridOptionsbat                          Grid engine options applied to unitig construction jobs
gridOptionscns                          Grid engine options applied to unitig consensus jobs
gridOptionscor                          Grid engine options applied to read correction jobs
gridOptionscormhap                      Grid engine options applied to mhap overlaps for correction jobs
gridOptionscorovl                       Grid engine options applied to overlaps for correction jobs
gridOptionsmeryl                        Grid engine options applied to mer counting jobs
gridOptionsobtmhap                      Grid engine options applied to mhap overlaps for trimming jobs
gridOptionsobtovl                       Grid engine options applied to overlaps for trimming jobs
gridOptionsoea                          Grid engine options applied to overlap error adjustment jobs
gridOptionsovb                          Grid engine options applied to overlap store bucketizing jobs
gridOptionsovs                          Grid engine options applied to overlap store sorting jobs
gridOptionsred                          Grid engine options applied to read error detection jobs
gridOptionsutgmhap                      Grid engine options applied to mhap overlaps for unitig construction jobs
gridOptionsutgovl                       Grid engine options applied to overlaps for unitig construction jobs

You can add the -tc parameter to all of the options except gridOptionsExecutive and gridOptions, since all of the others are array jobs. These options are also additive, in that your -tc option will be included along with whatever canu picks for the other parameters.
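Putting this together with the fixes discussed earlier in the thread, a throttled run might look like the sketch below (the -tc value and the cormhap choice are illustrative examples, not recommendations; paths match this cluster):

```shell
# Sketch; caps the mhap-for-correction array at 2 concurrent tasks
# while keeping the bash (-S) and environment (-V) workarounds.
/opt/canu/Linux-amd64/bin/canu \
  -d 250k_assembly -p N22_test250k \
  genomeSize=380m \
  gridEngineMemoryOption="-l h_vmem=MEMORY" \
  gridEngineThreadsOption="-pe smp THREADS" \
  gridOptions="-V -S /bin/bash" \
  gridOptionscormhap="-tc 2" \
  -pacbio-raw N22_42cells_250k_subreads.fa
```

With 4 nodes of 12 cores, capping a 12-thread array at 2 concurrent tasks leaves roughly half the cluster free for other work during the mhap stage.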

dcopetti commented 8 years ago

Sergey,

The process seems to have run, and it stopped on a Java problem. The canu.02.out file says: ERROR: mhap overlapper requires java version at least 1.8.0; you have 1.7.0_51. But if I check the version on the node I ran it from, I see:

    [smrtanalysis@n002 canu_test]$ java -version
    java version "1.8.0_66"
    Java(TM) SE Runtime Environment (build 1.8.0_66-b17)
    Java HotSpot(TM) 64-Bit Server VM (build 25.66-b17, mixed mode)

Maybe we did not propagate the new Java to all the nodes yet? Ann, I doubt it.

If it can help, canu.01.out says:

    bash: BASH_FUNC_module(): line 0: syntax error near unexpected token `)'
    bash: BASH_FUNC_module(): line 0: `BASH_FUNC_module() () { eval /usr/bin/modulecmd bash $*'
    bash: error importing function definition for `BASH_FUNC_module'
    bash: module: line 1: syntax error: unexpected end of file
    bash: error importing function definition for `module'
    -- Detected Java(TM) Runtime Environment '1.8.0_66' (from 'java').

and at the bottom:

    -- Finished on Wed Feb 3 22:17:38 2016 (15974 seconds) with 338.4 GB free disk space

    runCommandSilently()
    gnuplot < /home/smrtanalysis/dario_test/canu_test/250k_assembly/correction/N22_test250k.gkpStore/readlengths.gp \
      > /dev/null 2>&1
    ERROR:  Failed with signal HUP (1)

(the same gnuplot error is reported twice)

Thanks,

Dario

On 02/03/2016 04:00 PM, Sergey Koren wrote:

I didn't see an error in your last comment. It submitted the job to the grid which is OK, the output will now be in canu.02.out and the job will progress in the background on the grid.

— Reply to this email directly or view it on GitHub https://github.com/marbl/canu/issues/40#issuecomment-179519476.

Dario Copetti, PhD Research Associate | Arizona Genomics Institute University of Arizona | BIO5

1657 E. Helen St. Tucson, AZ 85721, USA www.genome.arizona.edu

dcopetti commented 8 years ago

I just checked the nodes, 2 of them still have the older version, sorry for that.

In a file I saw that the scripts inside use a .fastq as input: I am using a fasta, do you think this will cause a problem? It would be such a stupid error from me; I am probably confusing it with Falcon.

Dario


brianwalenz commented 8 years ago

fasta is no problem, we use it almost exclusively.

dcopetti commented 8 years ago

After setting the latest Java on all nodes, I ran it again and it went on for a while. After it ended, I found these lines in some output files that could be diagnostic:

canu.03.out:

    bash: BASH_FUNC_module(): line 0: syntax error near unexpected token `)'
    bash: BASH_FUNC_module(): line 0: `BASH_FUNC_module() () { eval /usr/bin/modulecmd bash $*'
    bash: error importing function definition for `BASH_FUNC_module'

and canu.05.out says: canu failed with 'failed to precompute mhap indices. Made 2 attempts, jobs still failed'.

The command line was this:

    /opt/canu/Linux-amd64/bin/canu -d /../250k_assembly -p N22_test250k genomeSize=380m corMinCoverage=2 \
      gridEngineMemoryOption="-l h_vmem=MEMORY" gridEngineThreadsOption="-pe smp THREADS" \
      gridOptions="-V -S /bin/bash" errorRate=0.18 -pacbio-raw input_subreads.fq

Thanks,

Dario

On 02/04/2016 07:31 AM, Sergey Koren wrote:

The E state jobs might just be your previous failed runs. I would erase them from the queue.


brianwalenz commented 8 years ago

The little bit of googling on 'BASH_FUNC_module' hints this is outside canu. See for example: https://groups.google.com/forum/#!topic/genome-au-cluster-help/J1fKmk8XB1Q

Are there interesting messages in the precompute logs ($asm/correction/1-overlapper/)?

At some point, remove the whole assembly directory and start over. There is probably lots of crud in there from all the restarts, and it'll be easier to figure out what's breaking without the junk.
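As a side note, the mechanism behind the BASH_FUNC_module noise can be reproduced without a grid at all. This sketch (the module body here is a stand-in, not your site's definition, which wraps /usr/bin/modulecmd) shows how an exported shell function travels in the environment that qsub -V forwards to the job:

```python
# Where the BASH_FUNC_module noise comes from: `qsub -V` forwards the submit
# host's environment, including exported shell functions such as
# environment-modules' `module`, and the receiving bash must re-parse the
# exported definition. The function body below is a stand-in.
import subprocess

out = subprocess.run(
    ["bash", "-c", "module() { :; }; export -f module; env"],
    capture_output=True, text=True,
).stdout

# Patched (post-Shellshock) bash stores exported functions in variables named
# BASH_FUNC_<name>; a bash that cannot parse the stored definition emits the
# "error importing function definition" messages seen in canu.01.out.
print("BASH_FUNC_module" in out)
```

If unsetting the function on the submit host (unset -f module) or dropping -V from gridOptions makes the messages disappear, the cluster's bash versions disagree on the exported-function format.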

dcopetti commented 8 years ago

I always remove the old folder when starting a new job. We are under bash:

    $ which bash
    /bin/bash
    $ bash -version
    GNU bash, version 4.1.2(1)-release (x86_64-redhat-linux-gnu)

In the 1-overlapper I found these:

    $ less 3.out
    bash: module: line 1: syntax error: unexpected end of file
    bash: error importing function definition for `BASH_FUNC_module'
    Dumping reads from 39001 to 58500 (inclusive).
    Starting mhap precompute.
    Error occurred during initialization of VM
    Could not reserve enough space for 13631488KB object heap
    mv: cannot stat `/home/smrtanalysis/dario_test/canu_test/250k_assembly/correction/1-overlapper/blocks/000003.dat': No such file or directory
    Mhap failed.  Dumping reads from 39001 to 58500 (inclusive).

We are working on the bash issue. Thanks,

Dario


skoren commented 8 years ago

The error indicates your JVM failed to initialize while allocating 13GB of RAM (13631488KB). Most likely either more than one job is being scheduled on each of your machines (which shouldn't happen, since the qsub command asks for 12 cores and 13GB of RAM; that would mean the smp parameter is being ignored), or other processes on the machine are taking up the available memory. This could be an issue with h_vmem, which is usually not consumable, meaning SGE schedules based on current memory usage rather than the requested peak. I mentioned this issue in a comment above: if a job requesting 30GB with h_vmem starts, and just after it starts Canu submits a job requesting 13GB, the second job can get scheduled on the same machine, because the 30GB is not reserved and the first process hasn't had time to reach its full allocation. Then, as both run, the JVM tries to lock 13GB that is no longer free because the other process is up to 30GB.

You'll have to diagnose the state of the machine just before the JVM error to see what is taking memory and why the JVM can't run. This is an issue outside of Canu's control.
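A toy model of that race (pure illustration; the node size and job sizes are taken from this thread, and the scheduler logic is simplified to a single comparison):

```python
# Toy model of the oversubscription race described above; illustration only,
# real SGE consumable accounting is more involved. Numbers mirror the thread:
# a 32 GB node, a job that grows to 30 GB, and a 13 GB JVM job.
NODE_RAM_GB = 32

# The log's failed reservation is exactly 13 GiB:
assert 13631488 // (1024 * 1024) == 13  # "13631488KB object heap"

def schedules(current_usage_gb, request_gb, reserved_gb=0):
    """Admit a job if the memory the scheduler tracks leaves room for it."""
    return current_usage_gb + reserved_gb + request_gb <= NODE_RAM_GB

# Non-consumable h_vmem: only *current* usage is checked. The 30 GB job has
# just started and uses 1 GB so far, so the 13 GB JVM job is admitted...
assert schedules(current_usage_gb=1, request_gb=13)
# ...then both grow: 30 + 13 = 43 GB on a 32 GB node, and the JVM fails.
assert 30 + 13 > NODE_RAM_GB

# Consumable resource: the 30 GB *request* is reserved up front, so the
# 13 GB job is rejected here and scheduled on another node instead.
assert not schedules(current_usage_gb=1, request_gb=13, reserved_gb=30)
print("race demonstrated")
```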

dcopetti commented 8 years ago

Thanks for the explanation, we will work on that now.

Dario


skoren commented 8 years ago

Have you been able to resolve this issue? I'm closing for inactivity but if you need to, feel free to re-open.

e-sevin commented 8 years ago

Hello, not sure this is the right place, but it definitely fits this issue's topic. I'm trying to run Canu (v1.1) on SGE, which I do not know well, but it fails ('can't configure for SGE') with:

-- WARNING:  Couldn't determine the SGE resource to request memory.
-- WARNING:  No valid choices found!  Find an appropriate complex name (qconf -sc) and set:
-- WARNING:    gridEngineMemoryOption="-l <name>=MEMORY"

However, none of the options returned by qconf -sc | grep memory are consumable... Does this mean I need to contact the admin and ask them to change the cluster config? Or is there a workaround within Canu?

Thanks
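A quick way to check is to look through the qconf -sc table for MEMORY-typed complexes whose consumable flag is set. Here is a sketch against sample output (the column layout below is typical SGE, but verify it against your cluster's actual table):

```python
# Rough sketch: find consumable memory complexes in `qconf -sc` output.
# The sample text is a typical SGE layout (name shortcut type relop
# requestable consumable default urgency); your table may differ.
SAMPLE_QCONF_SC = """\
#name          shortcut  type    relop requestable consumable default urgency
h_vmem         h_vmem    MEMORY  <=    YES         NO         0       0
mem_free       mf        MEMORY  <=    YES         YES        0       0
s_vmem         s_vmem    MEMORY  <=    YES         NO         0       0
"""

def consumable_memory_complexes(qconf_sc_text):
    names = []
    for line in qconf_sc_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split()
        # type is column 3; the consumable flag (YES/NO/JOB) is column 6
        if fields[2] == "MEMORY" and fields[5].upper() in ("YES", "JOB"):
            names.append(fields[0])
    return names

print(consumable_memory_complexes(SAMPLE_QCONF_SC))  # ['mem_free']
```

On a live cluster you would feed this the real qconf -sc output; if the list comes back empty, an admin would need to mark a memory complex consumable (or add one), as discussed earlier in this thread.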

jsmedmar commented 4 years ago

See https://stackoverflow.com/questions/18708085/sge-h-vmem-vs-java-xmx-xms
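The gist of that thread: h_vmem limits the whole process, while -Xmx caps only the Java heap, so the grid request needs headroom for the JVM's non-heap memory (metaspace, thread stacks, code cache). A hedged sketch of the arithmetic; the 20% overhead figure is an assumption for illustration, not a number from the linked answer:

```python
# Rough headroom calculation for submitting a JVM job under an h_vmem limit.
# -Xmx caps only the heap; non-heap JVM memory lives outside it, so h_vmem
# must exceed -Xmx or the VM is killed (or fails with
# "Could not reserve enough space ... object heap").
def h_vmem_for_heap(xmx_gb, overhead_fraction=0.2):
    """Suggested h_vmem request for a given -Xmx (overhead is an assumption)."""
    return xmx_gb * (1 + overhead_fraction)

xmx = 13  # the 13 GB heap from the mhap precompute logs above
print(f"-Xmx{xmx}g  ->  request -l h_vmem={h_vmem_for_heap(xmx):.1f}G")
```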