Closed: gmoneyomics closed this issue 4 years ago
I think this is an issue with your grid configuration. The resource option you supplied only includes memory, not a parallel thread environment. Normally mem_free reserves memory per core, not per job, so Canu scales the total request (22 GB in this case) by dividing it by the number of cores (12). I expect each of your jobs is requesting only 22/12 = 1 GB of memory, exceeding that, and getting killed. qacct -j 72
should show more info on the resources the job requested and used.
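The per-core scaling described above is plain integer division; a minimal sketch with the numbers from this thread (22 GB total request, 12 cores — both taken from the comment above):

```shell
# With a per-core mem_free, the scheduler treats the request as memory
# per slot, so the effective per-core reservation becomes:
total_gb=22
cores=12
per_core=$(( total_gb / cores ))   # integer division: 1 GB per core
echo "${per_core}G per core"       # prints "1G per core"
```

Any job whose real per-core footprint exceeds that 1 GB would then be a candidate for the scheduler (or kernel OOM killer) to kill.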
Modify your gridEngineResourceOption to include "-pe <parallel env> THREADS". You can use qconf -spl to find a parallel environment (it has to be set to allocate by pe_slots, not round-robin). Depending on how your JVM is set up, it may also over-reserve memory for system overhead, so you may also need to add gridOptionscormhap="-l mem_free=30g".
#!/bin/bash
~/canu-1.9/Linux-amd64/bin/canu \
-p a_colubris_canu -d /shared/hummingbird \
java=/shared/jdk1.8.0_241/bin/java \
genomeSize=1g \
-nanopore-raw /shared/hbird_all.fastq \
gridEngineResourceOption="-l mem_free=MEMORY -pe THREADS" \
gridOptionscormhap="-l mem_free=30g" \
minReadLength=5000
Do I specify number of threads?
cd /shared/hummingbird
qsub \
-l mem_free=4g \
-pe 1 \
-cwd \
-N 'canu_a_colubris_canu' \
-j y \
-o canu-scripts/canu.03.out canu-scripts/canu.03.sh
qsub: Numerical value invalid!
The initial portion of string "cwd" contains no decimal number
-- Finished on Thu Feb 6 15:00:59 2020 (in the blink of an eye) with 7103.832 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 7. (rc=1792)
ERROR:
-- Failed to submit Canu executive. Delay 10 seconds and try again.
----------------------------------------
-- Starting command on Thu Feb 6 15:01:09 2020 with 7103.832 GB free disk space
cd /shared/hummingbird
qsub \
-l mem_free=4g \
-pe 1 \
-cwd \
-N 'canu_a_colubris_canu' \
-j y \
-o canu-scripts/canu.03.out canu-scripts/canu.03.sh
qsub: Numerical value invalid!
The initial portion of string "cwd" contains no decimal number
-- Finished on Thu Feb 6 15:01:09 2020 (in the blink of an eye) with 7103.832 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 7. (rc=1792)
ERROR:
-- Failed to submit Canu executive. Giving up after two tries.
GitHub messed up my original response: the -pe option requires a parallel environment name, which varies from system to system. You can check with the qconf command I posted originally. Then you need to use "-pe <whatever name you found, maybe smp> THREADS":
~/canu-1.9/Linux-amd64/bin/canu \
-p a_colubris_canu -d /shared/hummingbird \
java=/shared/jdk1.8.0_241/bin/java \
genomeSize=1g \
-nanopore-raw /shared/hbird_all.fastq \
gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" \
gridOptionscormhap="-l mem_free=30g" \
minReadLength=5000
The pe names were smp, mpi, and make. mpi and make gave qsub errors, but smp was submitted. Now, however, canu cannot find the data.
None of the added options would change what reads Canu can find. Did you remove any output files from the previous runs? I'd suggest removing the current run and starting from scratch with the updated command; you weren't very far into the assembly anyway.
Thank you! It seems to be working now, and thanks for getting back to me so quickly. This isn't canu-related, but the nodes seem to be going unresponsive:
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@ip-10-0-16-109.ec2.inter BIP 0/1/36 0.21 lx-amd64
94 0.55500 canu_a_col ubuntu r 02/06/2020 15:35:51 1
---------------------------------------------------------------------------------
all.q@ip-10-0-16-156.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-16-4.ec2.interna BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-17-205.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-109.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-149.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-60.ec2.intern BIP 0/0/36 0.00 lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-0-20-144.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-20-207.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-21-98.ec2.intern BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-22-242.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-22-43.ec2.intern BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-25-221.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-25-82.ec2.intern BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-26-174.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-28-212.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-103.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-182.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-29-189.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-30-155.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-31-142.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-213.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-222.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-34.ec2.intern BIP 0/0/36 0.00 lx-amd64
---------------------------------------------------------------------------------
all.q@ip-10-0-31-80.ec2.intern BIP 0/0/1 -NA- lx-amd64 auo
There are supposed to be 10 nodes; is there a way to get them back?
Not sure. If they're unresponsive, they might be overloaded by jobs, or the node running the main SGE manager is overloaded. You can try the ganglia report to see what memory etc. the nodes are using.
I didn't configure the instance with ganglia, so I'm not sure I can add it now. I did qdel -u
to clear all the jobs it had running, but I have attempted to run canu many times on this instance, so it is possible that it's overloaded.
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@ip-10-0-16-109.ec2.inter BIP 0/36/36 12.29 lx-amd64
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 3
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 6
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 9
---------------------------------------------------------------------------------
all.q@ip-10-0-16-156.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-16-4.ec2.interna BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-17-205.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-109.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-149.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-60.ec2.intern BIP 0/36/36 11.70 lx-amd64
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 1
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 4
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 7
---------------------------------------------------------------------------------
all.q@ip-10-0-20-144.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-20-207.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-21-98.ec2.intern BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-22-242.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-22-43.ec2.intern BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-25-221.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-25-82.ec2.intern BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-26-174.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-28-212.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-103.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-182.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-29-189.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-30-155.ec2.inter BIP 0/0/36 -NA- lx-amd64 au
---------------------------------------------------------------------------------
all.q@ip-10-0-31-142.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-213.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-222.ec2.inter BIP 0/0/1 -NA- lx-amd64 auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-34.ec2.intern BIP 0/36/36 11.07 lx-amd64
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 2
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 5
99 0.55500 cormhap_a_ ubuntu r 02/06/2020 16:30:51 12 8
---------------------------------------------------------------------------------
all.q@ip-10-0-31-80.ec2.intern BIP 0/0/1 -NA- lx-amd64 auo
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
99 0.55500 cormhap_a_ ubuntu qw 02/06/2020 16:30:40 12 10-120:1
100 0.00000 canu_a_col ubuntu hqw 02/06/2020 16:30:40 1
It does look like canu is working now; it's just that not all the nodes are reachable. Thank you for all of your help!
Hi. I started a new cluster and started canu over again using the script we talked about before:
~/canu-1.9/Linux-amd64/bin/canu \
-p a_colubris_canu -d /shared/hummingbird \
java=/home/ubuntu/jdk-11.0.5/bin/java \
genomeSize=1g \
-nanopore-raw /shared/hbird_gDNA_all.fastq \
gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" \
gridOptionscormhap="-l mem_free=30g" \
minReadLength=2000
Everything looked great at first: it was allocating enough memory to all the jobs and successfully finished the cormhap precompute. But when it got to cormhap, some of the nodes are dying again, this time with unfinished jobs on them. These still have .WORKING files, and the jobs seem to be stalled. Will canu automatically detect that these are unfinished and resubmit them, or should I do it manually?
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
all.q@ip-10-0-17-232.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 00:52:15 12 37
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 00:52:15 12 38
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 02:58:30 12 65
---------------------------------------------------------------------------------
all.q@ip-10-0-18-216.ec2.inter BIP 0/36/36 5.53 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:31:00 12 153
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:37:00 12 155
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:00:30 12 166
---------------------------------------------------------------------------------
all.q@ip-10-0-18-4.ec2.interna BIP 0/36/36 5.47 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:49:15 12 160
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:49:15 12 161
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:49:15 12 162
---------------------------------------------------------------------------------
all.q@ip-10-0-19-120.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 02:01:30 12 58
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 02:01:45 12 59
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 03:58:15 12 71
---------------------------------------------------------------------------------
all.q@ip-10-0-19-14.ec2.intern BIP 0/36/36 6.63 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:20:45 12 148
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:36:45 12 154
9 0.55500 cormhap_a_ ubuntu t 02/07/2020 14:53:45 12 173
---------------------------------------------------------------------------------
all.q@ip-10-0-19-81.ec2.intern BIP 0/36/36 5.19 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 11:07:15 12 139
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 13:03:45 12 163
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 13:40:00 12 165
---------------------------------------------------------------------------------
all.q@ip-10-0-20-148.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 02:18:45 12 61
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 02:19:00 12 62
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 03:36:30 12 69
---------------------------------------------------------------------------------
all.q@ip-10-0-21-141.ec2.inter BIP 0/36/36 5.96 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 11:16:00 12 141
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:48:00 12 159
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:01:15 12 167
---------------------------------------------------------------------------------
all.q@ip-10-0-21-9.ec2.interna BIP 0/36/36 5.11 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 11:21:00 12 142
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:29:15 12 150
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:39:45 12 156
---------------------------------------------------------------------------------
all.q@ip-10-0-22-136.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 05:17:00 12 89
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 05:33:30 12 93
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 07:35:00 12 110
---------------------------------------------------------------------------------
all.q@ip-10-0-22-196.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 00:56:00 12 41
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 01:05:00 12 45
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 01:12:45 12 51
---------------------------------------------------------------------------------
all.q@ip-10-0-23-219.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 08:20:30 12 115
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 08:40:00 12 122
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 09:23:00 12 132
---------------------------------------------------------------------------------
all.q@ip-10-0-26-117.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 04:14:30 12 73
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 04:14:45 12 74
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 05:56:30 12 97
---------------------------------------------------------------------------------
all.q@ip-10-0-26-24.ec2.intern BIP 0/36/36 6.89 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 13:29:30 12 164
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:19:00 12 168
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:28:45 12 170
---------------------------------------------------------------------------------
all.q@ip-10-0-27-206.ec2.inter BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 01:16:30 12 53
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 01:21:30 12 54
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 04:30:30 12 84
---------------------------------------------------------------------------------
all.q@ip-10-0-27-97.ec2.intern BIP 0/36/36 6.33 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:19:15 12 147
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:44:45 12 157
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:46:30 12 158
---------------------------------------------------------------------------------
all.q@ip-10-0-29-12.ec2.intern BIP 0/36/36 4.54 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:30:15 12 151
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:41:30 12 171
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:41:45 12 172
---------------------------------------------------------------------------------
all.q@ip-10-0-30-52.ec2.intern BIP 0/36/1 -NA- lx-amd64 auo
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 00:37:15 12 33
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 00:37:30 12 34
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 01:24:45 12 57
---------------------------------------------------------------------------------
all.q@ip-10-0-31-140.ec2.inter BIP 0/36/36 6.58 lx-amd64
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 11:44:00 12 144
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 12:15:15 12 145
9 0.55500 cormhap_a_ ubuntu r 02/07/2020 14:22:30 12 169
############################################################################
- PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
9 0.55500 cormhap_a_ ubuntu qw 02/06/2020 23:50:16 12 174-297:1
10 0.00000 canu_a_col ubuntu hqw 02/06/2020 23:50:16 1
Canu will detect and resume unfinished jobs; you should try to track down why the nodes are dying.
I don't think mem_free is working correctly on your system. I see 3 jobs per node, and your nodes have 36 cores and 68 GB of RAM. Each cormhap job is requesting 12 cores and 30 GB of RAM per core (due to the gridOptionscormhap="-l mem_free=30g" option), so technically none of them should fit on a node. Even if memory is not requested per core on your system, only two should fit per node (68/30 = 2). I would guess your mem_free option isn't a consumable resource, so the memory is checked at the start of the run but not actually reserved for the process, and the nodes are effectively over-subscribed. You should check the memory settings with qconf -sc
and look for a memory option that is consumable:
#name shortcut type relop requestable consumable default urgency
#----------------------------------------------------------------------------------------
h_vmem h_vmem MEMORY <= YES NO 0 0
mem_free mf MEMORY <= YES YES 0 0
For example, given the above, you would want to use mem_free, not h_vmem.
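If no memory complex is consumable, the definitions can be edited; a minimal sketch, assuming admin access on the SGE master (the hostname below is a placeholder taken from this cluster's qstat output, and the 68G figure is the node RAM mentioned in this thread):

```shell
# Open the complex (resource) definitions in $EDITOR and flip the
# "consumable" column for mem_free from NO to YES:
#   mem_free   mf   MEMORY   <=   YES   YES   0   0
qconf -mc

# Then set how much memory each exec host actually offers, so there is
# something for the scheduler to consume (repeat per host):
qconf -rattr exechost complex_values mem_free=68G ip-10-0-18-60
```

With that in place, "-l mem_free=..." requests are decremented from each host's pool and jobs queue instead of over-subscribing a node.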
So in this case I would change the command to -l slots=30g instead of mem_free?
#name shortcut type relop requestable consumable default urgency
#--------------------------------------------------------------------------------------
arch a STRING == YES NO NONE 0
calendar c STRING == YES NO NONE 0
cpu cpu DOUBLE >= YES NO 0 0
display_win_gui dwg BOOL == YES NO 0 0
h_core h_core MEMORY <= YES NO 0 0
h_cpu h_cpu TIME <= YES NO 0:0:0 0
h_data h_data MEMORY <= YES NO 0 0
h_fsize h_fsize MEMORY <= YES NO 0 0
h_rss h_rss MEMORY <= YES NO 0 0
h_rt h_rt TIME <= YES NO 0:0:0 0
h_stack h_stack MEMORY <= YES NO 0 0
h_vmem h_vmem MEMORY <= YES NO 0 0
hostname h HOST == YES NO NONE 0
load_avg la DOUBLE >= NO NO 0 0
load_long ll DOUBLE >= NO NO 0 0
load_medium lm DOUBLE >= NO NO 0 0
load_short ls DOUBLE >= NO NO 0 0
m_core core INT <= YES NO 0 0
m_socket socket INT <= YES NO 0 0
m_thread thread INT <= YES NO 0 0
m_topology topo STRING == YES NO NONE 0
m_topology_inuse utopo STRING == YES NO NONE 0
mem_free mf MEMORY <= YES NO 0 0
mem_total mt MEMORY <= YES NO 0 0
mem_used mu MEMORY >= YES NO 0 0
min_cpu_interval mci TIME <= NO NO 0:0:0 0
np_load_avg nla DOUBLE >= NO NO 0 0
np_load_long nll DOUBLE >= NO NO 0 0
np_load_medium nlm DOUBLE >= NO NO 0 0
np_load_short nls DOUBLE >= NO NO 0 0
num_proc p INT == YES NO 0 0
qname q STRING == YES NO NONE 0
rerun re BOOL == NO NO 0 0
s_core s_core MEMORY <= YES NO 0 0
s_cpu s_cpu TIME <= YES NO 0:0:0 0
s_data s_data MEMORY <= YES NO 0 0
s_fsize s_fsize MEMORY <= YES NO 0 0
s_rss s_rss MEMORY <= YES NO 0 0
s_rt s_rt TIME <= YES NO 0:0:0 0
s_stack s_stack MEMORY <= YES NO 0 0
s_vmem s_vmem MEMORY <= YES NO 0 0
seq_no seq INT == NO NO 0 0
slots s INT <= YES YES 1 1000
swap_free sf MEMORY <= YES NO 0 0
swap_rate sr MEMORY >= YES NO 0 0
swap_rsvd srsv MEMORY >= YES NO 0 0
swap_total st MEMORY <= YES NO 0 0
swap_used su MEMORY >= YES NO 0 0
tmpdir tmp STRING == NO NO NONE 0
virtual_free vf MEMORY <= YES NO 0 0
virtual_total vt MEMORY <= YES NO 0 0
virtual_used vu MEMORY >= YES NO 0 0
# >#< starts a comment but comments are not saved across edits --------
No, slots is not a memory option. I would suggest editing the settings to make one of the memory options consumable. If not, you'd have to manage memory indirectly by managing the CPUs requested. For example, if you have 68 GB and 36 cores per node, you have about 1.9 GB/core, so for a job to effectively reserve 30 GB it would need to request 16 cores. You'd have to do the math manually based on the initial Canu config it prints and update the thread count used by each step.
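The manual math above can be sketched in a line of shell; the numbers (36 cores, 68 GB per node, 30 GB wanted per job) come from this thread, and the rounding up is the usual ceiling trick:

```shell
node_mem_gb=68
node_cores=36
want_gb=30
# GB available per core, then cores needed so that cores * GB/core >= want_gb.
# awk handles the floating point; adding 0.999 before int() rounds up.
awk -v m="$node_mem_gb" -v c="$node_cores" -v w="$want_gb" \
  'BEGIN { per = m / c; printf "%.1f GB/core -> request %d cores\n", per, int(w / per + 0.999) }'
```

For these values it reports roughly 1.9 GB/core and 16 cores, matching the estimate above; the same arithmetic applies to any other step's memory target.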
Idle; machines in the cluster seem to have been overloaded because resources weren't being properly reserved.
Hi, I am assembling ~70X-coverage nanopore data with canu 1.9. I used the canu command.
I've set up a parallel cluster on AWS with 10 compute nodes, each with 36 cores and 70 GB of RAM, with an SGE scheduler. The test E. coli data assembled fine, but my actual data gets stuck at cormhap. The jobs are submitted and continue to say running, but when I ssh into the compute nodes nothing is running. I've restarted multiple times by deleting
correction/1-overlapper/results/*mhap*
but it doesn't seem to change anything. The mhap.out files don't have any errors, but they say
killed
at the bottom (mhap.1.txt). Here are all the jobs that still say they are running,
but none of them are actually running on any of the compute nodes.
Here is the canu.out