marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Jobs stuck at mhap #1617

Closed: gmoneyomics closed this issue 4 years ago

gmoneyomics commented 4 years ago

Hi, I am assembling from ~70X coverage Nanopore data with canu 1.9. I used this canu command:

~/canu-1.9/Linux-amd64/bin/canu \
 -p a_colubris_canu -d /shared/hummingbird \
 java=/shared/jdk1.8.0_241/bin/java \
 genomeSize=1g \
 -nanopore-raw /shared/hbird_all.fastq \
  gridEngineResourceOption="-l mem_free=MEMORY" \
  minReadLength=5000

I've set up a parallel cluster on AWS with an SGE scheduler and 10 compute nodes, each with 36 cores and 70 GB of RAM. The test E. coli data assembled fine, but my actual data gets stuck at cormhap. The jobs are submitted and continue to report as running, but when I ssh into the compute nodes nothing is running. I've restarted multiple times by deleting correction/1-overlapper/results/*mhap*, but it doesn't seem to change anything.

The mhap.out files don't contain any errors, but they say "Killed" at the bottom: mhap.1.txt

Here are all the jobs that still say they are running:

job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 2
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 5
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 6
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 8
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 12
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 13
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 16
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 18
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 19
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 20
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 22
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 23
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 26
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 27
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 33
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 34
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 36
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 40
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 41
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 47
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 48
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 51
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 54
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 55
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 58
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 60
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 61
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 62
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 64
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 65
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 68
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 69
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 71
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 72
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 74
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 75
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 76
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 78
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 82
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 83
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 86
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 88
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 89
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 90
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 92
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 94
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 96
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 97
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 99
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 102
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 103
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 104
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 106
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 108
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 110
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 111
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 117
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 118
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 124
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 125
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 127
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 131
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 132
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 135
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 138
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 139
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 145
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 146
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 149
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 152
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 153
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 159
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 160
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 166
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 167
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 170
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 173
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 174
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 180
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 181
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 184
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 186
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 187
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 188
     72 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 191
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 195
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 196
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 199
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 202
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 203
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 205
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 209
     73 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 210
     75 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-16-4.ec2.interna     1 215
     75 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 217
     75 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 218
     75 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 219
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 223
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 224
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 226
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 227
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-31-222.ec2.inter     1 228
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-29-182.ec2.inter     1 229
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-30-155.ec2.inter     1 232
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 233
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-22-43.ec2.intern     1 234
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-31-222.ec2.inter     1 235
     76 0.55500 cormhap_a_ ubuntu       r     02/06/2020 13:43:21 all.q@ip-10-0-25-82.ec2.intern     1 240
     77 0.00000 canu_a_col ubuntu       hqw   02/06/2020 13:43:08   

But none of them are actually running on any of the compute nodes:

HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
ip-10-0-16-109          lx-amd64       36    1   18   36  0.00   68.7G  568.9M     0.0     0.0
ip-10-0-16-156          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-16-4            lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-17-205          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-18-109          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-18-149          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-18-60           lx-amd64       36    1   18   36  0.02   68.7G  565.5M     0.0     0.0
ip-10-0-20-144          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-20-207          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-21-98           lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-22-242          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-22-43           lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-25-221          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-25-82           lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-26-174          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-28-212          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-29-103          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-29-182          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-29-189          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-30-155          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-31-142          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-31-213          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-31-222          lx-amd64       36    1   18   36     -   68.7G       -     0.0       -
ip-10-0-31-34           lx-amd64       36    1   18   36  2.65   68.7G  662.8M     0.0     0.0
ip-10-0-31-80           lx-amd64       36    1   18   36     -   68.7G       -     0.0       -

Here is the canu.out:


--
-- Detected Java(TM) Runtime Environment '1.8.0_241' (from '/shared/jdk1.8.0_241/bin/java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 36 CPUs and 69 gigabytes of memory.
-- Detected Sun Grid Engine in '/opt/sge/default'.
-- No Sun Grid Engine parallel environment detected in gridEngineResourceOption.
-- User supplied Memory Resource      'mem_free'.
--
-- Found  23 hosts with  36 cores and   68 GB memory under Sun Grid Engine control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     34 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16 GB   36 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   22 GB   12 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16 GB   12 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB   12 CPUs  (overlap detection)
-- Grid:  cor       24 GB    4 CPUs  (read correction)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       17 GB    8 CPUs  (read error detection)
-- Grid:  oea        8 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       68 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns      --- GB    8 CPUs  (consensus)
-- Grid:  gfa       32 GB   16 CPUs  (GFA alignment and processing)
-- In 'a_colubris_canu.seqStore', found Nanopore reads:
--   Raw:        5729276
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'a_colubris_canu' in '/shared/hummingbird'
--
-- Parameters:
--
--  genomeSize        1000000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
-- No change in report.
--
-- Running jobs.  First attempt out of 2.
--
-- 'mhap.jobSubmit-01.sh' -> job 72 tasks 1-192.
-- 'mhap.jobSubmit-02.sh' -> job 73 tasks 194-210.
-- 'mhap.jobSubmit-03.sh' -> job 74 tasks 212-213.
-- 'mhap.jobSubmit-04.sh' -> job 75 tasks 215-219.
-- 'mhap.jobSubmit-05.sh' -> job 76 tasks 221-240.
--
----------------------------------------
-- Starting command on Thu Feb  6 13:43:08 2020 with 7139.092 GB free disk space

    cd /shared/hummingbird
    qsub \
      -hold_jid 72,73,74,75,76 \
      -l mem_free=4g   \
      -cwd \
      -N 'canu_a_colubris_canu' \
      -j y \
      -o canu-scripts/canu.10.out  canu-scripts/canu.10.sh
Your job 77 ("canu_a_colubris_canu") has been submitted

-- Finished on Thu Feb  6 13:43:08 2020 (like a bat out of hell) with 7139.092 GB free disk space
----------------------------------------

skoren commented 4 years ago

I think this is an issue with your grid configuration. The resource option you supplied only includes memory, not the parallel thread environment. Normally mem_free reserves memory per core, not per job, so Canu scales the total request (22 GB in this case) down by dividing it by the number of cores (12). I expect each of your jobs is requesting only 22/12 ≈ 1.8 GB of memory, exceeding that, and getting killed. qacct -j 72 should show more info on the resources the jobs requested and used.
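
As a quick check, qacct can confirm this (a sketch; accounting fields vary a bit by SGE version):

qacct -j 72 | grep -E 'taskid|slots|maxvmem|failed|exit_status'
# maxvmem near the ~2 GB effective request plus a nonzero failed/exit_status
# would confirm the jobs are being killed for exceeding their reservation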

Modify your gridEngineResourceOption to include "-pe <parallel env> THREADS"; you can use qconf -spl to find a parallel environment (it has to be set to allocate by pe_slots, not round-robin). Depending on how your JVM is set up, it may also over-reserve memory for system overheads, so you may also need to add gridOptionscormhap="-l mem_free=30g".
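
For example, to find and verify a parallel environment (the name smp below is only a guess; use whatever qconf -spl reports on your cluster):

qconf -spl        # list the parallel environment names available
qconf -sp smp     # inspect one; its allocation_rule should be $pe_slots
                  # (all slots on one host), not $round_robin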

gmoneyomics commented 4 years ago
#!/bin/bash 
~/canu-1.9/Linux-amd64/bin/canu \
 -p a_colubris_canu -d /shared/hummingbird \
 java=/shared/jdk1.8.0_241/bin/java \
 genomeSize=1g \
 -nanopore-raw /shared/hbird_all.fastq \
  gridEngineResourceOption="-l mem_free=MEMORY -pe THREADS" \
  gridOptionscormhap="-l mem_free=30g" \
  minReadLength=5000

Do I need to specify the number of threads? With this command, the submission fails:

    cd /shared/hummingbird
    qsub \
      -l mem_free=4g \
      -pe 1   \
      -cwd \
      -N 'canu_a_colubris_canu' \
      -j y \
      -o canu-scripts/canu.03.out  canu-scripts/canu.03.sh
qsub: Numerical value invalid!
The initial portion of string "cwd" contains no decimal number

-- Finished on Thu Feb  6 15:00:59 2020 (in the blink of an eye) with 7103.832 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 7.  (rc=1792)
ERROR:
-- Failed to submit Canu executive.  Delay 10 seconds and try again.
----------------------------------------
-- Starting command on Thu Feb  6 15:01:09 2020 with 7103.832 GB free disk space

    cd /shared/hummingbird
    qsub \
      -l mem_free=4g \
      -pe 1   \
      -cwd \
      -N 'canu_a_colubris_canu' \
      -j y \
      -o canu-scripts/canu.03.out  canu-scripts/canu.03.sh
qsub: Numerical value invalid!
The initial portion of string "cwd" contains no decimal number

-- Finished on Thu Feb  6 15:01:09 2020 (in the blink of an eye) with 7103.832 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 7.  (rc=1792)
ERROR:
-- Failed to submit Canu executive.  Giving up after two tries.

skoren commented 4 years ago

GitHub mangled my original response: the -pe option requires a parallel environment name, which varies from system to system; you can find it with the qconf command I posted originally. Then you need to use "-pe <whatever name you found, maybe smp> THREADS".

gmoneyomics commented 4 years ago
~/canu-1.9/Linux-amd64/bin/canu \
 -p a_colubris_canu -d /shared/hummingbird \
 java=/shared/jdk1.8.0_241/bin/java \
 genomeSize=1g \
 -nanopore-raw /shared/hbird_all.fastq \
  gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" \
  gridOptionscormhap="-l mem_free=30g" \
  minReadLength=5000

The PE names were smp, mpi, and make. mpi and make gave qsub errors, but smp submitted fine. Now, however, Canu cannot find the data:

canu.txt

skoren commented 4 years ago

None of the added options would change what reads Canu can find. Did you remove any output files from the previous runs? I'd suggest just deleting the current run and starting from scratch with the updated command; you weren't very far into the assembly anyway.
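
For example, a minimal from-scratch restart (same run directory as the commands above):

rm -rf /shared/hummingbird    # discard the partial run; canu recreates it
# then rerun the exact canu command above, keeping the corrected
# gridEngineResourceOption (with a named PE) and gridOptionscormhap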

gmoneyomics commented 4 years ago

Thank you! It seems to be working now, and thanks for getting back to me so quickly! This isn't Canu-related, but the nodes seem to be going unresponsive:

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-0-16-109.ec2.inter BIP   0/1/36         0.21     lx-amd64      
     94 0.55500 canu_a_col ubuntu       r     02/06/2020 15:35:51     1        
---------------------------------------------------------------------------------
all.q@ip-10-0-16-156.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-16-4.ec2.interna BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-17-205.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-109.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-149.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-60.ec2.intern BIP   0/0/36         0.00     lx-amd64      
---------------------------------------------------------------------------------
all.q@ip-10-0-20-144.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-20-207.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-21-98.ec2.intern BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-22-242.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-22-43.ec2.intern BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-25-221.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-25-82.ec2.intern BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-26-174.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-28-212.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-103.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-182.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-29-189.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-30-155.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-31-142.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-213.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-222.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-34.ec2.intern BIP   0/0/36         0.00     lx-amd64      
---------------------------------------------------------------------------------
all.q@ip-10-0-31-80.ec2.intern BIP   0/0/1          -NA-     lx-amd64      auo

There are supposed to be 10; is there a way to get them back?

skoren commented 4 years ago

Not sure; if they're unresponsive, they might be overloaded by jobs, or the node running the main SGE manager might be overloaded. You can try the Ganglia report to see the memory and other resource usage on the nodes.

gmoneyomics commented 4 years ago

I didn't configure the instance with Ganglia, so I'm not sure I can add it now. I did qdel -u to clear all the jobs that were running, but I have attempted to run Canu many times on this instance now, so it is possible that it's overloaded.

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-0-16-109.ec2.inter BIP   0/36/36        12.29    lx-amd64      
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 3
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 6
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 9
---------------------------------------------------------------------------------
all.q@ip-10-0-16-156.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-16-4.ec2.interna BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-17-205.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-109.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-149.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-18-60.ec2.intern BIP   0/36/36        11.70    lx-amd64      
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 1
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 4
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 7
---------------------------------------------------------------------------------
all.q@ip-10-0-20-144.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-20-207.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-21-98.ec2.intern BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-22-242.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-22-43.ec2.intern BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-25-221.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-25-82.ec2.intern BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-26-174.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-28-212.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-103.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-29-182.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-29-189.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-30-155.ec2.inter BIP   0/0/36         -NA-     lx-amd64      au
---------------------------------------------------------------------------------
all.q@ip-10-0-31-142.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-213.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-222.ec2.inter BIP   0/0/1          -NA-     lx-amd64      auo
---------------------------------------------------------------------------------
all.q@ip-10-0-31-34.ec2.intern BIP   0/36/36        11.07    lx-amd64      
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 2
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 5
     99 0.55500 cormhap_a_ ubuntu       r     02/06/2020 16:30:51    12 8
---------------------------------------------------------------------------------
all.q@ip-10-0-31-80.ec2.intern BIP   0/0/1          -NA-     lx-amd64      auo

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     99 0.55500 cormhap_a_ ubuntu       qw    02/06/2020 16:30:40    12 10-120:1
    100 0.00000 canu_a_col ubuntu       hqw   02/06/2020 16:30:40     1

It does look like Canu is working now; it's just that not all the nodes are reachable. But thank you for all of your help!

gmoneyomics commented 4 years ago

Hi. I started a new cluster and ran Canu from scratch using the script we talked about before:

~/canu-1.9/Linux-amd64/bin/canu \
 -p a_colubris_canu -d /shared/hummingbird \
 java=/home/ubuntu/jdk-11.0.5/bin/java \
 genomeSize=1g \
 -nanopore-raw /shared/hbird_gDNA_all.fastq \
  gridEngineResourceOption="-pe smp THREADS -l mem_free=MEMORY" \
  gridOptionscormhap="-l mem_free=30g" \
  minReadLength=2000

Everything looked great at first: it allocated enough memory to all the jobs and successfully finished the cormhap precompute. But when it got to cormhap itself, some of the nodes started dying again, this time with unfinished jobs on them. Those jobs still have .WORKING files and appear to be stalled. Will Canu automatically detect that these are unfinished and resubmit them, or should I do it manually?

queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@ip-10-0-17-232.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 00:52:15    12 37
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 00:52:15    12 38
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 02:58:30    12 65
---------------------------------------------------------------------------------
all.q@ip-10-0-18-216.ec2.inter BIP   0/36/36        5.53     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:31:00    12 153
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:37:00    12 155
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:00:30    12 166
---------------------------------------------------------------------------------
all.q@ip-10-0-18-4.ec2.interna BIP   0/36/36        5.47     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:49:15    12 160
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:49:15    12 161
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:49:15    12 162
---------------------------------------------------------------------------------
all.q@ip-10-0-19-120.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 02:01:30    12 58
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 02:01:45    12 59
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 03:58:15    12 71
---------------------------------------------------------------------------------
all.q@ip-10-0-19-14.ec2.intern BIP   0/36/36        6.63     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:20:45    12 148
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:36:45    12 154
      9 0.55500 cormhap_a_ ubuntu       t     02/07/2020 14:53:45    12 173
---------------------------------------------------------------------------------
all.q@ip-10-0-19-81.ec2.intern BIP   0/36/36        5.19     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 11:07:15    12 139
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 13:03:45    12 163
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 13:40:00    12 165
---------------------------------------------------------------------------------
all.q@ip-10-0-20-148.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 02:18:45    12 61
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 02:19:00    12 62
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 03:36:30    12 69
---------------------------------------------------------------------------------
all.q@ip-10-0-21-141.ec2.inter BIP   0/36/36        5.96     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 11:16:00    12 141
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:48:00    12 159
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:01:15    12 167
---------------------------------------------------------------------------------
all.q@ip-10-0-21-9.ec2.interna BIP   0/36/36        5.11     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 11:21:00    12 142
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:29:15    12 150
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:39:45    12 156
---------------------------------------------------------------------------------
all.q@ip-10-0-22-136.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 05:17:00    12 89
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 05:33:30    12 93
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 07:35:00    12 110
---------------------------------------------------------------------------------
all.q@ip-10-0-22-196.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 00:56:00    12 41
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 01:05:00    12 45
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 01:12:45    12 51
---------------------------------------------------------------------------------
all.q@ip-10-0-23-219.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 08:20:30    12 115
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 08:40:00    12 122
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 09:23:00    12 132
---------------------------------------------------------------------------------
all.q@ip-10-0-26-117.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 04:14:30    12 73
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 04:14:45    12 74
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 05:56:30    12 97
---------------------------------------------------------------------------------
all.q@ip-10-0-26-24.ec2.intern BIP   0/36/36        6.89     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 13:29:30    12 164
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:19:00    12 168
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:28:45    12 170
---------------------------------------------------------------------------------
all.q@ip-10-0-27-206.ec2.inter BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 01:16:30    12 53
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 01:21:30    12 54
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 04:30:30    12 84
---------------------------------------------------------------------------------
all.q@ip-10-0-27-97.ec2.intern BIP   0/36/36        6.33     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:19:15    12 147
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:44:45    12 157
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:46:30    12 158
---------------------------------------------------------------------------------
all.q@ip-10-0-29-12.ec2.intern BIP   0/36/36        4.54     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:30:15    12 151
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:41:30    12 171
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:41:45    12 172
---------------------------------------------------------------------------------
all.q@ip-10-0-30-52.ec2.intern BIP   0/36/1         -NA-     lx-amd64      auo
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 00:37:15    12 33
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 00:37:30    12 34
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 01:24:45    12 57
---------------------------------------------------------------------------------
all.q@ip-10-0-31-140.ec2.inter BIP   0/36/36        6.58     lx-amd64      
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 11:44:00    12 144
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 12:15:15    12 145
      9 0.55500 cormhap_a_ ubuntu       r     02/07/2020 14:22:30    12 169

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
      9 0.55500 cormhap_a_ ubuntu       qw    02/06/2020 23:50:16    12 174-297:1
     10 0.00000 canu_a_col ubuntu       hqw   02/06/2020 23:50:16     1  
skoren commented 4 years ago

Canu will detect and resume unfinished jobs; you should try to track down why the nodes are dying.
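
As a quick sanity check (paths follow the run directory used above), you can look for the partial outputs yourself:

ls /shared/hummingbird/correction/1-overlapper/results/*.WORKING
# any .WORKING files are incomplete results; canu treats those tasks as
# unfinished and resubmits them on its next attempt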

I don't think mem_free is working correctly on your system. I see 3 jobs per node, and your nodes have 36 cores and 68 GB of RAM. Each cormhap job is requesting 12 cores and 30 GB of RAM per core (due to the gridOptionscormhap="-l mem_free=30g" option), so technically none of them should fit on a node. Even if memory is not requested per core on your system, only two should fit per node (68/30 = 2). I would guess your mem_free option isn't a consumable resource, so memory is checked at the start of the run but never actually reserved for the process, and the nodes end up effectively over-subscribed. Check the memory settings with qconf -sc and look for a memory option that is consumable:

#name               shortcut   type        relop requestable consumable default  urgency 
#----------------------------------------------------------------------------------------
h_vmem              h_vmem     MEMORY      <=    YES         NO         0        0
mem_free            mf         MEMORY      <=    YES         YES        0        0

For example, given the output above, you would use mem_free, not h_vmem, because only mem_free is marked consumable.
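
A minimal sketch of the fix (requires SGE manager privileges; the host name and memory value are illustrative for this cluster's 68 GB nodes):

qconf -mc                   # edit the complex list: change the "consumable"
                            # column of mem_free from NO to YES
qconf -me ip-10-0-18-60     # then, on each exec host, add:
                            #   complex_values   mem_free=68G

Once mem_free is consumable and each host advertises its capacity, the scheduler subtracts every job's request from that pool instead of only checking load at submit time, so the nodes can no longer be over-packed.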

gmoneyomics commented 4 years ago

So in this case would I change the command to -l slots=30g instead of mem_free? This is my qconf -sc output:

#name               shortcut   type      relop requestable consumable default  urgency 
#--------------------------------------------------------------------------------------
arch                a          STRING    ==    YES         NO         NONE     0
calendar            c          STRING    ==    YES         NO         NONE     0
cpu                 cpu        DOUBLE    >=    YES         NO         0        0
display_win_gui     dwg        BOOL      ==    YES         NO         0        0
h_core              h_core     MEMORY    <=    YES         NO         0        0
h_cpu               h_cpu      TIME      <=    YES         NO         0:0:0    0
h_data              h_data     MEMORY    <=    YES         NO         0        0
h_fsize             h_fsize    MEMORY    <=    YES         NO         0        0
h_rss               h_rss      MEMORY    <=    YES         NO         0        0
h_rt                h_rt       TIME      <=    YES         NO         0:0:0    0
h_stack             h_stack    MEMORY    <=    YES         NO         0        0
h_vmem              h_vmem     MEMORY    <=    YES         NO         0        0
hostname            h          HOST      ==    YES         NO         NONE     0
load_avg            la         DOUBLE    >=    NO          NO         0        0
load_long           ll         DOUBLE    >=    NO          NO         0        0
load_medium         lm         DOUBLE    >=    NO          NO         0        0
load_short          ls         DOUBLE    >=    NO          NO         0        0
m_core              core       INT       <=    YES         NO         0        0
m_socket            socket     INT       <=    YES         NO         0        0
m_thread            thread     INT       <=    YES         NO         0        0
m_topology          topo       STRING    ==    YES         NO         NONE     0
m_topology_inuse    utopo      STRING    ==    YES         NO         NONE     0
mem_free            mf         MEMORY    <=    YES         NO         0        0
mem_total           mt         MEMORY    <=    YES         NO         0        0
mem_used            mu         MEMORY    >=    YES         NO         0        0
min_cpu_interval    mci        TIME      <=    NO          NO         0:0:0    0
np_load_avg         nla        DOUBLE    >=    NO          NO         0        0
np_load_long        nll        DOUBLE    >=    NO          NO         0        0
np_load_medium      nlm        DOUBLE    >=    NO          NO         0        0
np_load_short       nls        DOUBLE    >=    NO          NO         0        0
num_proc            p          INT       ==    YES         NO         0        0
qname               q          STRING    ==    YES         NO         NONE     0
rerun               re         BOOL      ==    NO          NO         0        0
s_core              s_core     MEMORY    <=    YES         NO         0        0
s_cpu               s_cpu      TIME      <=    YES         NO         0:0:0    0
s_data              s_data     MEMORY    <=    YES         NO         0        0
s_fsize             s_fsize    MEMORY    <=    YES         NO         0        0
s_rss               s_rss      MEMORY    <=    YES         NO         0        0
s_rt                s_rt       TIME      <=    YES         NO         0:0:0    0
s_stack             s_stack    MEMORY    <=    YES         NO         0        0
s_vmem              s_vmem     MEMORY    <=    YES         NO         0        0
seq_no              seq        INT       ==    NO          NO         0        0
slots               s          INT       <=    YES         YES        1        1000
swap_free           sf         MEMORY    <=    YES         NO         0        0
swap_rate           sr         MEMORY    >=    YES         NO         0        0
swap_rsvd           srsv       MEMORY    >=    YES         NO         0        0
swap_total          st         MEMORY    <=    YES         NO         0        0
swap_used           su         MEMORY    >=    YES         NO         0        0
tmpdir              tmp        STRING    ==    NO          NO         NONE     0
virtual_free        vf         MEMORY    <=    YES         NO         0        0
virtual_total       vt         MEMORY    <=    YES         NO         0        0
virtual_used        vu         MEMORY    >=    YES         NO         0        0
# >#< starts a comment but comments are not saved across edits --------
skoren commented 4 years ago

No, slots is not a memory option. I would suggest editing the settings to make one of the memory options consumable. Otherwise, you'd have to manage memory indirectly by managing the CPUs requested. For example, if you have 68 GB and 36 cores per node, that's about 1.9 GB per core, so for a job to reserve 30 GB you'd need it to request 16 cores. You'd have to do the math manually based on the initial configuration Canu prints and update the threads used by each step.
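
For reference, the ceiling division behind the 16-core figure (values from this cluster):

echo $(( (30 * 36 + 68 - 1) / 68 ))    # 30 GB wanted at 68 GB / 36 cores
                                       # per node, rounded up: prints 16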

skoren commented 4 years ago

Idle; the machines in the cluster appear to have been overloaded because resources weren't being properly reserved.