marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Exit code 7 on SGE? #1279

Closed GeoMicroSoares closed 5 years ago

GeoMicroSoares commented 5 years ago

Running ./canu/Linux-amd64/bin/canu -p metagenome -d /scratch/metagenome_folder genomeSize=1g -nanopore-raw metagenome_folder/metagenome.fastq, I'm getting an exit code 7 error as follows:


    cd /scratch/metagenome_folder
    qsub \
      -pe multithread 1 \
      -l =4g \
      -S /bin/sh  \
      -cwd \
      -N 'canu_metagenome' \
      -j y \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh
error: unknown resource ""
qsub: ERROR! Wrong resource list format "=4g" specified to -l option

-- Finished on Tue Mar 12 12:24:53 2019 (in the blink of an eye) with 1885.167 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 7.  (rc=1792)
ERROR:
-- Failed to submit Canu executive.  Delay 10 seconds and try again.
----------------------------------------
-- Starting command on Tue Mar 12 12:25:03 2019 with 1885.151 GB free disk space

    cd /scratch/metagenome_folder
    qsub \
      -pe multithread 1 \
      -l =4g \
      -S /bin/sh  \
      -cwd \
      -N 'canu_metagenome' \
      -j y \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh
error: unknown resource ""
qsub: ERROR! Wrong resource list format "=4g" specified to -l option

-- Finished on Tue Mar 12 12:25:03 2019 (fast as lightning) with 1885.151 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 7.  (rc=1792)
ERROR:

CRASH:
CRASH: Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)
CRASH: Please panic, this is abnormal.
ABORT:
CRASH:   Failed to submit Canu executive.
CRASH:
CRASH: Failed at /canu/Linux-amd64/bin/../lib/site_perl/canu/Execution.pm line 794
CRASH:  canu::Execution::submitScript('metagenome', undef) called at ./canu/Linux-amd64/bin/canu line 798
CRASH:
CRASH: No log file supplied.
CRASH:

Any suggestions?

Canu version:

Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Ran on a Scientific Linux SGE server:

$ lsb_release -a
LSB Version:    :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: Scientific
Description:    Scientific Linux release 6.8 (Carbon)
Release:        6.8
Codename:       Carbon
brianwalenz commented 5 years ago

That option is attempting to request memory from SGE; the option should be something like "-l memory=4g", but you're missing the name, somehow.

Near the top of the canu screen output (just after the CITATIONS) canu will report what it thinks your grid is using, something like:

-- Detected Sun Grid Engine in '/usr/local/sge/grid'.
-- Detected Sun Grid Engine parallel environment 'thread'.
-- Detected Sun Grid Engine memory resource 'memory'.

The 'memory' name is coming from the output of qconf -sc:

> qconf -sc | grep MEMORY
[...]
mem_used            mu         MEMORY    >=    YES         NO         0        0
memory              memory     MEMORY    <=    YES         YES        0        0
[...]

(the line with YES in the 6th column is used).
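
A quick way to list only the consumable MEMORY resources yourself (column positions assumed from the qconf header shown above, adjust if yours differs):

    qconf -sc | awk '$3 == "MEMORY" && $6 == "YES" { print $1 }'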

What does your canu and qconf -sc report?

GeoMicroSoares commented 5 years ago

Hi @brianwalenz,

Yep, things ran after checking that out and changing the command to ./canu/Linux-amd64/bin/canu -p metagenome -d /scratch/metagenome_folder genomeSize=1g -nanopore-raw metagenome_folder/metagenome.fastq java='/cm/shared/apps/java/jdk1.8.0_40/bin/java' gridEngineResourceOption="-pe multithread THREADS -l mem_free=MEMORY". The job gets submitted, but ca. 5 min later it crashes.

Checking it out I get a java-related error:


Complete documentation at http://canu.readthedocs.org/en/latest/

ERROR:  mhap overlapper requires java version at least 1.8.0; you have unknown (from '/cm/shared/apps/java/jdk1.8.0_40/bin/java').
ERROR:  '/cm/shared/apps/java/jdk1.8.0_40/bin/java -Xmx1g -showversion' reports:
ERROR:    'Error occurred during initialization of VM'
ERROR:    'Unable to allocate 32768KB bitmaps for parallel garbage collection for the requested 1048576KB heap.'
ERROR:    'Error: Could not create the Java Virtual Machine.'
ERROR:    'Error: A fatal exception has occurred. Program will exit.'

Testing the java version:

$ /cm/shared/apps/java/jdk1.8.0_40/bin/java -version
java version "1.8.0_40"
Java(TM) SE Runtime Environment (build 1.8.0_40-b26)
Java HotSpot(TM) 64-Bit Server VM (build 25.40-b25, mixed mode)

Not sure what's going on here, since this happens even after I point canu at java and have the correct version...
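
For reference, the exact check canu says it ran (command copied from the error message above) can be reproduced by hand, e.g.:

    /cm/shared/apps/java/jdk1.8.0_40/bin/java -Xmx1g -showversion 2>&1 | head -n 5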

GeoMicroSoares commented 5 years ago

Hi again,

Do let me know if you'd rather I open another issue here, but I managed to get past that hurdle and am now getting Mhap precompute jobs failing with ./canu/Linux-amd64/bin/canu -p metagenome -d /scratch/metagenome_folder genomeSize=0.5g -nanopore-raw metagenome_folder/metagenome.fastq java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" gridEngineResourceOption="-pe multithread THREADS -l h_vmem=MEMORY" merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 gridOptionsExecutive="-l h_vmem=100g" gridEngineArrayMaxJobs=5

Bottom of error log here:

$ tail /scratch/metagenome_folder/canu.out -n100
-- WARNING:  If jobs fail to start, modify the above option to use a valid shell
-- WARNING:  and supply it directly to canu.
-- WARNING:
-- 
-- Found   1 host  with  32 cores and  504 GB memory under Sun Grid Engine control.
-- Found   3 hosts with  64 cores and  252 GB memory under Sun Grid Engine control.
-- Found   1 host  with  16 cores and 1009 GB memory under Sun Grid Engine control.
-- Found   2 hosts with   8 cores and  189 GB memory under Sun Grid Engine control.
-- Found   1 host  with  64 cores and  236 GB memory under Sun Grid Engine control.
-- Found   1 host  with   8 cores and   23 GB memory under Sun Grid Engine control.
-- Found   3 hosts with  32 cores and   94 GB memory under Sun Grid Engine control.
-- Found   1 host  with  64 cores and  504 GB memory under Sun Grid Engine control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     10 GB    1 CPU   (k-mer counting)
-- Grid:  hap       12 GB    8 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   10 GB    1 CPU   (overlap detection with mhap)
-- Grid:  obtovl    16 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB    8 CPUs  (overlap detection)
-- Grid:  cor      --- GB    4 CPUs  (read correction)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12 GB    8 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns      --- GB    8 CPUs  (consensus)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'metagenome.seqStore', found Nanopore reads:
--   Raw:        459052
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'metagenome' in '/scratch/metagenome_folder'
--
-- Parameters:
--
--  genomeSize        500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.1200 ( 12.00%)
--    utgOvlErrorRate 0.1200 ( 12.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.1200 ( 12.00%)
--    utgErrorRate    0.1200 ( 12.00%)
--    cnsErrorRate    0.2000 ( 20.00%)
--
--
-- BEGIN CORRECTION
--
--
-- Mhap precompute jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/blocks/000001.dat FAILED.
--   job correction/1-overlapper/blocks/000002.dat FAILED.
--   job correction/1-overlapper/blocks/000003.dat FAILED.
--   job correction/1-overlapper/blocks/000004.dat FAILED.
--   job correction/1-overlapper/blocks/000005.dat FAILED.
--   job correction/1-overlapper/blocks/000006.dat FAILED.
--   job correction/1-overlapper/blocks/000007.dat FAILED.
--   job correction/1-overlapper/blocks/000008.dat FAILED.
--   job correction/1-overlapper/blocks/000009.dat FAILED.
--   job correction/1-overlapper/blocks/000010.dat FAILED.
--   job correction/1-overlapper/blocks/000011.dat FAILED.
--   job correction/1-overlapper/blocks/000012.dat FAILED.
--   job correction/1-overlapper/blocks/000013.dat FAILED.
--   job correction/1-overlapper/blocks/000014.dat FAILED.
--   job correction/1-overlapper/blocks/000015.dat FAILED.
--   job correction/1-overlapper/blocks/000016.dat FAILED.
--   job correction/1-overlapper/blocks/000017.dat FAILED.
--   job correction/1-overlapper/blocks/000018.dat FAILED.
--   job correction/1-overlapper/blocks/000019.dat FAILED.
--   job correction/1-overlapper/blocks/000020.dat FAILED.
--   job correction/1-overlapper/blocks/000021.dat FAILED.
--   job correction/1-overlapper/blocks/000022.dat FAILED.
--   job correction/1-overlapper/blocks/000023.dat FAILED.
--   job correction/1-overlapper/blocks/000024.dat FAILED.
--   job correction/1-overlapper/blocks/000025.dat FAILED.
--   job correction/1-overlapper/blocks/000026.dat FAILED.
--   job correction/1-overlapper/blocks/000027.dat FAILED.
--   job correction/1-overlapper/blocks/000028.dat FAILED.
--   job correction/1-overlapper/blocks/000029.dat FAILED.
--   job correction/1-overlapper/blocks/000030.dat FAILED.
--   job correction/1-overlapper/blocks/000031.dat FAILED.
--   job correction/1-overlapper/blocks/000032.dat FAILED.
--   job correction/1-overlapper/blocks/000033.dat FAILED.
--   job correction/1-overlapper/blocks/000034.dat FAILED.
--   job correction/1-overlapper/blocks/000035.dat FAILED.
--

ABORT:
ABORT: Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

Tried re-running and tweaking the command based on previous issues, but without success. Help?

brianwalenz commented 5 years ago

I'm going to guess it's still a java problem. Look in the correction/1-overlapper/mhap*out for errors.

It could also be the 'start method' that the logging is complaining about at the start of your paste. Two lines before that is a suggestion to set gridOptions="-S /bin/bash" (I'm guessing on '/bin/bash'); maybe we're picking an invalid shell.
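
Something like this (path guessed from your run directory) should pull the java errors out of all the overlapper logs at once:

    grep -B 2 "Error occurred" correction/1-overlapper/*.out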

GeoMicroSoares commented 5 years ago

Ok then, from cat /ibers/ernie/scratch/correction/1-overlapper/precompute.1.out I'm getting mixed messages. In some instances I'm getting java failure errors like this:

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap

Whilst in other instances it seems java is found but the VM still can't be initialized.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java

Found perl:
on "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linu
Found perl:
./lib/site_perl/canu/Grid_Cloud.pm line 74.
   /usr/bin/perl
v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Found java:
ased on SGE_TASK_ID=1 and offset=30.
Dumping reads from 405001 to 418500 (inclusiv   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
enough space for 9437184KB object heap
Mhap failed.
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 6 based on SGE_TASK_ID=1 and offset=5.
Dumping reads from 67501 to 81000 (inclusive).
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.
skoren commented 5 years ago

In all cases it looks like your JVM is failing to start up. It is trying to reserve 10gb of memory and failing. This is what you configured it to use via cormhapMemory=10 cormhapThreads=1. The version check explicitly only requests 1gb of memory and it still seems to fail on your nodes.
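
As a quick sanity check, you could try starting a bare JVM with the same heap directly on one of the compute nodes (run it via qrsh or qsub so it sees the same limits as the canu jobs; heap size taken from the "Could not reserve enough space for 9437184KB" messages above):

    /cm/shared/apps/java/jdk1.8.0_40/bin/java -Xms9216m -Xmx9216m -version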

I noticed you also switched from mem_free to h_vmem, why is that? h_vmem is usually not the right option as it doesn't hold the memory (qconf -sc will say it is not consumable). What does your qconf -sc report for mem_free and h_vmem? Using a non-consumable resource will let too many jobs get scheduled on a single node and overload its memory. How much memory does the node the precompute jobs land on have? How many jobs are getting scheduled on each node simultaneously?

GeoMicroSoares commented 5 years ago

Hi @skoren ,

The switch from mem_free to h_vmem was due to a re-analysis of qconf -sc outputs. As you see below, h_vmem is the consumable option.

$ qconf -sc
#name               shortcut   type        relop requestable consumable default  urgency 
#----------------------------------------------------------------------------------------
...
h_vmem              h_vmem     MEMORY      <=    YES         JOB        2G       0
mem_free            mf         MEMORY      <=    YES         NO         0        0
...
# >#< starts a comment but comments are not saved across edits --------

Not too sure if I can answer the rest of your questions properly, but when canu starts running I can see the following:

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3192050 0.58333 cormhap_mo ans74        r     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 1
3192050 0.54167 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 2
3192050 0.52778 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 3
3192050 0.52083 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 4
3192050 0.51667 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 5
3192050 0.51389 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 6
3192050 0.51190 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 7
3192050 0.51042 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 8
3192050 0.50926 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 9
3192050 0.50833 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 10
3192051 0.50758 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 1
3192051 0.50694 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 2
3192051 0.50641 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 3
3192051 0.50595 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 4
3192051 0.50556 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 5
3192051 0.50521 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 6
3192051 0.50490 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node009.cm.cluster           1 7
3192051 0.50463 cormhap_mo ans74        r     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 8
3192051 0.50439 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 9
3192051 0.50417 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 10
3192052 0.50397 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 1
3192052 0.50379 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 2
3192052 0.50362 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 3
3192052 0.50347 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 4
3192052 0.50333 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 5
3192052 0.50321 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 6
3192052 0.50309 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 7
3192052 0.50298 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 8
3192052 0.50287 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 9
3192052 0.50278 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 10
3192053 0.50269 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 1
3192053 0.50260 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 2
3192053 0.50253 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 3
3192053 0.50245 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 4
3192053 0.50238 cormhap_mo ans74        t     03/12/2019 17:20:17 amd.q@node008.cm.cluster           1 5
3192054 0.00000 canu_moncl ans74        hqw   03/12/2019 17:20:03                                    1      

And these nodes have:

node008 (amd.q) - 64 cores, 4 x AMD Opteron(tm) Processor 6376, 256 GB RAM
node009 (amd.q) - 64 cores, 4 x AMD Opteron(tm) Processor 6376, 256 GB RAM

Hope this helps!

brianwalenz commented 5 years ago

I vaguely remember encountering something like this when I was trying to run many, many little assemblies at the same time. I think that, even though I was within the memory limits of the machine (as you are), I couldn't get java to initialize.

As a test, try submitting only one or two jobs to the grid. The submit commands are saved in the *jobSubmit*.sh scripts; just change the -t parameter to a single digit, or 1-2, or something like that.
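
For example, something along these lines (file name assumed, adjust to whatever your 1-overlapper directory actually contains):

    cd correction/1-overlapper
    sed 's/-t 1-[0-9]*/-t 1-2/' precompute.jobSubmit-01.sh > precompute.test.sh
    sh precompute.test.sh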

If that works, change src/pipelines/canu/Execution.pm at line 908 from:

    if (uc(getGlobal("gridEngine")) eq "SGE") {
        $m /= $t;
    }

to

    #if (uc(getGlobal("gridEngine")) eq "SGE") {
    #    $m /= $t;
    #}

This changes the way canu requests memory, from 'memory per slot' (a consumable of YES) to 'memory per job' (a consumable of JOB, which your grid is using). You can now increase mhapThreads to 8, which will limit you to at most 8 jobs per machine, instead of the ~20 you're getting now.
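
To make the arithmetic concrete (numbers taken from your cormhap settings, assuming 8 threads per job):

    #  per-slot request (current behaviour):  10 GB / 8 slots  ->  qsub ... -pe multithread 8 -l h_vmem=1280m
    #  per-job request (after the edit):      10 GB total      ->  qsub ... -pe multithread 8 -l h_vmem=10g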

The other thing that could be happening is that other jobs on the same nodes are sucking up all the memory without accounting for it. Increasing the thread count will help here too, but just because it reduces the number of jobs per machine.

GeoMicroSoares commented 5 years ago

Hi @brianwalenz

Ran ./canu/Linux-amd64/bin/canu -p metagenomes -d /scratch/ genomeSize=0.5g -nanopore-raw metagenomes_folder/metagenomes.fastq java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" gridEngineResourceOption="-pe multithread THREADS -l h_vmem=MEMORY" merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 mhapThreads=8 gridOptionsExecutive="-l h_vmem=100g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash"

And still got the same error:

...
-- Mhap precompute jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/blocks/000001.dat FAILED.
--   job correction/1-overlapper/blocks/000002.dat FAILED.
--   job correction/1-overlapper/blocks/000003.dat FAILED.
--   job correction/1-overlapper/blocks/000004.dat FAILED.
--   job correction/1-overlapper/blocks/000005.dat FAILED.
--   job correction/1-overlapper/blocks/000006.dat FAILED.
--   job correction/1-overlapper/blocks/000007.dat FAILED.
--   job correction/1-overlapper/blocks/000008.dat FAILED.
--   job correction/1-overlapper/blocks/000009.dat FAILED.
--   job correction/1-overlapper/blocks/000010.dat FAILED.
--   job correction/1-overlapper/blocks/000011.dat FAILED.
--   job correction/1-overlapper/blocks/000012.dat FAILED.
--   job correction/1-overlapper/blocks/000013.dat FAILED.
--   job correction/1-overlapper/blocks/000014.dat FAILED.
--   job correction/1-overlapper/blocks/000015.dat FAILED.
--   job correction/1-overlapper/blocks/000016.dat FAILED.
--   job correction/1-overlapper/blocks/000017.dat FAILED.
--   job correction/1-overlapper/blocks/000018.dat FAILED.
--   job correction/1-overlapper/blocks/000019.dat FAILED.
--   job correction/1-overlapper/blocks/000020.dat FAILED.
--   job correction/1-overlapper/blocks/000021.dat FAILED.
--   job correction/1-overlapper/blocks/000022.dat FAILED.
--   job correction/1-overlapper/blocks/000023.dat FAILED.
--   job correction/1-overlapper/blocks/000024.dat FAILED.
--   job correction/1-overlapper/blocks/000025.dat FAILED.
--   job correction/1-overlapper/blocks/000026.dat FAILED.
--   job correction/1-overlapper/blocks/000027.dat FAILED.
--   job correction/1-overlapper/blocks/000028.dat FAILED.
--   job correction/1-overlapper/blocks/000029.dat FAILED.
--   job correction/1-overlapper/blocks/000030.dat FAILED.
--   job correction/1-overlapper/blocks/000031.dat FAILED.
--   job correction/1-overlapper/blocks/000032.dat FAILED.
--   job correction/1-overlapper/blocks/000033.dat FAILED.
--   job correction/1-overlapper/blocks/000034.dat FAILED.
--   job correction/1-overlapper/blocks/000035.dat FAILED.
--

ABORT:
ABORT: Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
skoren commented 5 years ago

I don't think the jobs picked up the new parameters; remove the 1-overlapper folder completely and re-run the command. If it still fails, post the precompute.sh and precompute.jobSubmit-01.sh files.
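
In other words, something like this (adjust the path to your assembly -d directory):

    cd /ibers/ernie/scratch            #  the canu -d directory
    rm -rf correction/1-overlapper
    #  then re-run the exact same canu command; the directory and its scripts get regenerated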

GeoMicroSoares commented 5 years ago

Hi again @skoren, it did fail again; here are the files you asked for:

$ cat /ibers/ernie/scratch/ans74/correction/1-overlapper/precompute.sh 
#!/bin/sh

#  Path to Canu.

syst=`uname -s`
arch=`uname -m | sed s/x86_64/amd64/`

bin="/ibers/ernie/home/ans74/canu/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="/ibers/ernie/home/ans74/canu"
fi

#  Report paths.

echo ""
echo "Found perl:"
echo "  " `which perl`
echo "  " `perl --version | grep version`
echo ""
echo "Found java:"
echo "  " `which /cm/shared/apps/java/jdk1.8.0_40/bin/java`
echo "  " `/cm/shared/apps/java/jdk1.8.0_40/bin/java -showversion 2>&1 | head -n 1`
echo ""
echo "Found canu:"
echo "  " $bin/canu
echo "  " `$bin/canu -version`
echo ""

#  Environment for any object storage.

export CANU_OBJECT_STORE_CLIENT=
export CANU_OBJECT_STORE_CLIENT_UA=
export CANU_OBJECT_STORE_CLIENT_DA=
export CANU_OBJECT_STORE_NAMESPACE=moncler
export CANU_OBJECT_STORE_PROJECT=

#  Discover the job ID to run, from either a grid environment variable and a
#  command line offset, or directly from the command line.
#
if [ x$SGE_TASK_ID = x -o x$SGE_TASK_ID = xundefined -o x$SGE_TASK_ID = x0 ]; then
  baseid=$1
  offset=0
else
  baseid=$SGE_TASK_ID
  offset=$1
fi
if [ x$offset = x ]; then
  offset=0
fi
if [ x$baseid = x ]; then
  echo Error: I need SGE_TASK_ID set, or a job index on the command line.
  exit
fi
jobid=`expr $baseid + $offset`
if [ x$SGE_TASK_ID = x ]; then
  echo Running job $jobid based on command line options.
else
  echo Running job $jobid based on SGE_TASK_ID=$SGE_TASK_ID and offset=$offset.
fi

if [ $jobid -eq 1 ] ; then
  rge="-b 1 -e 13500"
  job="000001"
fi

if [ $jobid -eq 2 ] ; then
  rge="-b 13501 -e 27000"
  job="000002"
fi

if [ $jobid -eq 3 ] ; then
  rge="-b 27001 -e 40500"
  job="000003"
fi

if [ $jobid -eq 4 ] ; then
  rge="-b 40501 -e 54000"
  job="000004"
fi

if [ $jobid -eq 5 ] ; then
  rge="-b 54001 -e 67500"
  job="000005"
fi

if [ $jobid -eq 6 ] ; then
  rge="-b 67501 -e 81000"
  job="000006"
fi

if [ $jobid -eq 7 ] ; then
  rge="-b 81001 -e 94500"
  job="000007"
fi

if [ $jobid -eq 8 ] ; then
  rge="-b 94501 -e 108000"
  job="000008"
fi

if [ $jobid -eq 9 ] ; then
  rge="-b 108001 -e 121500"
  job="000009"
fi

if [ $jobid -eq 10 ] ; then
  rge="-b 121501 -e 135000"
  job="000010"
fi

if [ $jobid -eq 11 ] ; then
  rge="-b 135001 -e 148500"
  job="000011"
fi

if [ $jobid -eq 12 ] ; then
  rge="-b 148501 -e 162000"
  job="000012"
fi

if [ $jobid -eq 13 ] ; then
  rge="-b 162001 -e 175500"
  job="000013"
fi

if [ $jobid -eq 14 ] ; then
  rge="-b 175501 -e 189000"
  job="000014"
fi

if [ $jobid -eq 15 ] ; then
  rge="-b 189001 -e 202500"
  job="000015"
fi

if [ $jobid -eq 16 ] ; then
  rge="-b 202501 -e 216000"
  job="000016"
fi

if [ $jobid -eq 17 ] ; then
  rge="-b 216001 -e 229500"
  job="000017"
fi

if [ $jobid -eq 18 ] ; then
  rge="-b 229501 -e 243000"
  job="000018"
fi

if [ $jobid -eq 19 ] ; then
  rge="-b 243001 -e 256500"
  job="000019"
fi

if [ $jobid -eq 20 ] ; then
  rge="-b 256501 -e 270000"
  job="000020"
fi

if [ $jobid -eq 21 ] ; then
  rge="-b 270001 -e 283500"
  job="000021"
fi

if [ $jobid -eq 22 ] ; then
  rge="-b 283501 -e 297000"
  job="000022"
fi

if [ $jobid -eq 23 ] ; then
  rge="-b 297001 -e 310500"
  job="000023"
fi

if [ $jobid -eq 24 ] ; then
  rge="-b 310501 -e 324000"
  job="000024"
fi

if [ $jobid -eq 25 ] ; then
  rge="-b 324001 -e 337500"
  job="000025"
fi

if [ $jobid -eq 26 ] ; then
  rge="-b 337501 -e 351000"
  job="000026"
fi

if [ $jobid -eq 27 ] ; then
  rge="-b 351001 -e 364500"
  job="000027"
fi

if [ $jobid -eq 28 ] ; then
  rge="-b 364501 -e 378000"
  job="000028"
fi

if [ $jobid -eq 29 ] ; then
  rge="-b 378001 -e 391500"
  job="000029"
fi

if [ $jobid -eq 30 ] ; then
  rge="-b 391501 -e 405000"
  job="000030"
fi

if [ $jobid -eq 31 ] ; then
  rge="-b 405001 -e 418500"
  job="000031"
fi

if [ $jobid -eq 32 ] ; then
  rge="-b 418501 -e 432000"
  job="000032"
fi

if [ $jobid -eq 33 ] ; then
  rge="-b 432001 -e 445500"
  job="000033"
fi

if [ $jobid -eq 34 ] ; then
  rge="-b 445501 -e 459000"
  job="000034"
fi

if [ $jobid -eq 35 ] ; then
  rge="-b 459001 -e 459052"
  job="000035"
fi

if [ x$job = x ] ; then
  echo Job partitioning error.  jobid $jobid is invalid.
  exit 1
fi

if [ ! -d ./blocks ]; then
  mkdir -p ./blocks
fi

if [ -e blocks/$job.dat ]; then
  exists=true
else
  exists=false
fi
if [ $exists = true ] ; then
  echo Job previously completed successfully.
  exit
fi

#  Remove any previous result.
rm -f ./blocks/$job.input.dat

$bin/sqStoreDumpFASTQ \
  -S ../../moncler.seqStore \
  $rge \
  -nolibname \
  -noreadname \
  -fasta \
  -o ./blocks/$job.input \
|| \
mv -f ./blocks/$job.input.fasta ./blocks/$job.input.fasta.FAILED

if [ ! -e ./blocks/$job.input.fasta ] ; then
  echo Failed to extract fasta.
  exit 1
fi

echo ""
echo Starting mhap precompute.
echo ""

#  So mhap writes its output in the correct spot.
cd ./blocks

/cm/shared/apps/java/jdk1.8.0_40/bin/java -d64 -XX:ParallelGCThreads=8 -server -Xms9216m -Xmx9216m \
  -jar  $bin/../share/java/classes/mhap-2.1.3.jar  \
  --repeat-weight 0.9 --repeat-idf-scale 10 -k 16 \
  --store-full-id \
  --num-hashes 768 \
  --num-min-matches 2 \
  --ordered-sketch-size 1536 \
  --ordered-kmer-size 12 \
  --threshold 0.78 \
  --filter-threshold 0.0000001 \
  --min-olap-length 500 \
  --num-threads 8 \
  -f  ../../0-mercounts/moncler.ms16.ignore.gz  \
  -p  ./$job.input.fasta  \
  -q  .  \
&& \
mv -f ./$job.input.dat ./$job.dat

if [ ! -e ./$job.dat ] ; then
  echo Mhap failed.
  exit 1
fi

#  Clean up, remove the fasta input
rm -f ./$job.input.fasta

exit 0
$ cat /ibers/ernie/scratch/ans74/correction/1-overlapper/precompute.jobSubmit-01.sh 
#!/bin/sh

qsub \
  -pe multithread 8 -l h_vmem=1280m -S /bin/bash -j y -o precompute.\$TASK_ID.out \
  -cwd -N "cormhap_moncler" \
  -t 1-5 \
  ./precompute.sh 0 \
> ./precompute.jobSubmit-01.out 2>&1
skoren commented 5 years ago

The parameters canu is passing are correct but I think your grid is still not handling the resource request correctly. How many jobs end up on the same node now?
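
A quick way to count that from qstat (field number guessed from the listing you pasted earlier):

    qstat -u ans74 -s r | awk 'NR > 2 { print $8 }' | sort | uniq -c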

GeoMicroSoares commented 5 years ago

@skoren yeah, not too many I think, which is what we'd ideally expect, no?

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
3192224 1.01667 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node006.cm.cluster           8 2
3192224 0.80833 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node006.cm.cluster           8 3
3192224 0.73889 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node006.cm.cluster           8 4
3192224 0.70417 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node007.cm.cluster           8 5
3192225 0.68333 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node007.cm.cluster           8 1
3192225 0.66944 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node008.cm.cluster           8 2
3192225 0.65952 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node008.cm.cluster           8 3
3192225 0.65208 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node008.cm.cluster           8 4
3192225 0.64630 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node008.cm.cluster           8 5
3192226 0.64167 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node008.cm.cluster           8 1
3192226 0.63472 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node011.cm.cluster           8 3
3192226 0.63205 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node011.cm.cluster           8 4
3192226 0.62976 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node011.cm.cluster           8 5
3192227 0.62778 cormhap_mo ans74        r     03/13/2019 15:40:02 amd.q@node011.cm.cluster           8 1
3192227 0.62451 cormhap_mo ans74        qw    03/13/2019 15:39:37                                    8 3-5:1
3192228 0.00000 canu_moncl ans74        hqw   03/13/2019 15:39:37                                    1    

It still fails after this btw, in the same way as before, even after removing the correction directory.

skoren commented 5 years ago

Yes, the number per machine seems OK. Does your grid enforce memory limits on submitted jobs and kill them if h_vmem is exceeded? What does qacct report for one of the failed jobs, and is the output in precompute.*.out still the same?

GeoMicroSoares commented 5 years ago

Hi again,

The grid does cancel jobs if memory limits are exceeded.

Checked qacct on the last cormhap job running before the whole thing got cancelled:

$ qacct -j 3192235
==============================================================
qname        amd.q               
hostname     node008.cm.cluster  
group        users               
owner        ans74               
project      NONE                
department   defaultdepartment   
jobname      cormhap_moncler     
jobnumber    3192235             
taskid       1                   
account      sge                 
priority     0                   
qsub_time    Wed Mar 13 15:40:38 2019
start_time   Wed Mar 13 15:41:02 2019
end_time     Wed Mar 13 15:41:02 2019
granted_pe   multithread         
slots        8                   
failed       0    
exit_status  1                   
ru_wallclock 0            
ru_utime     0.244        
ru_stime     0.159        
ru_maxrss    19496               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    15051               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   82976               
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     1624                
ru_nivcsw    1316                
cpu          0.403        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined
==============================================================
qname        amd.q               
hostname     node008.cm.cluster  
group        users               
owner        ans74               
project      NONE                
department   defaultdepartment   
jobname      cormhap_moncler     
jobnumber    3192235             
taskid       2                   
account      sge                 
priority     0                   
qsub_time    Wed Mar 13 15:40:38 2019
start_time   Wed Mar 13 15:41:02 2019
end_time     Wed Mar 13 15:41:03 2019
granted_pe   multithread         
slots        8                   
failed       0    
exit_status  1                   
ru_wallclock 1            
ru_utime     0.263        
ru_stime     0.142        
ru_maxrss    19564               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    15068               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   82984               
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     2180                
ru_nivcsw    1319                
cpu          0.405        
mem          0.003             
io           0.070             
iow          0.000             
maxvmem      44.680M
arid         undefined
==============================================================
qname        amd.q               
hostname     node011.cm.cluster  
group        users               
owner        ans74               
project      NONE                
department   defaultdepartment   
jobname      cormhap_moncler     
jobnumber    3192235             
taskid       3                   
account      sge                 
priority     0                   
qsub_time    Wed Mar 13 15:40:38 2019
start_time   Wed Mar 13 15:41:17 2019
end_time     Wed Mar 13 15:41:17 2019
granted_pe   multithread         
slots        8                   
failed       0    
exit_status  1                   
ru_wallclock 0            
ru_utime     0.251        
ru_stime     0.143        
ru_maxrss    19540               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    15062               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   82872               
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     1671                
ru_nivcsw    1316                
cpu          0.394        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined
==============================================================
qname        amd.q               
hostname     node011.cm.cluster  
group        users               
owner        ans74               
project      NONE                
department   defaultdepartment   
jobname      cormhap_moncler     
jobnumber    3192235             
taskid       4                   
account      sge                 
priority     0                   
qsub_time    Wed Mar 13 15:40:38 2019
start_time   Wed Mar 13 15:41:17 2019
end_time     Wed Mar 13 15:41:17 2019
granted_pe   multithread         
slots        8                   
failed       0    
exit_status  1                   
ru_wallclock 0            
ru_utime     0.254        
ru_stime     0.144        
ru_maxrss    19552               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    15065               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   80728               
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     1911                
ru_nivcsw    1285                
cpu          0.398        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined
==============================================================
qname        amd.q               
hostname     node011.cm.cluster  
group        users               
owner        ans74               
project      NONE                
department   defaultdepartment   
jobname      cormhap_moncler     
jobnumber    3192235             
taskid       5                   
account      sge                 
priority     0                   
qsub_time    Wed Mar 13 15:40:38 2019
start_time   Wed Mar 13 15:41:17 2019
end_time     Wed Mar 13 15:41:17 2019
granted_pe   multithread         
slots        8                   
failed       0    
exit_status  1                   
ru_wallclock 0            
ru_utime     0.138        
ru_stime     0.088        
ru_maxrss    19432               
ru_ixrss     0                   
ru_ismrss    0                   
ru_idrss     0                   
ru_isrss     0                   
ru_minflt    15035               
ru_majflt    0                   
ru_nswap     0                   
ru_inblock   0                   
ru_oublock   576                 
ru_msgsnd    0                   
ru_msgrcv    0                   
ru_nsignals  0                   
ru_nvcsw     975                 
ru_nivcsw    32                  
cpu          0.226        
mem          0.000             
io           0.000             
iow          0.000             
maxvmem      0.000
arid         undefined

And yes, the error in precompute.*.out is still the same:

...
==> /ibers/ernie/scratch/ans74/correction/1-overlapper/precompute.5.out <==
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 35 based on SGE_TASK_ID=5 and offset=30.
Dumping reads from 459001 to 459052 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.
...
skoren commented 5 years ago

I think the issue is the h_vmem request and @brianwalenz's previous comment. Canu is requesting memory per core but your grid expects it to be per job, so as soon as it tries to allocate 10gb it gets killed, since that's more than the ~1gb it requested. That doesn't explain why it wasn't running with the previous setting of 1 core and 10gb memory, though. I suspect your JVM is also configured to start up with too much overhead memory on top of the user-specified request. Edit the precompute.jobSubmit-01.sh file to be:

#!/bin/sh

qsub \
  -pe multithread 8 -l h_vmem=20800m -S /bin/bash -j y -o precompute.\$TASK_ID.out \
  -cwd -N "cormhap_moncler" \
  -t 1 \
  ./precompute.sh 0 \
> ./precompute.jobSubmit-01.out 2>&1

and run it to see if that job runs successfully. If that works, you need to make the code change Brian suggested earlier, otherwise all your jobs will get killed for incorrect memory requests (or you can run everything with 1 core). You also need to explicitly request more memory for the java processes to compensate for the JVM configuration (gridOptionscormhap="-l h_vmem=20g").
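
Roughly, the eventual re-run would then look something like this (a sketch based on your earlier command with the thread count bumped and the per-stage option added; keep your other options — meryl, gridOptionsExecutive, gridEngineArrayMaxJobs — as before, and tune the 20g to whatever overhead the test job shows):

    ./canu/Linux-amd64/bin/canu -p metagenomes -d /scratch/ genomeSize=0.5g \
      -nanopore-raw metagenomes_folder/metagenomes.fastq \
      java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
      gridEngineResourceOption="-pe multithread THREADS -l h_vmem=MEMORY" \
      cormhapMemory=10 cormhapThreads=8 \
      gridOptionscormhap="-l h_vmem=20g" \
      gridOptions="-S /bin/bash"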

GeoMicroSoares commented 5 years ago

Ok then, I modified precompute.jobSubmit-01.sh and ran it, and it seems that despite the java errors it still ran?

$ cat precompute.1.out 

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
M

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
M

Found canu:
   /ibers/ernie/home/ans74Running job 6 based on SGE_Use of implicit split toDumping reads from 67501 to 81000 (inclusive).
nu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).
a
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
TASK_ID=1 and offset=5.
Dumping reads from 67501 to 81000 (inclusive).
ng initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
Mhap failed.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java

Found perl:
rred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
java

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Grid_Cloud.pm line 74.
5.
Dumping reads from 202501 to 216000 (inclusive).
26004b26f21686f0c324ab69d37c5a885)

Running job 11 based on SGE_TASK_ID=1 and offset=10.
Dumping reads from 135001 to 148500 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 11 based on SGE_TASK_ID=1 and offset=10.
Dumping reads from 135001 to 148500 (inclusive).
 initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java

Found perl:
rred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linu   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
ccurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/LiDumping reads from 202501 to 216000 (inclusive).
 deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 21 based on SGE_TASK_ID=1 and offset=20.
Dumping reads from 270001 to 283500 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

rve enough space for 9437184KB object heap
Mhap failed.
Running job 21 based on SGE_TASK_ID=1 and offset=20.
Dumping reads from 270001 to 283500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 26 based on SGE_TASK_ID=1 and offset=25.
Dumping reads from 337501 to 351000 (inclusive).
/cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 31 based on SGE_TASK_ID=1 and offset=30.
Dumping reads from 405001 to 418500 (inclusive).
25.
Dumping reads from 337501 to 351000 (inclusive).
 initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 31 based on SGE_TASK_ID=1 and offset=30.
Dumping reads from 405001 to 418500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/..   Error occurred during initialization of 
Found perl:
u:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
-amd64/bin/canu
Running job 6 based on SGE_TASK_ID=1 and offset=5.
Dumping reads from 67501 to 81000 (inclusive).
bers/ernie/home/ans74/canu/Linu   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).
fset=5.
Dumping reads from 67501 to 81000 (inclusive).
ng initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
Mhap failed.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   /usr/bin/perl
 during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/   Error occurred durUse of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
    Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
+151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 16 based on SGE_TASK_ID=1 and offset=15.
Dumping reads from 202501 to 216000 (inclusive).
10.
Dumping reads from 135001 to 148500 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

rve enough space for 9437184KB object heap
Mhap failed.
Running job 11 based on SGE_TASK_ID=1 and offset=10.
Dumping reads from 135001 to 148500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Clou   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
6 based on SGE_TASK_ID=1 and offset=15.
Dumping reads from 202501 to 216000 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 21 based on SGE_TASK_ID=1 and offset=20.
Dumping reads from 270001 to 283500 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
iled.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

rve enough space for 9437184KB object heap
Mhap failed.
Running job 21 based on SGE_TASK_ID=1 and offset=20.
Dumping reads from 270001 to 283500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/.   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/hom   Error occurred during initializUse of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_peDumping reads from 337501 to 351000 (inclusive).
rs/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 26 based on SGE_TASK_ID=1 and offset=25.
Dumping reads from 337501 to 351000 (inclusive).
26004b26f21686f0c324ab69d37c5a885)

Running job 31 based on SGE_TASK_ID=1 and offset=30.
Dumping reads from 405001 to 418500 (inclusive).
 initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   Error occurred during initialization of VM

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 31 based on SGE_TASK_ID=1 and offset=30.
Dumping reads from 405001 to 418500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not reserve enough space for 9437184KB object heap
Mhap failed.

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 1 based on SGE_TASK_ID=1 and offset=0.
Dumping reads from 1 to 13500 (inclusive).

Starting mhap precompute.

Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 768
--num-min-matches = 2
--num-threads = 8
--ordered-kmer-size = 12
--ordered-sketch-size = 1536
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.78
--version = false
-f = ../../0-mercounts/moncler.ms16.ignore.gz
-h = false
-k = 16
-p = ./000001.input.fasta
-q = .
-s = 

Reading in filter file ../../0-mercounts/moncler.ms16.ignore.gz.
Read in values for repeat 74668 and 0
Warning, k-mer filter file has zero elements.
Initializing
Trying size 74668 and it was successfull
Initialized
Time (s) to read filter file: 2.398091821
Read in k-mer filter for sizes: [16]
Processing FASTA files for binary compression...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Processed 27000 sequences (fwd and rev).
Read, hashed, and stored file ./000001.input.fasta to ./000001.input.dat.
Time (s): 482.01313936300005
Total time (s): 482.01364922600004

Changed the code as @brianwalenz suggested and re-ran my canu command ca. 50 min ago, and things seem to be going fine for now! Will let you know if anything else fails. Cheers!

skoren commented 5 years ago

SGE keeps appending to the log rather than overwriting it, so all those previous errors are from earlier trials. The last one was OK and ran without issue. Check qacct for the job to see what memory it actually used, which will tell you what the JVM overhead is.
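
For example (substitute the job ID of the test task that just completed; the interesting field is maxvmem):

    qacct -j <jobid> | grep -E "taskid|maxvmem"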

GeoMicroSoares commented 5 years ago

@skoren the job did run indeed, but it then got to the Mhap overlap stage and failed again. Any ideas? canu.out output below:

...
-- BEGIN CORRECTION
--
--
-- Mhap overlap jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/results/000001.ovb FAILED.
--   job correction/1-overlapper/results/000002.ovb FAILED.
--   job correction/1-overlapper/results/000003.ovb FAILED.
--   job correction/1-overlapper/results/000004.ovb FAILED.
--   job correction/1-overlapper/results/000005.ovb FAILED.
--   job correction/1-overlapper/results/000006.ovb FAILED.
--   job correction/1-overlapper/results/000007.ovb FAILED.
--   job correction/1-overlapper/results/000008.ovb FAILED.
--   job correction/1-overlapper/results/000009.ovb FAILED.
--   job correction/1-overlapper/results/000010.ovb FAILED.
--   job correction/1-overlapper/results/000011.ovb FAILED.
--   job correction/1-overlapper/results/000012.ovb FAILED.
--   job correction/1-overlapper/results/000013.ovb FAILED.
--   job correction/1-overlapper/results/000014.ovb FAILED.
--   job correction/1-overlapper/results/000015.ovb FAILED.
--   job correction/1-overlapper/results/000016.ovb FAILED.
--   job correction/1-overlapper/results/000017.ovb FAILED.
--   job correction/1-overlapper/results/000018.ovb FAILED.
--   job correction/1-overlapper/results/000019.ovb FAILED.
--   job correction/1-overlapper/results/000020.ovb FAILED.
--   job correction/1-overlapper/results/000021.ovb FAILED.
--   job correction/1-overlapper/results/000022.ovb FAILED.
--   job correction/1-overlapper/results/000023.ovb FAILED.
--   job correction/1-overlapper/results/000024.ovb FAILED.
--   job correction/1-overlapper/results/000025.ovb FAILED.
--   job correction/1-overlapper/results/000026.ovb FAILED.
--   job correction/1-overlapper/results/000027.ovb FAILED.
--   job correction/1-overlapper/results/000028.ovb FAILED.
--   job correction/1-overlapper/results/000029.ovb FAILED.
--   job correction/1-overlapper/results/000030.ovb FAILED.
--   job correction/1-overlapper/results/000031.ovb FAILED.
--   job correction/1-overlapper/results/000032.ovb FAILED.
--   job correction/1-overlapper/results/000033.ovb FAILED.
--   job correction/1-overlapper/results/000034.ovb FAILED.
--   job correction/1-overlapper/results/000035.ovb FAILED.
--   job correction/1-overlapper/results/000036.ovb FAILED.
--   job correction/1-overlapper/results/000037.ovb FAILED.
--   job correction/1-overlapper/results/000038.ovb FAILED.
--   job correction/1-overlapper/results/000039.ovb FAILED.
--   job correction/1-overlapper/results/000040.ovb FAILED.
--   job correction/1-overlapper/results/000041.ovb FAILED.
--   job correction/1-overlapper/results/000042.ovb FAILED.
--   job correction/1-overlapper/results/000043.ovb FAILED.
--   job correction/1-overlapper/results/000044.ovb FAILED.
--   job correction/1-overlapper/results/000045.ovb FAILED.
--   job correction/1-overlapper/results/000046.ovb FAILED.
--   job correction/1-overlapper/results/000047.ovb FAILED.
--   job correction/1-overlapper/results/000048.ovb FAILED.
--   job correction/1-overlapper/results/000049.ovb FAILED.
--   job correction/1-overlapper/results/000050.ovb FAILED.
--   job correction/1-overlapper/results/000051.ovb FAILED.
--   job correction/1-overlapper/results/000052.ovb FAILED.
--   job correction/1-overlapper/results/000053.ovb FAILED.
--   job correction/1-overlapper/results/000054.ovb FAILED.
--   job correction/1-overlapper/results/000055.ovb FAILED.
--   job correction/1-overlapper/results/000056.ovb FAILED.
--   job correction/1-overlapper/results/000057.ovb FAILED.
--   job correction/1-overlapper/results/000058.ovb FAILED.
--   job correction/1-overlapper/results/000059.ovb FAILED.
--   job correction/1-overlapper/results/000060.ovb FAILED.
--   job correction/1-overlapper/results/000061.ovb FAILED.
--   job correction/1-overlapper/results/000062.ovb FAILED.
--   job correction/1-overlapper/results/000063.ovb FAILED.
--   job correction/1-overlapper/results/000064.ovb FAILED.
--   job correction/1-overlapper/results/000065.ovb FAILED.
--   job correction/1-overlapper/results/000066.ovb FAILED.
--   job correction/1-overlapper/results/000067.ovb FAILED.
--   job correction/1-overlapper/results/000068.ovb FAILED.
--   job correction/1-overlapper/results/000069.ovb FAILED.
--   job correction/1-overlapper/results/000070.ovb FAILED.
--   job correction/1-overlapper/results/000071.ovb FAILED.
--   job correction/1-overlapper/results/000072.ovb FAILED.
--   job correction/1-overlapper/results/000073.ovb FAILED.
--   job correction/1-overlapper/results/000074.ovb FAILED.
--   job correction/1-overlapper/results/000075.ovb FAILED.
--   job correction/1-overlapper/results/000076.ovb FAILED.
--   job correction/1-overlapper/results/000077.ovb FAILED.
--   job correction/1-overlapper/results/000078.ovb FAILED.
--   job correction/1-overlapper/results/000079.ovb FAILED.
--   job correction/1-overlapper/results/000080.ovb FAILED.
--   job correction/1-overlapper/results/000081.ovb FAILED.
--   job correction/1-overlapper/results/000082.ovb FAILED.
--   job correction/1-overlapper/results/000083.ovb FAILED.
--

ABORT:
ABORT: Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
skoren commented 5 years ago

Yes, it would fail because, as I said in my previous comment, your JVM seems to be configured to use a GC with more overhead than the default.

Did you check via qacct what the precompute jobs required to run before? I know that 20g is enough for mhap, but less might be sufficient if you check qacct. You'll be able to run by adding the parameter I listed earlier: gridOptionscormhap="-l h_vmem=20g".

GeoMicroSoares commented 5 years ago

Hi again @skoren, the job keeps failing with gridOptionscormhap="-l h_vmem=20g". I tried adding gridOptionsovb="-l h_vmem=20g" as well after deleting the correction folder and restarting the job, but I keep getting the same output as above. Current command I'm using:

./canu/Linux-amd64/bin/canu -p moncler -d /ibers/ernie/scratch/ans74/ \
    genomeSize=0.5g -nanopore-raw montcler_metagenomes/moncler_surface.fastq \
    java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
    gridEngineResourceOption="-pe multithread THREADS -l h_vmem=MEMORY" \
    merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 mhapThreads=8 \
    gridOptionsExecutive="-l h_vmem=100g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash" \
    gridOptionscormhap="-l h_vmem=20g" gridOptionsovb="-l h_vmem=20g"

skoren commented 5 years ago

Interesting, is it the same memory allocation error in the mhap.*.out files? Does the mhap.jobSubmit-*.sh file have the correct memory request of 20g? If not, you can remove the mhap.*.sh files to have Canu re-generate them. You could also try increasing the memory from 20 to 30 or 40g.
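For example, something along these lines should confirm the request and force regeneration (the directory is taken from the paths shown elsewhere in this thread; adjust to your run):

cd /ibers/ernie/scratch/ans74/correction/1-overlapper
# Confirm the generated submit scripts carry the 20g request
grep -- '-l h_vmem' mhap.jobSubmit-*.sh
# If they don't, remove the generated scripts so Canu rebuilds them on the next restart
rm -f mhap.sh mhap.jobSubmit-*.sh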

GeoMicroSoares commented 5 years ago

Hello again,

Seems the memory is being allocated properly now:

$ cat /ibers/ernie/scratch/ans74/correction/1-overlapper/mhap.1.out 

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/can   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/can  Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
 Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 6 based on SGE_TASK_ID=1 and offset=5.
Fetch blocks/000012.dat
Fetch blocks/000013.dat
Fetch blocks/000014.dat
Fetch blocks/000015.dat
Fetch blocks/000016.dat
Fetch blocks/000017.dat
Fetch blocks/000018.dat
Fetch blocks/000019.dat
Fetch blocks/000020.dat

Running block 000002 in query 000006

Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = true
--no-tf = false
--num-hashes = 768
--num-min-matches = 2
--num-threads = 8
--ordered-kmer-size = 12
--ordered-sketch-size = 1536
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--suprRunning with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 768
--num-min-matches = 2
--num-threads = 8
--ordered-kmer-size = 12
--ordered-sketch-size = 1536
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supreRunning with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = true
--no-tf = false
--num-hashes = 768
--num-min-matches = 2
--num-threads = 8
--ordered-kmer-size = 12
--ordered-sketch-size = 1536
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.78
--version = false
-f = 
-h = false
-k = 16
-p = 
-q = queries/000011
-s = ./blocks/000003.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Stored 27000 sequences in the index.
Processed 27000 unique sequences (fwd and rev).
Time (s) to read and hash from file: 9.368463321
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000022.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
ta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000002.dat.
Current # sequences loaded and processed from file: 10000...
pps/java/jdk1.8.0_40/bin/java
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 16 based on SGE_TASK_ID=1 and offset=15.
Fetch blocks/000032.dat
Fetch blocks/000033.dat
Fetch blocks/000034.dat
Fetch blocks/000035.dat

Running block 000004 in query 000016

Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 388.81848018700003
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000023.dat.
lse
--num-hashes = 768
--num-min-matches = 2
--num-threads = 8
--ordered-kmer-size = 12
--ordered-sketch-size = 1536
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.78
--version = false
-f = 
-h = false
-k = 16
-p = 
-q = queries/000016
-s = ./blocks/000004.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Stored 27000 sequences in the index.
Processed 27000 unique sequences (fwd and rev).
Time (s) to read and hash from file: 71.597211915
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000032.dat.
Current # sequences loaded and processed from file: 5000...
e, and output: 446.66281878700005
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000013.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
, and output: 442.119920625
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000003.dat.
Current # sequences loaded and processed from file: 10000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 396.61825338100005
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000014.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 452.641465598
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000033.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 525.7606683160001
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000024.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
, and output: 436.716022251
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000004.dat.
Current # sequences loaded and processed from file: 10000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 460.181400133
Opened fasta file /ibers/ernie/sProcessed 13500 to sequences.
Time (s) to score, hash to-file, and output: 434.45696625700003
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000034.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 478.818457217
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000025.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 5000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 451.241507886
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000005.dat.
Current # sequences loaded and processed from file: 10000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 391.620046946
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000016.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 334.040386351
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000026.dat.
Current # sequences loaded and processed from file: 5000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 481.10990021000003
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000035.dat.
Processed 52 to sequences.
Time (s) to score, hash to-file, and output: 2.126424492
Total scoring time (s): 1370.3603578260002
Total time (s): 1441.957862721
MinHash search time (s): 114.861784725
Total matches found: 57094315
Average number of matches per lookup: 1407.9284622213454
Average number of table elements processed per lookup: 6239.98387255869
Average number of table elements processed per match: 4.432031910707748
Average % of hashed sequences hit per lookup: 12.945912701022191
Average % of hashed sequences hit that are matches: 40.27950736657852
Average % of hashed sequences fully compared that are matches: 91.47507885824785
/ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert)
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 265.41871118
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000027.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 429.50884629
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000006.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 387.03043724400004
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000017.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 10000...
, and output: 363.21103995600004
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000028.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 422.926341099
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000007.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 10000...
, and output: 403.628449631
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000029.dat.
Processed 13500 to sequences.
Time (s) to score, hash to-file, and output: 357.499472409
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000019.dat.
Current # sequences loaded and processed from file: 5000...
e, and output: 425.095174063
Opened fasta file /ibers/ernie/scratch/ans74/correction/1-overlapper/blocks/000008.dat.
...

And it seems memory is being correctly requested as well:

$ cat /ibers/ernie/scratch/ans74/correction/1-overlapper/mhap.jobSubmit-01.sh
#!/bin/sh

qsub \
  -pe multithread 8 -l h_vmem=1280m -S /bin/bash -l h_vmem=20g -j y -o mhap.\$TASK_ID.out \
  -cwd -N "cormhap_moncler" \
  -t 1-5 \
  ./mhap.sh 0 \
> ./mhap.jobSubmit-01.out 2>&1

Will try with 40g now and get back to you.

skoren commented 5 years ago

If it's running with 20g, then increasing it won't change anything. There is no error in that posted log. What was the error you got from Canu? Is that the complete log? Is there a default time limit?
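One way to check for a queue-level run-time limit, assuming a standard SGE setup (the queue name all.q is a guess; qconf -sql lists the queues on your cluster):

# h_rt/s_rt are the hard/soft wall-clock limits; INFINITY means no default limit
qconf -sq all.q | grep -E 'h_rt|s_rt'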

GeoMicroSoares commented 5 years ago

Error from canu is still the same:

$ tail /ibers/ernie/scratch/ans74/canu.out -n90
--
-- BEGIN CORRECTION
--
--

-- Mhap overlap jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/results/000001.ovb FAILED.
--   job correction/1-overlapper/results/000002.ovb FAILED.
--   job correction/1-overlapper/results/000003.ovb FAILED.
--   job correction/1-overlapper/results/000004.ovb FAILED.
--   job correction/1-overlapper/results/000005.ovb FAILED.
--   job correction/1-overlapper/results/000006.ovb FAILED.
--   job correction/1-overlapper/results/000007.ovb FAILED.
--   job correction/1-overlapper/results/000008.ovb FAILED.
--   job correction/1-overlapper/results/000009.ovb FAILED.
--   job correction/1-overlapper/results/000010.ovb FAILED.
--   job correction/1-overlapper/results/000011.ovb FAILED.
--   job correction/1-overlapper/results/000012.ovb FAILED.
--   job correction/1-overlapper/results/000013.ovb FAILED.
--   job correction/1-overlapper/results/000014.ovb FAILED.
--   job correction/1-overlapper/results/000015.ovb FAILED.
--   job correction/1-overlapper/results/000016.ovb FAILED.
--   job correction/1-overlapper/results/000017.ovb FAILED.
--   job correction/1-overlapper/results/000018.ovb FAILED.
--   job correction/1-overlapper/results/000019.ovb FAILED.
--   job correction/1-overlapper/results/000020.ovb FAILED.
--   job correction/1-overlapper/results/000021.ovb FAILED.
--   job correction/1-overlapper/results/000022.ovb FAILED.
--   job correction/1-overlapper/results/000023.ovb FAILED.
--   job correction/1-overlapper/results/000024.ovb FAILED.
--   job correction/1-overlapper/results/000025.ovb FAILED.
--   job correction/1-overlapper/results/000026.ovb FAILED.
--   job correction/1-overlapper/results/000027.ovb FAILED.
--   job correction/1-overlapper/results/000028.ovb FAILED.
--   job correction/1-overlapper/results/000029.ovb FAILED.
--   job correction/1-overlapper/results/000030.ovb FAILED.
--   job correction/1-overlapper/results/000031.ovb FAILED.
--   job correction/1-overlapper/results/000032.ovb FAILED.
--   job correction/1-overlapper/results/000033.ovb FAILED.
--   job correction/1-overlapper/results/000034.ovb FAILED.
--   job correction/1-overlapper/results/000035.ovb FAILED.
--   job correction/1-overlapper/results/000036.ovb FAILED.
--   job correction/1-overlapper/results/000037.ovb FAILED.
--   job correction/1-overlapper/results/000038.ovb FAILED.
--   job correction/1-overlapper/results/000039.ovb FAILED.
--   job correction/1-overlapper/results/000040.ovb FAILED.
--   job correction/1-overlapper/results/000041.ovb FAILED.
--   job correction/1-overlapper/results/000042.ovb FAILED.
--   job correction/1-overlapper/results/000043.ovb FAILED.
--   job correction/1-overlapper/results/000044.ovb FAILED.
--   job correction/1-overlapper/results/000045.ovb FAILED.
--   job correction/1-overlapper/results/000046.ovb FAILED.
--   job correction/1-overlapper/results/000047.ovb FAILED.
--   job correction/1-overlapper/results/000048.ovb FAILED.
--   job correction/1-overlapper/results/000049.ovb FAILED.
--   job correction/1-overlapper/results/000050.ovb FAILED.
--   job correction/1-overlapper/results/000051.ovb FAILED.
--   job correction/1-overlapper/results/000052.ovb FAILED.
--   job correction/1-overlapper/results/000053.ovb FAILED.
--   job correction/1-overlapper/results/000054.ovb FAILED.
--   job correction/1-overlapper/results/000055.ovb FAILED.
--   job correction/1-overlapper/results/000056.ovb FAILED.
--   job correction/1-overlapper/results/000057.ovb FAILED.
--   job correction/1-overlapper/results/000058.ovb FAILED.
--   job correction/1-overlapper/results/000059.ovb FAILED.
--   job correction/1-overlapper/results/000060.ovb FAILED.
--   job correction/1-overlapper/results/000061.ovb FAILED.
--   job correction/1-overlapper/results/000062.ovb FAILED.
--   job correction/1-overlapper/results/000063.ovb FAILED.
--   job correction/1-overlapper/results/000064.ovb FAILED.
--   job correction/1-overlapper/results/000065.ovb FAILED.
--   job correction/1-overlapper/results/000066.ovb FAILED.
--   job correction/1-overlapper/results/000067.ovb FAILED.
--   job correction/1-overlapper/results/000068.ovb FAILED.
--   job correction/1-overlapper/results/000069.ovb FAILED.
--   job correction/1-overlapper/results/000070.ovb FAILED.
--   job correction/1-overlapper/results/000071.ovb FAILED.
--   job correction/1-overlapper/results/000072.ovb FAILED.
--   job correction/1-overlapper/results/000073.ovb FAILED.
--   job correction/1-overlapper/results/000074.ovb FAILED.
--   job correction/1-overlapper/results/000075.ovb FAILED.
--   job correction/1-overlapper/results/000076.ovb FAILED.
--   job correction/1-overlapper/results/000077.ovb FAILED.
--   job correction/1-overlapper/results/000078.ovb FAILED.
--   job correction/1-overlapper/results/000079.ovb FAILED.
--   job correction/1-overlapper/results/000080.ovb FAILED.
--   job correction/1-overlapper/results/000081.ovb FAILED.
--   job correction/1-overlapper/results/000082.ovb FAILED.
--   job correction/1-overlapper/results/000083.ovb FAILED.
--

ABORT:
ABORT: Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

I don't think there's a default time limit, but I can ask.

That was not the complete log; here's the last of it:

Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu/Linux-amd64/bin/can  Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/.Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 74.
g job 81 based on SGE_TASK_ID=1 and offset=80.
Fetch blocks/000034.dat
Fetch blocks/000035.dat

Running block 000033 in query 000081

/ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert)
   Canu snapshot v1.8 +151 changes (r9361 db9bfcc26004b26f21686f0c324ab69d37c5a885)

Running job 76 based on SGE_TASK_ID=1 and offset=75.
Fetch blocks/000029.dat
Fetch blocks/000030.dat
Fetch blocks/000031.dat
Fetch blocks/000032.dat
Fetch blocks/000033.dat
Fetch blocks/000034.dat
Fetch blocks/000035.dat

Running block 000028 in query 000076

/ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert: /usr/lib64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /ibers/ernie/home/ans74/canu/Linux-amd64/bin/mhapConvert)

Does this mean I should load a newer gcc/libstdc++ module?

skoren commented 5 years ago

The JVM issues are solved and that's running. However, it seems the machine where you compiled Canu doesn't have the same OS version/gcc libraries as the machines where you're running it. Some of your cluster nodes have a different environment. You need to build Canu on the oldest system you want it to run on.

You could also use the pre-built release binary rather than building it yourself; just make the change to canu-1.8/Linux-amd64/lib/site_perl/canu/Execution.pm in the same way you did before.
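As a quick sanity check on a compute node, something like the following shows whether its libstdc++ is new enough for the binary (GLIBCXX_3.4.21 corresponds to gcc 5 or newer):

# List the GLIBCXX symbol versions this node's libstdc++ provides;
# the failing mhapConvert needs GLIBCXX_3.4.21.
strings /usr/lib64/libstdc++.so.6 | grep GLIBCXX | sort -u
# If it is missing, use the pre-built Linux-amd64 tarball from
# https://github.com/marbl/canu/releases instead of a locally compiled binary.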

GeoMicroSoares commented 5 years ago

@skoren done but getting a weird error:

$ ./canu-1.8/Linux-amd64/bin/canu -p moncler -d /ibers/ernie/scratch/ans74/ \
    genomeSize=0.5g -nanopore-raw montcler_metagenomes/moncler_surface.fastq \
    java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
    gridEngineResourceOption="-pe multithread THREADS -l h_vmem=MEMORY" \
    merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 mhapThreads=8 \
    gridOptionsExecutive="-l h_vmem=30g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash" \
    gridOptionscormhap="-l h_vmem=30g"
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 73.

usage:   canu [-version] [-citation] \
              [-correct | -trim | -assemble | -trim-assemble] \
              [-s <assembly-specifications-file>] \
               -p <assembly-prefix> \
               -d <assembly-directory> \
               genomeSize=<number>[g|m|k] \
              [other-options] \
              [-pacbio-raw |
               -pacbio-corrected |
               -nanopore-raw |
               -nanopore-corrected] file1 file2 ...

example: canu -d run1 -p godzilla genomeSize=1g -nanopore-raw reads/*.fasta.gz 

  To restrict canu to only a specific stage, use:
    -correct       - generate corrected reads
    -trim          - generate trimmed reads
    -assemble      - generate an assembly
    -trim-assemble - generate trimmed reads and then assemble them

  The assembly is computed in the -d <assembly-directory>, with output files named
  using the -p <assembly-prefix>.  This directory is created if needed.  It is not
  possible to run multiple assemblies in the same directory.

  The genome size should be your best guess of the haploid genome size of what is being
  assembled.  It is used primarily to estimate coverage in reads, NOT as the desired
  assembly size.  Fractional values are allowed: '4.7m' equals '4700k' equals '4700000'

  Some common options:
    useGrid=string
      - Run under grid control (true), locally (false), or set up for grid control
        but don't submit any jobs (remote)
    rawErrorRate=fraction-error
      - The allowed difference in an overlap between two raw uncorrected reads.  For lower
        quality reads, use a higher number.  The defaults are 0.300 for PacBio reads and
        0.500 for Nanopore reads.
    correctedErrorRate=fraction-error
      - The allowed difference in an overlap between two corrected reads.  Assemblies of
        low coverage or data with biological differences will benefit from a slight increase
        in this.  Defaults are 0.045 for PacBio reads and 0.144 for Nanopore reads.
    gridOptions=string
      - Pass string to the command used to submit jobs to the grid.  Can be used to set
        maximum run time limits.  Should NOT be used to set memory limits; Canu will do
        that for you.
    minReadLength=number
      - Ignore reads shorter than 'number' bases long.  Default: 1000.
    minOverlapLength=number
      - Ignore read-to-read overlaps shorter than 'number' bases long.  Default: 500.
  A full list of options can be printed with '-options'.  All options can be supplied in
  an optional sepc file with the -s option.

  Reads can be either FASTA or FASTQ format, uncompressed, or compressed with gz, bz2 or xz.
  Reads are specified by the technology they were generated with, and any processing performed:
    -pacbio-raw         <files>      Reads are straight off the machine.
    -pacbio-corrected   <files>      Reads have been corrected.
    -nanopore-raw       <files>
    -nanopore-corrected <files>

Complete documentation at http://canu.readthedocs.org/en/latest/

ERROR:  Parameter 'gridEngineResourceOption' is not known.
skoren commented 5 years ago

That option was added after the 1.8 release; you need to set gridEngineThreadsOption for the threads and gridEngineMemoryOption for the h_vmem. You can always run canu -options to get a full list of supported options.
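For example, with the canu path used earlier in this thread, the grid-related options this build recognizes can be listed with:

# Show every grid-related option supported by this Canu 1.8 build
/ibers/ernie/home/ans74/canu-1.8/Linux-amd64/bin/canu -options | grep -i grid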

GeoMicroSoares commented 5 years ago

Thanks for that, I'll let it run and come back to you with news ASAP.

GeoMicroSoares commented 5 years ago

Update on this @skoren: got a new error:

...
--   job 2-correction/results/0042.cns FAILED.
--   job 2-correction/results/0043.cns FAILED.
--   job 2-correction/results/0044.cns FAILED.
--   job 2-correction/results/0045.cns FAILED.
--   job 2-correction/results/0046.cns FAILED.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

So I re-ran with gridOptionscns="-l h_vmem=30g" gridOptionscor="-l h_vmem=30g" added to the command. Now waiting for news.

GeoMicroSoares commented 5 years ago

Hi again, the new error goes as follows:

--
-- BEGIN TRIMMING
--
--
-- Overlap jobs failed, tried 2 times, giving up.
--   job trimming/1-overlapper/001/000001.ovb FAILED.
--   job trimming/1-overlapper/001/000002.ovb FAILED.
--   job trimming/1-overlapper/001/000003.ovb FAILED.
--   job trimming/1-overlapper/001/000004.ovb FAILED.
--   job trimming/1-overlapper/001/000005.ovb FAILED.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

After deleting all files from previous runs and running the following command (I tried giving memory to the other overlap and trimming jobs):

./canu-1.8/Linux-amd64/bin/canu -p moncler -d /ibers/ernie/scratch/ans74/ \
    genomeSize=0.5g -nanopore-raw montcler_metagenomes/moncler_surface.fastq \
    java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
    gridEngineThreadsOption="-pe multithread THREADS" gridEngineMemoryOption="-l h_vmem=MEMORY" \
    merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 mhapThreads=8 \
    gridOptionsExecutive="-l h_vmem=30g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash" \
    gridOptionscormhap="-l h_vmem=30g" gridOptionscns="-l h_vmem=30g" gridOptionscor="-l h_vmem=30g" \
    gridOptionscorovl="-l h_vmem=30g" gridOptionsobtovl="-l h_vmem=30g" \
    gridOptionsobtmhap="-l h_vmem=30g" gridOptionsobtmmap="-l h_vmem=30g" stopOnLowCoverage=2

Any ideas @brianwalenz @skoren ? Thanks again!

Update:

$ cat /ibers/ernie/scratch/ans74/trimming/1-overlapper/overlap.1.out
...
Running job 1 based on SGE_TASK_ID=1 and offset=0.

STRING_NUM_BITS          31
OFFSET_BITS              31
STRING_NUM_MASK          2147483647
OFFSET_MASK              2147483647
MAX_STRING_NUM           2147483647

Hash_Mask_Bits           24
bgnHashID                1
bgnHashID                104801
Max_Hash_Data_Len        320002960
Max_Hash_Load            0.800000
Kmer Length              22
Min Overlap Length       500
Max Error Rate           0.120000
Min Kmer Matches         0

Num_PThreads             8

sizeof(Hash_Bucket_t)    216
sizeof(Check_Vector_t)   4
sizeof(Hash_Frag_Info_t) 4

HASH_TABLE_SIZE          16777216

hash table size:         3456 MB
hash check array         64 MB
string info              0 MB
string start             0 MB

Initializing 8 work areas.
libgomp: Thread creation failed: Resource temporarily unavailable

Current command:

./canu-1.8/Linux-amd64/bin/canu -p moncler -d /ibers/ernie/scratch/ans74/ \
    genomeSize=0.5g -nanopore-raw montcler_metagenomes/moncler_surface.fastq \
    java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
    gridEngineThreadsOption="-pe multithread THREADS" gridEngineMemoryOption="-l h_vmem=MEMORY" \
    merylMemory=10 merylThreads=1 cormhapMemory=10 cormhapThreads=1 mhapThreads=8 \
    gridOptionsExecutive="-l h_vmem=30g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash" \
    gridOptionscormhap="-l h_vmem=30g" gridOptionscns="-l h_vmem=30g" gridOptionscor="-l h_vmem=30g" \
    gridOptionscorovl="-l h_vmem=30g" gridOptionsobtovl="-l h_vmem=30g" \
    gridOptionsobtmhap="-l h_vmem=30g" gridOptionsobtmmap="-l h_vmem=30g" \
    gridOptionsovb="-l h_vmem=40g" gridOptionsovs="-l h_vmem=30g" \
    ovbThreads=1 ovbMemory=30 ovsThreads=1 ovsMemory=30 stopOnLowCoverage=2

brianwalenz commented 5 years ago

You could be hitting system/admin-imposed limits on the number of processes or threads per user (e.g., ulimit). Googling the error suggests that it could also be a limit on the stack size.

Edit overlap.sh to add something like this to the start (anywhere works, but there is 'space' left for you after paths are reported and environment variables are set).

echo "Before."
ulimit -a
ulimit -s
ulimit -u
echo "After."
ulimit -a

ulimit -a just reports the current limits. -s and -u try to increase the limit on stack size and number of processes, respectively.

Docs on ulimit: https://ss64.com/bash/ulimit.html

The program hasn't done any significant memory allocations yet, so I don't think that's the problem, but do check the submit commands just in case.
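If it helps, a throwaway probe job along these lines (a sketch reusing the submit options already used in this thread) will show what limits the compute nodes actually impose, independent of the head node:

# Submit a one-off job that just prints its limits; compare limits_probe.out
# with the output of ulimit -a on the login node.
echo 'ulimit -a; ulimit -s; ulimit -u' | \
  qsub -S /bin/bash -pe multithread 8 -cwd -j y -N limits_probe -o limits_probe.out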

GeoMicroSoares commented 5 years ago

Hi there @brianwalenz, I managed to get past this with:

./canu-1.8/Linux-amd64/bin/canu -p moncler -d /ibers/ernie/scratch/ans74/ \
        genomeSize=0.5g stopOnLowCoverage=0 \
        -nanopore-raw montcler_metagenomes/moncler_surface.fastq \
        corMinCoverage=0 corOutCoverage=all corMhapSensitivity=high \
        'correctedErrorRate=0.105' corMaxEvidenceCoverageLocal=10 corMaxEvidenceCoverageGlobal=10 oeaMemory=32 \
        'redMemory=32' batMemory=200 \
        java="/cm/shared/apps/java/jdk1.8.0_40/bin/java" \
        gridEngineThreadsOption="-pe multithread THREADS" gridEngineMemoryOption="-l h_vmem=MEMORY" \
        gridOptionsExecutive="-l h_vmem=30g" gridEngineArrayMaxJobs=5 gridOptions="-S /bin/bash" \
        merylMemory=10 merylThreads=1 \
        cormhapMemory=10 cormhapThreads=1 gridOptionscormhap="-l h_vmem=30g" \
        mhapThreads=8 \
        gridOptionscns="-l h_vmem=30g" \
        gridOptionscor="-l h_vmem=30g" \
        gridOptionscorovl="-l h_vmem=0.5g" corovlThreads=1 corovlMemory=0.5 \
        gridOptionsobtovl="-l h_vmem=10g" obtovlThreads=1 obtovlMemory=10 \
        gridOptionsutgovl="-l h_vmem=10g" utgovlThreads=1 utgovlMemory=10 \
        gridOptionsobtmhap="-l h_vmem=30g" \
        gridOptionsobtmmap="-l h_vmem=30g" \
        gridOptionsovb="-l h_vmem=1g" ovbThreads=1 ovbMemory=1 \
        gridOptionsovs="-l h_vmem=1g" ovsThreads=1 ovsMemory=1 \
        gfaMemory=50g

I am now getting the following out of canu.out though, without much explanation:

-- BEGIN ASSEMBLY
--
--
-- Graph alignment jobs failed, tried 2 times, giving up.
--

ABORT:
ABORT: Canu 1.8
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

Important files that may be useful to troubleshoot below:

$ tail /ibers/ernie/scratch/ans74/unitigging/4-unitigger/alignGFA.1.out
Found perl:
   /usr/bin/perl

Found java:
   /cm/shared/apps/java/jdk1.8.0_40/bin/java
   java version "1.8.0_40"

Found canu:
   /ibers/ernie/home/ans74/canu-1.8/Linux-amd64/bin/canu
Use of implicit split to @_ is deprecated at /ibers/ernie/home/ans74/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/Grid_Cloud.pm line 73.
   Canu 1.8

GFA alignments failed.
$ tail /ibers/ernie/scratch/ans74/unitigging/4-unitigger/unitigger.err -n50

==> BREAK REPEATS.

computeErrorProfiles()-- Computing error profiles for 30040 tigs, with 16 threads.
computeErrorProfiles()-- Finished.
markRepeatReads()-- Found 979 confused edges.

==> CLEANUP MISTAKES.

splitDiscontinuous()-- Tested 12238 tigs, split 7 tigs into 15 new tigs.
promoteToSingleton()-- Moved 79876 unplaced reads to singleton tigs.
dropDead()-- Modified 281 tigs.  Dropped 129 first and 156 last reads, 4 tigs had both reads dropped.
splitDiscontinuous()-- Tested 12257 tigs, split 15 tigs into 31 new tigs.
promoteToSingleton()-- Moved 0 unplaced reads to singleton tigs.

==> CLEANUP GRAPH.

AssemblyGraph()-- rebuilding
AssemblyGraph()-- building reverse edges.
AssemblyGraph()-- rebuild complete.
AssemblyGraph()-- filtering edges
AssemblyGraph()-- 141846 contig edges and 0 unitig edges.
AssemblyGraph()-- 0 bubble edges and 0 repeat edges.
AssemblyGraph()-- 26850 middle contig edges filtered from 15238 reads.
AssemblyGraph()-- 0 repeat end edges filtered from 0 reads.
AssemblyGraph()-- 26850 repeat edges (not output).
AssemblyGraph()-- 2463 bubble edges.
AssemblyGraph()-- 2463 intersecting edges (from the end of a tig to somewhere else).

==> GENERATE OUTPUTS.

AssemblyGraph()-- generating './moncler.final.assembly.gfa'.
AssemblyGraph()-- Found 0 edges to unassembled contigs.
AssemblyGraph()--        0 bubble placements
AssemblyGraph()--        0 repeat placements

AssemblyGraph()-- Intratig edges:            0 contained         0 5'         0 3' (in both contig and unitig)
AssemblyGraph()-- Contig only edges:     28453 contained    109482 5'    109286 3'
AssemblyGraph()-- Unitig only edges:         0 contained         0 5'         0 3'
AssemblyGraph()-- Intercontig edges:         0 contained      1237 5'      1226 3' (in neither contig nor unitig)
AssemblyGraph()-- generating './moncler.contigs.gfa'.

==> GENERATE UNITIGS.

computeErrorProfiles()-- Computing error profiles for 110849 tigs, with 16 threads.
computeErrorProfiles()-- Finished.
splitDiscontinuous()-- Tested 8966 tigs, split 17 tigs into 34 new tigs.
AssemblyGraph()-- generating './moncler.unitigs.gfa'.

Bye.

Any thoughts or help? Thanks!

skoren commented 5 years ago

Probably insufficient memory, not related to the log you posted. Either edit alignGFA.jobSubmit-01.sh to increase the memory request and run it by hand, or add gridOptionsgfa="-l h_vmem=30g" to the canu options.
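A sketch of the by-hand route, assuming the generated submit script carries an h_vmem request in the same form as the mhap one shown earlier (the sed pattern is a guess at that format; paths are from this thread):

cd /ibers/ernie/scratch/ans74/unitigging/4-unitigger
# Bump whatever h_vmem value the generated submit script requests, then resubmit it
sed -i 's/h_vmem=[0-9.]*[mg]/h_vmem=30g/g' alignGFA.jobSubmit-01.sh
sh alignGFA.jobSubmit-01.sh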

GeoMicroSoares commented 5 years ago

Thanks so much, I just managed to finish running this! I had to push it to 200 GB RAM, but it finished :) Thanks for all the help! Closing this issue now!