marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

No grid engine detected, grid disabled & running out of memory #331

Closed: mictadlo closed this issue 7 years ago

mictadlo commented 7 years ago

Hi, with the command below I got "No grid engine detected, grid disabled" and ran out of memory. I tried to run it on a PBS Pro cluster on SUSE 12.

canu  -p fruit  -d fruit  genomeSize=1.8g  -pacbio-raw SeQ_8Banana.fastq errorRate=0.013 java=/work/waterhouse_team/miniconda2/bin/java gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot
-- Canu v1.4 (+0 commits) r7995 7b04cd09002d6b865ca05f4a3f53edb936b5c925.
-- Detected Java(TM) Runtime Environment '1.8.0_92' (from '/work/waterhouse_team/miniconda2/bin/java').
-- Detected gnuplot version '5.0 patchlevel 4' (from '/work/waterhouse_team/miniconda2/bin/gnuplot') and image format 'png'.
-- Detected 48 CPUs and 252 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   84 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and    8 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   84 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    4 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    5 GB memory for stage 'overlapper'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run   2 jobs concurrently, and use up to  24 compute threads and  126 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run  12 jobs concurrently, and use up to   4 compute threads and   21 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 1800000000
--   errorRate           -- 0.013
--   
--   corOvlErrorRate     -- 0.039
--   obtOvlErrorRate     -- 0.039
--   utgOvlErrorRate     -- 0.039
--   
--   obtErrorRate        -- 0.039
--   
--   cnsErrorRate        -- 0.039
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
----------------------------------------
-- Starting command on Tue Jan 10 08:54:05 2017 with 1522811.347 GB free disk space

    /lustre/work-lustre/waterhouse_team/apps/canu-1.4/Linux-amd64/bin/meryl \
      -Dh \
      -s /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16 \
    > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram \
    2> /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram.info

-- Finished on Tue Jan 10 08:54:06 2017 (1 second) with 1522811.347 GB free disk space
----------------------------------------
-- For mhap overlapping, set repeat k-mer threshold to 289264.
--
-- Found 28926487620 16-mers; 2117441602 distinct and 85373448 unique.  Largest count 13633346.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 16.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73
--
-- Given 32 GB, can fit 48000 reads per block.
-- For 79 blocks, set stride to 19 blocks.
-- Logging partitioning to '/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/partitioning.log'.
-- Configured 78 mhap precompute jobs.
-- Configured 196 mhap overlap jobs.
-- mhap precompute attempt 1 begins with 0 finished, and 78 to compute.
----------------------------------------
-- Starting concurrent execution on Tue Jan 10 08:55:52 2017 with 1522808.694 GB free disk space (78 processes; 3 concurrently)

    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 1 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000001.out 2>&1
    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 2 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000002.out 2>&1
    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 3 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000003.out 2>&1
=>> PBS: job killed: mem 4831268kb exceeded limit 4194304kb

-----
PBS Job 1518849.pbs
CPU time  : 00:07:24
Wall time : 00:07:38
Mem usage : 4831268kb

How can I fix the grid engine detection and the memory limit?

Thank you in advance.

Michal

skoren commented 7 years ago

It looks like you were running on a machine with 252 GB of RAM but had only requested 4 GB. Did you run Canu on a compute node or on the head node? If it was a compute node, what was the submit command? Typically Canu handles the submission for you: you launch it on the head node and it submits itself automatically. The grid is detected by running common grid polling commands (pbsnodes in the case of PBS). What do you get if you run pbsnodes --version on a compute node? It is likely that on your grid the compute jobs aren't allowed to submit jobs, which is a requirement for Canu.
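For example, a minimal test job along these lines (the script name and walltime are hypothetical; adjust for your site) would show whether the PBS client tools are usable from a compute node:

    #!/bin/bash
    # check-pbs.sh (hypothetical) -- submit with: qsub -l walltime=00:05:00 check-pbs.sh
    echo "running on $(hostname)"
    which pbsnodes qsub        # both must be on PATH for Canu to detect PBS
    pbsnodes --version         # should print the PBS version, not an error
    qsub --version             # confirms the submission client is present on the node

If these commands fail inside the job but work on the head node, that would explain why Canu reports "grid disabled" once it resubmits itself.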

You can run and submit the steps manually or run on a single machine (see issue #288). When Canu runs on a single machine it will try to use all detected memory, so you could reserve one large node and run Canu there without the grid, as sketched below.
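A single-node submission could look roughly like this (a sketch only: the ppn/mem values assume one of your 48-core, 252 GB nodes, and useGrid=false just makes the local, no-grid mode explicit):

#!/bin/bash -l
#PBS -N canu_fruit_local
#PBS -l nodes=1:ppn=48,mem=250gb,walltime=96:00:00

cd $PBS_O_WORKDIR

# useGrid=false keeps every Canu stage on this node instead of submitting sub-jobs
canu -p fruit -d fruit-auto genomeSize=1.8g useGrid=false \
     -pacbio-raw SeQ_8Banana.fastq errorRate=0.013 \
     java=/work/waterhouse_team/miniconda2/bin/java \
     gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot

With the whole node reserved, the 48 CPUs and 252 GB that Canu detects are actually available to it, and PBS will not kill the job for exceeding a small memory request.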

mictadlo commented 7 years ago

Hi, the canu command was embedded in the PBS script below:

#!/bin/bash -l
#PBS -N QUT_CanuT1
#PBS -j oe
#PBS -W umask=0027
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=1,mem=60gb

cd $PBS_O_WORKDIR

canu  -p fruit  -d fruit-auto  genomeSize=1.8g  -pacbio-raw SeQ_8Banana.fastq errorRate=0.013 java=/work/waterhouse_team/miniconda2/bin/java gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot 

I submitted the above script with qsub fruit.pbs on the head node, and qstat -u lorencm showed me the jobs below:

                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1518788.pbs     lorencm  quick    meryl_frui  29132   1  12  126gb 01:00 R 00:00
1518789.pbs     lorencm  quick    canu_fruit    --    1   1 4096mb 01:00 H   -- 

I do not understand where the 4 GB limit came from, and I think the compute jobs on our grid are allowed to submit jobs. Is that correct, @hodgett?

We would like to get canu running automatically.

What did I miss?

Thank you in advance.

Michal

mictadlo commented 7 years ago

P.S. Yes, our compute nodes are allowed to submit jobs.

skoren commented 7 years ago

Interesting; are you sure all of your nodes are allowed to submit? It looks like Canu started running and the first node was able to submit jobs (1518788). That is where the 4 GB limit came from: it is how much memory the Canu executive reserves for itself.

However, the next job (1518789) apparently couldn't run pbsnodes. If you can, find out which machine that job landed on and test whether pbsnodes works there; see the example below. Are there any options required on your grid to allow jobs to submit other jobs? You can see the exact submit command used for job 1518789 in canu-scripts/canu.01.out.
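For example (the exact commands can differ between PBS Pro and Torque, and tracejob may require extra permissions on some sites):

    qstat -f 1518789 | grep exec_host   # node(s) a queued or running job is assigned to
    tracejob 1518789                    # server-side record for a job that has already finished
    # then, on that node (or in an interactive job placed on it):
    pbsnodes --version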

Also, the options on your initial Canu submission will be lost when Canu resubmits itself; if you want Canu to keep using them, you need to add them as gridEngineOptions: "gridEngineOptions=-W umask=0027 -l walltime=72:00:00"

mictadlo commented 7 years ago

Now I got ERROR: Paramter 'gridEngineOptions' is not known.

skoren commented 7 years ago

It should be gridOptions, sorry.

Did you check which machine the second job was scheduled on and whether pbsnodes works properly there? Adding gridOptions won't fix the error unless one of the options is required to make pbsnodes work.

mictadlo commented 7 years ago

I changed it, but I still got =>> PBS: job killed: mem 4428464kb exceeded limit 4194304kb with the PBS script below:

#!/bin/bash -l
#PBS -N QUT_CanuT1
#PBS -j oe
#PBS -l nodes=1:ppn=4,walltime=96:00:00,mem=25gb 
#PBS -W umask=0007

cd $PBS_O_WORKDIR

canu -p fruit -d fruit-auto genomeSize=1.8g -pacbio-raw SeQ_8fruit.fastq errorRate=0.013 "gridOptions=-W umask=0027 -l walltime=72:00:00" java=/work/waterhouse_team/miniconda2/bin/java gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot

and this is the log:

-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_92' (from '/work/waterhouse_team/miniconda2/bin/java').
-- Detected gnuplot version '5.0 patchlevel 4' (from '/work/waterhouse_team/miniconda2/bin/gnuplot') and image format 'png'.
-- Detected 48 CPUs and 252 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   84 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and    8 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   84 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    4 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and   32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run  48 jobs concurrently, and use up to   1 compute thread  and    5 GB memory for stage 'overlapper'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run   6 jobs concurrently, and use up to   8 compute threads and   12 GB memory for stage 'overlapper'.
-- Allowed to run   2 jobs concurrently, and use up to  24 compute threads and  126 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run  12 jobs concurrently, and use up to   4 compute threads and   21 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run   3 jobs concurrently, and use up to  16 compute threads and   32 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--   
--   genomeSize          -- 1800000000
--   errorRate           -- 0.013
--   
--   corOvlErrorRate     -- 0.039
--   obtOvlErrorRate     -- 0.039
--   utgOvlErrorRate     -- 0.039
--   
--   obtErrorRate        -- 0.039
--   
--   cnsErrorRate        -- 0.039
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
----------------------------------------
-- Starting command on Wed Jan 11 12:57:22 2017 with 1517721.815 GB free disk space

    /lustre/work-lustre/waterhouse_team/miniconda2/libexec/meryl \
      -Dh \
      -s /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16 \
    > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram \
    2> /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram.info

-- Finished on Wed Jan 11 12:57:23 2017 (1 second) with 1517721.676 GB free disk space
----------------------------------------
-- For mhap overlapping, set repeat k-mer threshold to 289264.
--
-- Found 28926487620 16-mers; 2117441602 distinct and 85373448 unique.  Largest count 13633346.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 16.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73
--
-- Given 32 GB, can fit 48000 reads per block.
-- For 79 blocks, set stride to 19 blocks.
-- Logging partitioning to '/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/partitioning.log'.
-- Configured 78 mhap precompute jobs.
-- Configured 196 mhap overlap jobs.
-- mhap precompute attempt 1 begins with 0 finished, and 78 to compute.
----------------------------------------
-- Starting concurrent execution on Wed Jan 11 12:59:12 2017 with 1517714.799 GB free disk space (78 processes; 3 concurrently)

    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 1 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000001.out 2>&1
    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 2 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000002.out 2>&1
    /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 3 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000003.out 2>&1
=>> PBS: job killed: mem 4428464kb exceeded limit 4194304kb

-----
PBS Job 1535455.pbs
CPU time  : 00:07:16
Wall time : 00:07:29
Mem usage : 4428464kb

Any idea what I might have missed?

Thank you in advance.

Michal

brianwalenz commented 7 years ago

This might be caused by submitting canu to the grid instead of running canu on the head node and letting it do the submission. Try running the same canu command directly on the head node; it is lightweight, just probing the grid, checking for input files, and submitting itself.
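That is, the same command your PBS script wraps, typed at the head-node prompt (a sketch; the gridOptions string is the one suggested earlier in this thread and is optional):

    # run on the head node, outside of any PBS job; Canu probes PBS and submits its own jobs
    canu -p fruit -d fruit-auto genomeSize=1.8g \
      "gridOptions=-W umask=0027 -l walltime=72:00:00" \
      -pacbio-raw SeQ_8fruit.fastq errorRate=0.013 \
      java=/work/waterhouse_team/miniconda2/bin/java \
      gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot
    # the command returns quickly; progress is then logged under fruit-auto/canu-scripts/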

If that doesn't work, then, as Sergey pointed out, the compute nodes are failing to find PBS commands (pbsnodes in particular) and without those, it won't be able to use PBS. The best you can do is request an entire node and let canu run there (in fact, this is what canu is doing - it doesn't find PBS, so it instead uses all 48 cores and 250 gb memory on the node).