Closed mictadlo closed 7 years ago
It looks like you were running on a machine with 252 GB of RAM but had only requested 4 GB. Did you run Canu on a compute node or on the head node? If it was on a compute node, what was the submit command? Typically Canu handles the submission for you: you launch it on the head node and it submits itself automatically. The grid is detected by running common grid polling commands (pbsnodes in the case of PBS). What do you get if you run pbsnodes --version on a compute node? It is likely that on your grid the compute nodes aren't allowed to submit jobs, which is a requirement for Canu.
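As a quick check, you can get a shell on a compute node and verify that the PBS client commands are visible there and that the node can submit jobs. This is only a sketch; the interactive-job options and resource syntax below are typical PBS usage and may need adjusting for your site:

```shell
# Start an interactive job to land on a compute node.
qsub -I -l nodes=1:ppn=1,mem=1gb,walltime=00:10:00

# On that node, check that the polling command Canu relies on is on the PATH:
pbsnodes --version

# ...and that job submission works from a compute node at all:
echo '/bin/true' | qsub -l walltime=00:01:00
```

If pbsnodes is missing or the qsub fails, Canu's grid detection and self-submission will fail on that node.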
You can run and submit the steps manually, or run everything on a single machine (see issue #288). When Canu runs on a single machine it will try to use all detected memory, so you could reserve a single large instance and run Canu without the grid.
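A single-machine run can also be forced explicitly. This is a sketch using the real Canu options useGrid (disables grid submission) and maxMemory/maxThreads (cap what Canu uses on the node); the resource values here are illustrative, sized for a 48-core, 252 GB node:

```shell
# Run Canu entirely on the local machine, no grid submission.
canu -p fruit -d fruit-auto \
  genomeSize=1.8g errorRate=0.013 \
  useGrid=false maxMemory=250 maxThreads=48 \
  -pacbio-raw SeQ_8fruit.fastq
```

Launched inside a whole-node reservation, this sidesteps the grid-detection problem entirely.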
Hi, the canu command was inserted in the below PBS script:
#!/bin/bash -l
#PBS -N QUT_CanuT1
#PBS -j oe
#PBS -W umask=0027
#PBS -l walltime=72:00:00
#PBS -l nodes=1:ppn=1,mem=60gb
cd $PBS_O_WORKDIR
canu -p fruit -d fruit-auto genomeSize=1.8g -pacbio-raw SeQ_8Banana.fastq errorRate=0.013 java=/work/waterhouse_team/miniconda2/bin/java gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot
I submitted the above script with qsub fruit.pbs
on the head node, and qstat -u lorencm
showed me the jobs below:
Req'd Req'd Elap
Job ID Username Queue Jobname SessID NDS TSK Memory Time S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
1518788.pbs lorencm quick meryl_frui 29132 1 12 126gb 01:00 R 00:00
1518789.pbs lorencm quick canu_fruit -- 1 1 4096mb 01:00 H --
I do not understand where the 4 GB came from, and I think the compute jobs on our grid are allowed to submit jobs. Is that correct, @hodgett?
We would like to be able to get canu automatically running.
What did I miss?
Thank you in advance.
Michal
P.S. Yes, our compute nodes are allowed to submit jobs.
Interesting, are you sure all your nodes are allowed to submit? It looks like Canu started running and the first node was able to submit jobs (1518788). That is also where the 4 GB limit came from: it is how much memory the Canu executive reserves for itself.
However, the next job (1518789) couldn't run pbsnodes. Can you find out which machine that job landed on and test whether pbsnodes works there? Are there any options required on your grid to allow jobs to submit? You can see the exact submit command used for job 1518789 in canu-scripts/canu.01.out.
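One way to find where a job ran is via standard PBS commands. This is a sketch: exec_host is the standard PBS job attribute naming the execution host, but whether finished jobs are still queryable (and whether ssh to compute nodes is permitted) depends on your site; <node-name> is a placeholder:

```shell
# Show the full job record, including the host(s) it executed on.
# (-x includes finished jobs on PBS Pro.)
qstat -fx 1518789 | grep exec_host

# Alternatively, search the server/accounting logs for the job:
tracejob 1518789

# Then get a shell on that host and confirm the PBS tools are visible:
ssh <node-name> 'command -v pbsnodes && pbsnodes --version'
```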
Also, your options to the initial Canu command will be lost by the next submission, if you want Canu to use those you need to add them as gridEngineOptions:
"gridEngineOptions=-W umask=0027 -l walltime=72:00:00"
Now I got ERROR: Parameter 'gridEngineOptions' is not known.
It should be gridOptions, sorry.
Did you check which machine the second job was scheduled on and whether pbsnodes works properly there? Adding gridOptions won't fix the error unless one of the options is required to make pbsnodes work.
I changed it, but I still got =>> PBS: job killed: mem 4428464kb exceeded limit 4194304kb
with the below PBS script:
#!/bin/bash -l
#PBS -N QUT_CanuT1
#PBS -j oe
#PBS -l nodes=1:ppn=4,walltime=96:00:00,mem=25gb
#PBS -W umask=0007
cd $PBS_O_WORKDIR
canu -p fruit -d fruit-auto genomeSize=1.8g -pacbio-raw SeQ_8fruit.fastq errorRate=0.013 "gridOptions=-W umask=0027 -l walltime=72:00:00" java=/work/waterhouse_team/miniconda2/bin/java gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot
and this is the log:
-- Canu v0.0 (+0 commits) r0 unknown-hash-tag-no-repository-available.
-- Detected Java(TM) Runtime Environment '1.8.0_92' (from '/work/waterhouse_team/miniconda2/bin/java').
-- Detected gnuplot version '5.0 patchlevel 4' (from '/work/waterhouse_team/miniconda2/bin/gnuplot') and image format 'png'.
-- Detected 48 CPUs and 252 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 84 GB memory for stage 'bogart (unitigger)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'mhap (overlapper)'.
-- Allowed to run 6 jobs concurrently, and use up to 8 compute threads and 8 GB memory for stage 'read error detection (overlap error adjustment)'.
-- Allowed to run 48 jobs concurrently, and use up to 1 compute thread and 2 GB memory for stage 'overlap error adjustment'.
-- Allowed to run 6 jobs concurrently, and use up to 8 compute threads and 84 GB memory for stage 'utgcns (consensus)'.
-- Allowed to run 48 jobs concurrently, and use up to 1 compute thread and 4 GB memory for stage 'overlap store parallel bucketizer'.
-- Allowed to run 48 jobs concurrently, and use up to 1 compute thread and 32 GB memory for stage 'overlap store parallel sorting'.
-- Allowed to run 48 jobs concurrently, and use up to 1 compute thread and 5 GB memory for stage 'overlapper'.
-- Allowed to run 6 jobs concurrently, and use up to 8 compute threads and 12 GB memory for stage 'overlapper'.
-- Allowed to run 6 jobs concurrently, and use up to 8 compute threads and 12 GB memory for stage 'overlapper'.
-- Allowed to run 2 jobs concurrently, and use up to 24 compute threads and 126 GB memory for stage 'meryl (k-mer counting)'.
-- Allowed to run 12 jobs concurrently, and use up to 4 compute threads and 21 GB memory for stage 'falcon_sense (read correction)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'minimap (overlapper)'.
-- Allowed to run 3 jobs concurrently, and use up to 16 compute threads and 32 GB memory for stage 'minimap (overlapper)'.
--
-- This is canu parallel iteration #2, out of a maximum of 2 attempts.
--
-- Final error rates before starting pipeline:
--
-- genomeSize -- 1800000000
-- errorRate -- 0.013
--
-- corOvlErrorRate -- 0.039
-- obtOvlErrorRate -- 0.039
-- utgOvlErrorRate -- 0.039
--
-- obtErrorRate -- 0.039
--
-- cnsErrorRate -- 0.039
--
--
-- BEGIN CORRECTION
--
-- Meryl finished successfully.
----------------------------------------
-- Starting command on Wed Jan 11 12:57:22 2017 with 1517721.815 GB free disk space
/lustre/work-lustre/waterhouse_team/miniconda2/libexec/meryl \
-Dh \
-s /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16 \
> /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram \
2> /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/0-mercounts/fruit.ms16.histogram.info
-- Finished on Wed Jan 11 12:57:23 2017 (1 second) with 1517721.676 GB free disk space
----------------------------------------
-- For mhap overlapping, set repeat k-mer threshold to 289264.
--
-- Found 28926487620 16-mers; 2117441602 distinct and 85373448 unique. Largest count 13633346.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 16.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73
--
-- Given 32 GB, can fit 48000 reads per block.
-- For 79 blocks, set stride to 19 blocks.
-- Logging partitioning to '/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/partitioning.log'.
-- Configured 78 mhap precompute jobs.
-- Configured 196 mhap overlap jobs.
-- mhap precompute attempt 1 begins with 0 finished, and 78 to compute.
----------------------------------------
-- Starting concurrent execution on Wed Jan 11 12:59:12 2017 with 1517714.799 GB free disk space (78 processes; 3 concurrently)
/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 1 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000001.out 2>&1
/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 2 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000002.out 2>&1
/work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.sh 3 > /work/waterhouse_team/CANU_Test_20170106_B/fruit-auto/correction/1-overlapper/precompute.000003.out 2>&1
=>> PBS: job killed: mem 4428464kb exceeded limit 4194304kb
-----
PBS Job 1535455.pbs
CPU time : 00:07:16
Wall time : 00:07:29
Mem usage : 4428464kb
Any idea what I could have missed?
Thank you in advance.
Michal
This might be caused by submitting canu to the grid instead of running canu on the head node and letting it do the submission. Try running the same canu command directly on the head node; it's lightweight, just probing the grid, checking for input files, and submitting itself.
If that doesn't work then, as Sergey pointed out, the compute nodes are failing to find the PBS commands (pbsnodes in particular), and without those Canu can't use PBS. The best you can do is request an entire node and let canu run there (in fact, this is what canu is already doing: it doesn't find PBS, so it uses all 48 cores and 250 GB of memory on the node).
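A whole-node submission for that fallback might look like the sketch below, reusing the paths and file names from the scripts above. The select=1:ncpus=48:mem=250gb form is PBS Pro syntax (on Torque you would use nodes=1:ppn=48,mem=250gb instead), and useGrid=false is the real Canu option that stops it from trying to submit grid jobs:

```shell
#!/bin/bash -l
#PBS -N QUT_CanuT1
#PBS -j oe
#PBS -l walltime=96:00:00
#PBS -l select=1:ncpus=48:mem=250gb   # reserve the entire node
cd $PBS_O_WORKDIR

# With useGrid=false, Canu runs every stage locally inside this reservation.
canu -p fruit -d fruit-auto genomeSize=1.8g \
  useGrid=false \
  -pacbio-raw SeQ_8fruit.fastq \
  java=/work/waterhouse_team/miniconda2/bin/java \
  gnuplot=/work/waterhouse_team/miniconda2/bin/gnuplot
```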
Hi, with the below command I got
No grid engine detected, grid disabled
and I ran out of memory. I tried to run it on a PBS Pro cluster on SUSE 12. How can I fix the grid engine detection and the memory amount?
Thank you in advance.
Michal