marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Troubles running canu on 2 or more nodes #1416

Closed Malabady closed 5 years ago

Malabady commented 5 years ago

Hello Sergey:

We are having some difficulty getting a Canu job to dispatch on more than one node using the grid options. I am assembling a large genome with over 100X coverage of Sequel II data.

When I run Canu on one node (28 cores and 500 GB) with "useGrid=false", the run starts but dies before finishing the correction stage, with no error message that is obvious to me. So I assumed it might be a resource issue. Here is the code:

#PBS -S /bin/bash
#PBS -q ggbc_q
#PBS -N canu_01
#PBS -l nodes=1:ppn=28
#PBS -l walltime=30:00:00:00
#PBS -l pmem=400gb
module load gnuplot/5.2.2-foss-2018a
module load canu/1.8-Linux-amd64
cd $(pwd)

canu -p run -d rosea2 genomeSize=3.6g  -pacbio-raw ../raw_data/*.subreads.fastq.gz \
corOutCoverage=200 correctedErrorRate=0.05 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50"   useGrid=false

I tried to submit the run to two nodes (28 cores and 512 GB per node) with the grid option enabled ("useGrid=true"), but the run never dispatched even though the target nodes were available.

I added more grid-related options to the command line and submitted it to the queue again, but it still never started even though the target nodes were available. Here is the code:

#PBS -S /bin/bash
#PBS -q ggbc_q
#PBS -N canu_01
#PBS -l nodes=2:ppn=24
#PBS -l walltime=14:00:00:00
#PBS -l pmem=17gb
#PBS -m ae

ml gnuplot/5.2.2-foss-2018a
ml canu/1.8-Linux-amd64

cd $(pwd)

canu -p run -d rosea2 genomeSize=3.6g  -pacbio-raw ../raw_data/*.subreads.fastq.gz \
corOutCoverage=200 correctedErrorRate=0.05 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
useGrid=true \
gridEngine=pbs \
gridEngineThreadsOption="-l nodes=2:ppn=THREADS" \
gridOptions="-q batch -l walltime=12:00:00:00" \
gridEngineMemoryOption="-l mem=MEMORY"

Finally, I started an interactive session on the cluster on two nodes (28 cores and 500 GB each) and ran the above script interactively. It started working but finished within a couple of hours, before even completing the correction stage, with no clear error message. Here is the report:

-- Canu 1.8
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
--
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https://doi.org/10.1038/nbt.4277
--
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
--
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_144' (from '/usr/local/apps/eb/Java/1.8.0_144/bin/java') with -d64 support.
-- Detected gnuplot version '5.2 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 28 CPUs and 503 gigabytes of memory.
-- Detecting PBS/Torque resources.
--
-- Found   1 host  with  12 cores and   78 GB memory under PBS/Torque control.
-- Found   5 hosts with  64 cores and  125 GB memory under PBS/Torque control.
-- Found   4 hosts with  16 cores and  125 GB memory under PBS/Torque control.
-- Found   3 hosts with  12 cores and   94 GB memory under PBS/Torque control.
-- Found  17 hosts with  28 cores and  251 GB memory under PBS/Torque control.
-- Found   1 host  with  48 cores and   92 GB memory under PBS/Torque control.
-- Found   6 hosts with  28 cores and  503 GB memory under PBS/Torque control.
-- Found  10 hosts with  24 cores and  125 GB memory under PBS/Torque control.
-- Found   2 hosts with  32 cores and  376 GB memory under PBS/Torque control.
-- Found  12 hosts with  32 cores and   92 GB memory under PBS/Torque control.
-- Found  58 hosts with  28 cores and   62 GB memory under PBS/Torque control.
-- Found  20 hosts with  48 cores and  251 GB memory under PBS/Torque control.
-- Found  10 hosts with  32 cores and  503 GB memory under PBS/Torque control.
-- Found   8 hosts with  12 cores and  251 GB memory under PBS/Torque control.
-- Found  48 hosts with  32 cores and  187 GB memory under PBS/Torque control.
-- Found  16 hosts with  32 cores and  125 GB memory under PBS/Torque control.
-- Found   2 hosts with  28 cores and  187 GB memory under PBS/Torque control.
-- Found 127 hosts with  48 cores and  125 GB memory under PBS/Torque control.
-- Found   1 host  with  48 cores and  995 GB memory under PBS/Torque control.
-- Found   7 hosts with  28 cores and  125 GB memory under PBS/Torque control.
-- Found   4 hosts with  28 cores and 1007 GB memory under PBS/Torque control.
-- Found   6 hosts with  48 cores and  503 GB memory under PBS/Torque control.
-- Found   1 host  with  48 cores and  117 GB memory under PBS/Torque control.
-- Found   3 hosts with  32 cores and  251 GB memory under PBS/Torque control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     25 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   17 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    24 GB   12 CPUs  (overlap detection)
-- Grid:  utgovl    24 GB   12 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        8 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      512 GB   32 CPUs  (contig construction with bogart)
-- Grid:  gfa       32 GB   32 CPUs  (GFA alignment and processing)
--
-- Found PacBio uncorrected reads in the input files.
--
-- Generating assembly 'run' in '/scratch/malabady/PitcherGenome/PitchPacBio/canu_assembly/rosea3'
--
-- Parameters:
--
--  genomeSize        3600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0500 (  5.00%)
--    utgOvlErrorRate 0.0500 (  5.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0500 (  5.00%)
--    utgErrorRate    0.0500 (  5.00%)
--    cnsErrorRate    0.0500 (  5.00%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Tue Jul 16 09:54:14 2019 with 568298.629 GB free disk space

    cd .
    /usr/local/apps/eb/canu/1.8-Linux-amd64/bin/sqStoreCreate \
      -o ./run.seqStore.BUILDING \
      -minlength 1000 \
      ./run.seqStore.ssi \
    > ./run.seqStore.BUILDING.err 2>&1

-- Finished on Tue Jul 16 11:19:42 2019 (5128 seconds) with 568554.755 GB free disk space
----------------------------------------
--
-- In sequence store './run.seqStore':
--   Found 22040437 reads.
--   Found 421041136151 bases (116.95 times coverage).
--
--   Read length histogram (one '*' equals 62788.12 reads):
--        0   4999 4395169 **********************************************************************
--     5000   9999 3981572 ***************************************************************
--    10000  14999 2982785 ***********************************************
--    15000  19999 2331024 *************************************
--    20000  24999 1843113 *****************************
--    25000  29999 1479309 ***********************
--    30000  34999 1238475 *******************
--    35000  39999 1055912 ****************
--    40000  44999 868544 *************
--    45000  49999 663686 **********
--    50000  54999 464585 *******
--    55000  59999 302441 ****
--    60000  64999 184705 **
--    65000  69999 107433 *
--    70000  74999  60342
--    75000  79999  33628
--    80000  84999  18654
--    85000  89999  10691
--    90000  94999   6342
--    95000  99999   3960
--   100000 104999   2726
--   105000 109999   1816
--   110000 114999   1201
--   115000 119999    800
--   120000 124999    573
--   125000 129999    367
--   130000 134999    229
--   135000 139999    153
--   140000 144999     70
--   145000 149999     48
--   150000 154999     24
--   155000 159999     20
--   160000 164999     17
--   165000 169999      8
--   170000 174999      3
--   175000 179999      3
--   180000 184999      2
--   185000 189999      1
--   190000 194999      2
--   195000 199999      1
--   200000 204999      1
--   205000 209999      0
--   210000 214999      0
--   215000 219999      0
--   220000 224999      0
--   225000 229999      1
--   230000 234999      1
----------------------------------------
-- Starting command on Tue Jul 16 11:22:32 2019 with 568528.994 GB free disk space

    cd correction/0-mercounts
    ./meryl-configure.sh \
    > ./meryl-configure.err 2>&1

-- Finished on Tue Jul 16 11:22:44 2019 (12 seconds) with 568526.087 GB free disk space
----------------------------------------
--  segments   memory batches
--  -------- -------- -------
--        01 16.00 GB       1
--        02 15.50 GB       1
--        04 15.00 GB       1
--        06 15.00 GB       1
--        08 14.50 GB       1
--        12 14.50 GB       1
--        16 14.00 GB       1
--        20 14.00 GB       1
--        24 14.00 GB       1
--        32 13.50 GB       1
--        40 13.50 GB       1
--        48 13.50 GB       1
--        56 13.00 GB       1
--        64 13.00 GB       1
--        96  8.01 GB       1
--
--  For 22040437 reads with 421041136151 bases, limit to 4210 batches.
--  Will count kmers using 01 jobs, each using 18 GB and 8 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 1373451.sapelo2 task 1.
--
----------------------------------------
-- Starting command on Tue Jul 16 11:22:44 2019 with 568526.087 GB free disk space

    cd /scratch/malabady/PitcherGenome/PitchPacBio/canu_assembly/rosea3
    qsub \
      -j oe \
      -d `pwd` \
      -W depend=afterany:1373451.sapelo2 \
      -l mem=4g \
      -l nodes=2:ppn=1 \
      -q ggbc_q \
      -l walltime=14:00:00:00  \
      -N 'canu_run' \
      -o canu-scripts/canu.01.out  canu-scripts/canu.01.sh
1373452.sapelo2

-- Finished on Tue Jul 16 11:22:45 2019 (one second) with 568526.087 GB free disk space

I would really appreciate it if you could point out what we are doing wrong here.

Many thanks, Magdy

skoren commented 5 years ago

You shouldn't use nodes=2; Canu processes run on a single node, and it will simply request more than one of those single-node instances at the same time, so use nodes=1.
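For example, a minimal sketch of the relevant grid options with that change applied, reusing the PBS/Torque setup, queue, and walltime from your earlier commands (only the nodes=1 value comes from this advice), would be:

# nodes=1: each Canu task runs on a single node; Canu submits as many such tasks as it needs
canu -p run -d rosea2 genomeSize=3.6g -pacbio-raw ../raw_data/*.subreads.fastq.gz \
  useGrid=true \
  gridEngine=pbs \
  gridEngineThreadsOption="-l nodes=1:ppn=THREADS" \
  gridEngineMemoryOption="-l mem=MEMORY" \
  gridOptions="-q ggbc_q -l walltime=14:00:00:00"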

The output you posted is using useGrid=true, correct? That is the expected output. The way Canu runs (see https://canu.readthedocs.io/en/latest/tutorial.html#execution-configuration) is to submit processes to the grid and then resubmit itself to wait for those to complete and resume execution. So this is fine; the jobs should be in your queue (1373451, and 1373452 waiting for the previous one to complete). If those jobs aren't being scheduled, that's an issue with your grid, not Canu; you'd have to find out why they aren't being scheduled, since they request a low amount of memory and CPU.
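For reference, a quick sketch of how to confirm that the two jobs are sitting in the queue with the expected dependency, using standard Torque commands and the job IDs from your log:

# List your jobs and their states (Q = queued, H = held, R = running)
qstat -u $USER

# Full record for each job; the scheduler's comment field, when present,
# usually says why a queued job has not started, and the second job should
# show the afterany dependency on the first
qstat -f 1373451.sapelo2
qstat -f 1373452.sapelo2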

Malabady commented 5 years ago

Thank you for the clarification; this is really helpful. The original canu command that I invoked interactively is now done (see below), and the two child jobs were Held and Queued even though the resources are available. So I didn't know whether the original command was "done" because the child jobs were not submitted, or the other way around. I think I will rerun it and watch closely.

[1]+ Done nohup canu -p run -d rosea2 genomeSize=3.6g -pacbio-raw ../raw_data/XMAGA.20190628.PACBIO_DATA.PART-//Rosea_1///*.subreads.fastq.gz corOutCoverage=200 correctedErrorRate=0.05 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" useGrid=true gridEngine=pbs gridEngineThreadsOption="-l nodes=2:ppn=THREADS" gridEngineMemoryOption="-l mem=MEMORY" gridOptions="-q ggbc_q -l walltime=14:00:00:00" java=/usr/local/apps/eb/Java/1.8.0_144/bin/java

skoren commented 5 years ago

Held means the job is waiting for the Queued one to complete. Queued just means it is waiting for resources; it's up to your grid to decide when and how to run it. You should check with your IT staff whether you need to specify any additional information in your submit command to make the jobs run. If not, you can check why the job is still in the queue and not being run.
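A minimal way to do that check, assuming a Torque cluster with a Moab scheduler (checkjob is a Moab utility and may not be installed on your system), would be:

# Full job record; the scheduler often attaches a comment explaining the wait
qstat -f <jobid>

# On Moab-scheduled clusters, checkjob reports which resources or policies
# are blocking the job from starting
checkjob -v <jobid>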