marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

sbatch: error: Batch job submission failed: Requested node configuration is not available #1392

Closed: HangweiXi closed this issue 5 years ago

HangweiXi commented 5 years ago

Hi

I'm trying to use Canu to assemble a plant genome on AWS. The command is:

canu -correct -p vetch_pacbio -d /shared/assembly useGrid=true gridOptions="--mem=368GB -n 48" genomeSize=1.7g -pacbio-raw /shared/pac_bam/stud_pacbio.fasta

It successfully submits a job but eventually crashes.

The canu.out:


Found perl:
   /usr/bin/perl
   This is perl 5, version 22, subversion 1 (v5.22.1) built for x86_64-linux-gnu-thread-multi

Found java:
   /shared/Java/jre1.8.0_212/bin/java
   java version "1.8.0_212"

Found canu:
   /shared/Canu/canu-1.8/Linux-amd64/bin/canu
   Canu 1.8

-- Canu 1.8
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https://doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošić M, Šikić M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_212' (from 'java') with -d64 support.
-- Detected gnuplot version '5.2 patchlevel 7   ' (from 'gnuplot') and image format 'svg'.
-- Detected 48 CPUs and 374 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /opt/slurm/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found   1 host  with  48 cores and  373 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     62 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16 GB   48 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   32 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB   16 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12 GB    8 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256 GB   16 CPUs  (contig construction with bogart)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)
--
-- Found PacBio uncorrected reads in the input files.
--
-- Generating assembly 'vetch_pacbio' in '/shared/assembly'
--
-- Parameters:
--
--  genomeSize        1700000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Sun Jun 16 07:15:16 2019 with 5056.038 GB free disk space

    cd .
    /shared/Canu/canu-1.8/Linux-amd64/bin/sqStoreCreate \
      -o ./vetch_pacbio.seqStore.BUILDING \
      -minlength 1000 \
      ./vetch_pacbio.seqStore.ssi \
    > ./vetch_pacbio.seqStore.BUILDING.err 2>&1

-- Finished on Sun Jun 16 07:22:00 2019 (404 seconds) with 5049.651 GB free disk space
----------------------------------------
--
-- In sequence store './vetch_pacbio.seqStore':
--   Found 3173358 reads.
--   Found 24779760444 bases (14.57 times coverage).
--
--   Read length histogram (one '*' equals 17979.22 reads):
--        0   4999 1258546 **********************************************************************
--     5000   9999 1112620 *************************************************************
--    10000  14999 446132 ************************
--    15000  19999 183885 **********
--    20000  24999  87908 ****
--    25000  29999  44037 **
--    30000  34999  21549 *
--    35000  39999  10385 
--    40000  44999   4976 
--    45000  49999   2100 
--    50000  54999    805 
--    55000  59999    290 
--    60000  64999     84 
--    65000  69999     27 
--    70000  74999      9 
--    75000  79999      0 
--    80000  84999      1 
--    85000  89999      1 
--    90000  94999      0 
--    95000  99999      1 
--   100000 104999      1 
--   105000 109999      0 
--   110000 114999      0 
--   115000 119999      1 
----------------------------------------
-- Starting command on Sun Jun 16 07:22:31 2019 with 5049.635 GB free disk space

    cd correction/0-mercounts
    ./meryl-configure.sh \
    > ./meryl-configure.err 2>&1

-- Finished on Sun Jun 16 07:22:33 2019 (2 seconds) with 5049.635 GB free disk space
----------------------------------------
--  segments   memory batches
--  -------- -------- -------
--        01 14.00 GB       1
--        02 13.50 GB       1
--        04 12.00 GB       1
--        06  8.00 GB       1
--        08  7.00 GB       1
--        12  4.00 GB       1
--        16  3.50 GB       1
--        20  3.19 GB       1
--        24  2.69 GB       1
--        32  2.00 GB       1
--        40  1.62 GB       1
--        48  1.38 GB       1
--        56  1.19 GB       1
--        64  1.00 GB       1
--        96  0.69 GB       1
--
--  For 3173358 reads with 24779760444 bases, limit to 247 batches.
--  Will count kmers using 01 jobs, each using 16 GB and 8 threads.
--
-- Finished stage 'merylConfigure', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--

CRASH:
CRASH: Canu 1.8
CRASH: Please panic, this is abnormal.
ABORT:
CRASH:   Failed to submit batch jobs.
CRASH:
CRASH: Failed at /shared/Canu/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/Execution.pm line 1233.
CRASH:  canu::Execution::submitOrRunParallelJob("vetch_pacbio", "meryl", "correction/0-mercounts", "meryl-count", 1) called at /shared/Canu/canu-1.8/Linux-amd64/bin/../lib/site_perl/canu/Meryl.pm line 805
CRASH:  canu::Meryl::merylCountCheck("vetch_pacbio", "cor") called at /shared/Canu/canu-1.8/Linux-amd64/bin/canu line 780
CRASH: 
CRASH: Last 50 lines of the relevant log file (correction/0-mercounts/meryl-count.jobSubmit-01.out):
CRASH:
CRASH: sbatch: error: Batch job submission failed: Requested node configuration is not available
CRASH:

The meryl-count.jobSubmit-01.out under correction/0-mercounts contains:

sbatch: error: Batch job submission failed: Requested node configuration is not available

Thank you

skoren commented 5 years ago

I'm not sure how you configured your AWS Slurm; have you checked the Canu FAQ to make sure it supports all the required features (array jobs, hold, etc.)? Essentially, Slurm is saying the requested machines don't exist. Canu should be requesting 16 GB and 8 cores in a single array job, which is below what your Slurm claims exists. You can see the exact request command in correction/0-mercounts/meryl-count.jobSubmit-01.sh. You'll have to figure out if there are other options you need to provide to Slurm to support array jobs. Since you've only got one node in your "cluster" anyway, you might as well run with useGrid=false.
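
A quick way to see the mismatch (a sketch, not Canu's own tooling; the sinfo format string is just one way to list per-node resources, and the paths are the ones from this thread):

    # Inspect exactly what Canu asked sbatch for:
    cat correction/0-mercounts/meryl-count.jobSubmit-01.sh

    # Compare against what each partition actually offers
    # (%P = partition, %c = CPUs per node, %m = memory per node, %D = node count):
    sinfo -o "%P %c %m %D"

To bypass the grid entirely on your single node, the same run with useGrid=false would look like:

    canu -correct -p vetch_pacbio -d /shared/assembly \
      useGrid=false genomeSize=1.7g \
      -pacbio-raw /shared/pac_bam/stud_pacbio.fasta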

HangweiXi commented 5 years ago

Thanks for your reply! I found a way to activate other nodes on the grid and Canu is working. I still have some questions about it.

  1. When I check the log, I'm not sure what this part exactly means:
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl    256 GB   32 CPUs  (k-mer counting)
-- Grid:  cormhap   31 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB    8 CPUs  (overlap detection)
-- Grid:  ovb        3 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red        8 GB    4 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256 GB   16 CPUs  (contig construction)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)

I understand these should be the resource requirements for each step. My question is: is this the requirement for a single job within the job array for each step?

  2. I've booked some big nodes on AWS; both have 48 cores and 384 GB RAM. I set gridOptions="-n 48 --mem=370GB". I think that means each sub-job in the job array will request those resources, which far exceeds the actual requirements. Does that mean compute resources are wasted? Since Canu detects the resource requirements automatically, will Canu run more efficiently without setting gridOptions?

Thank you very much!

skoren commented 5 years ago
  1. Yes, that is for a single job; Canu will submit multiple jobs, each requesting those resources.
  2. That's wrong; don't add resource requests to gridOptions. Canu will automatically request the resources it needs for each step, as described in 1. gridOptions should only be used to pass required scheduler parameters, like time limits or partition requests (see the sketch below).
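
For example, a sketch of the same run with gridOptions carrying only scheduler-level parameters; the partition name and time limit here are placeholders for whatever your cluster requires:

    canu -correct -p vetch_pacbio -d /shared/assembly \
      useGrid=true gridOptions="--partition=compute --time=24:00:00" \
      genomeSize=1.7g -pacbio-raw /shared/pac_bam/stud_pacbio.fasta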