marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Batch job submission failed: Requested node configuration is not available #963

Closed icruz1989 closed 6 years ago

icruz1989 commented 6 years ago

Hello, I am trying to use Canu to assemble a plant genome of about 1.5 Gb. The command I am using is:

$canu -d ensamble_tic23_pacbio -p datura_tic23 genomeSize=1.5g gridOptions="--time=80:00:00 --partition=FAST" gnuplotTested=true -pacbio-raw /home/icruz/data/secuencias_pacbio/bamfiles_tic_g/tic23.subreads.fastq 1>run2.log &

Do you know what is happening?


The job stops here:
CRASH:
CRASH: Canu 1.7
CRASH: Please panic, this is abnormal.
ABORT:
CRASH:   Failed to submit batch jobs.
CRASH:
CRASH: Failed at /share/apps/installers/pipelines/icruz/canu-master/Linux-amd64/bin/../lib/site_perl/canu/Execution.pm line 1203.
CRASH:  canu::Execution::submitOrRunParallelJob("datura_tic23", "meryl", "correction/0-mercounts", "meryl", 1) called at /share/apps/installers/pipelines/icruz/canu-master/Linux-amd64/bin/../lib/site_perl/canu/Meryl.pm line 518
CRASH:  canu::Meryl::merylCheck("datura_tic23", "cor") called at /share/apps/installers/pipelines/icruz/canu-master/Linux-amd64/bin/canu line 603
CRASH:
CRASH: Last 50 lines of the relevant log file (correction/0-mercounts/meryl.jobSubmit-01.out):
CRASH:
CRASH: sbatch: error: Batch job submission failed: Requested node configuration is not available
packet_write_wait: Connection to 132.247.186.44 port 60307: Broken pipe
[github_question_canu.txt](https://github.com/marbl/canu/files/2123242/github_question_canu.txt)
[Canu-configuration.txt](https://github.com/marbl/canu/files/2123256/Canu-configuration.txt)
skoren commented 6 years ago

Your cluster is complaining that the meryl job cannot be scheduled on the queue. I'd guess the configuration set it up to use a large-memory node that is not available in the FAST partition you requested. The canu configuration message would have the exact information on what resources it configured.

You can either add whatever flags are necessary via gridOptionsMeryl=XXXX or restrict meryl's memory with merylMemory=<max available on FAST>.
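As a rough sketch, the original command would gain one extra option; BIGMEM here is only a placeholder for whichever queue actually holds the large-memory node on this cluster:

$canu -d ensamble_tic23_pacbio -p datura_tic23 genomeSize=1.5g gridOptions="--time=80:00:00 --partition=FAST" gridOptionsMeryl="--partition=BIGMEM" gnuplotTested=true -pacbio-raw /home/icruz/data/secuencias_pacbio/bamfiles_tic_g/tic23.subreads.fastq 1>run2.log &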

icruz1989 commented 6 years ago

Hello Skoren, thank you for the rapid answer. This is how Canu configured the cluster:

CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from '/share/apps/installers/java/jdk1.8.0_151/bin/java') with -d64 support.
-- Detected 32 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
--
-- Found   7 hosts with  32 cores and   62 GB memory under Slurm control.
-- Found   1 host  with  48 cores and  252 GB memory under Slurm control.
-- Found   1 host  with  24 cores and   62 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl    252 GB   32 CPUs  (k-mer counting)
-- Grid:  cormhap   31 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16 GB   16 CPUs  (overlap detection)
-- Grid:  utgovl    16 GB   16 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12 GB    8 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      252 GB   16 CPUs  (contig construction with bogart)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)
--
-- Found PacBio uncorrected reads in the input files.
--
-- Generating assembly 'datura_tic23' in '/home/icruz/ensamble_tic23_pacbio'
--
-- Parameters:
--
--  genomeSize        1500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Thu Jun 21 00:10:08 2018 with 498.408 GB free disk space
skoren commented 6 years ago

You can see both meryl and bogart were configured to use the 252 GB machine; I assume this isn't in FAST? The fix I suggested will work: either request the appropriate queue via both gridOptionsMeryl and gridOptionsBat, or set merylMemory=62 and batMemory=62.
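A sketch of the second option, keeping the same -d directory so Canu resumes from where it stopped (the memory caps match the 62 GB FAST nodes reported in the configuration above):

$canu -d ensamble_tic23_pacbio -p datura_tic23 genomeSize=1.5g gridOptions="--time=80:00:00 --partition=FAST" merylMemory=62 batMemory=62 gnuplotTested=true -pacbio-raw /home/icruz/data/secuencias_pacbio/bamfiles_tic_g/tic23.subreads.fastq 1>run2.log &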

icruz1989 commented 6 years ago

I think Canu was configuring another partition of the cluster with 252 GB. I have set merylMemory=62 and batMemory=62 and now it is working. I hope the jobs finish well.

Thank you so much

icruz1989 commented 6 years ago

Hello, I had another issue with Canu: the program stopped in the Mhap precompute step. Can you help me?

-- Mhap precompute jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/blocks/000001.dat FAILED.
--   job correction/1-overlapper/blocks/000002.dat FAILED.
--   job correction/1-overlapper/blocks/000003.dat FAILED.
--   job correction/1-overlapper/blocks/000004.dat FAILED.
ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

I opened a precompute.2508_99.out file:

Running job 99 based on SLURM_ARRAY_TASK_ID=99 and offset=0.
Dumping reads from 4557001 to 4603500 (inclusive).

Starting mhap precompute.

Error occurred during initialization of VM
Could not allocate metaspace: 1073741824 bytes
Mhap failed.

Canu configuration
----------------------------------------------------
-- Grid:  utgovl    16 GB   16 CPUs  (overlap detection)
-- Grid:  ovb        4 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32 GB    1 CPU   (overlap store sorting)
-- Grid:  red       12 GB    8 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       62 GB   16 CPUs  (contig construction with bogart)
-- Grid:  gfa       16 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'datura_tic23.seqStore', found PacBio reads:
--   Raw:        6868895
--   Corrected:  0
--   Trimmed:    0
--
-- Generating assembly 'datura_tic23' in '/home/icruz/ensamble_tic23_pacbio'
--
-- Parameters:
--
--  genomeSize        1500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
skoren commented 6 years ago

See issue #940 or #298: your cluster's Java configuration is reserving extra VM space. Because of this you should request more overhead for the JVM using gridOptionscormhap; gridOptionscormhap="--mem=40g" should probably be enough.
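A sketch of how that looks on the command line, with the same options as before plus the extra mhap memory request (Canu should retry the failed precompute jobs when rerun in the same -d directory):

$canu -d ensamble_tic23_pacbio -p datura_tic23 genomeSize=1.5g gridOptions="--time=80:00:00 --partition=FAST" merylMemory=62 batMemory=62 gridOptionscormhap="--mem=40g" gnuplotTested=true -pacbio-raw /home/icruz/data/secuencias_pacbio/bamfiles_tic_g/tic23.subreads.fastq 1>run2.log &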

icruz1989 commented 6 years ago

Hello again and thank you for your help!

Canu stopped again in the mhap process. Inside the folder "1-overlapper", not all of the mhap*.out files crashed, but many of them failed. These are the last lines of the files that failed:

Indeed, many of the mhap.out files are empty.

Processed 46500 to sequences. Time (s) to score, hash to-file, and output: 120.49914677100001
Total scoring time (s): 4515.710164072
Total time (s): 4616.731637541
MinHash search time (s): 1593.3438271910002
Total matches found: 226304528
Average number of matches per lookup: 131.53416332461495
Average number of table elements processed per lookup: 1274.823940133682
Average number of table elements processed per match: 9.691960688475486
Average % of hashed sequences hit per lookup: 0.9495405392905912
Average % of hashed sequences hit that are matches: 14.895054857340043
Average % of hashed sequences fully compared that are matches: 62.4967384908283
safeWrite()-- Write failure on ovFile::writeBuffer::sb: No space left on device
safeWrite()-- Wanted to write 874417 objects (size=1), wrote 31595.
mhapConvert: AS_UTL/AS_UTL_fileIO.C:107: void AS_UTL_safeWrite(FILE, const void, const char, size_t, size_t): Assertion `(__errno_location ()) == 0' failed.
/var/spool/slurmd/job02869/slurm_script: line 2310: 2763 Aborted $bin/mhapConvert -S ../../datura_tic23.seqStore -o ./results/$qry.mhap.ovb.WORKING ./results/$qry.mhap

icruz1989 commented 6 years ago

This is a view of how it looks when I use sacct command to check the status of the jobs:

2809_116.ba+ batch            local 16 COMPLETED 0:0
2809_117     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_117.ba+ batch            local 16 COMPLETED 0:0
2809_118     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_118.ba+ batch            local 16 COMPLETED 0:0
2809_119     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_119.ba+ batch            local 16 COMPLETED 0:0
2809_120     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_120.ba+ batch            local 16 COMPLETED 0:0
2809_121     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_121.ba+ batch            local 16 COMPLETED 0:0
2809_122     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_122.ba+ batch            local 16 COMPLETED 0:0
2809_123     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_123.ba+ batch            local 16 COMPLETED 0:0
2809_124     cormhap_d+ FAST  local 16 FAILED    1:0
2809_124.ba+ batch            local 16 FAILED    1:0
2809_125     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_125.ba+ batch            local 16 COMPLETED 0:0
2809_126     cormhap_d+ FAST  local 16 FAILED    1:0
2809_126.ba+ batch            local 16 FAILED    1:0
2809_127     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_127.ba+ batch            local 16 COMPLETED 0:0
2809_128     cormhap_d+ FAST  local 16 FAILED    1:0
2809_128.ba+ batch            local 16 FAILED    1:0
2809_129     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_129.ba+ batch            local 16 COMPLETED 0:0
2809_130     cormhap_d+ FAST  local 16 FAILED    1:0
2809_130.ba+ batch            local 16 FAILED    1:0
2809_131     cormhap_d+ FAST  local 16 COMPLETED 0:0
2809_131.ba+ batch            local 16 COMPLETED 0:0
2809_132     cormhap_d+ FAST  local 16 FAILED    1:0
2809_132.ba+ batch            local 16 FAILED    1:0
2809_133     cormhap_d+ FAST  local 16 FAILED    1:0
2809_133.ba+ batch            local 16 FAILED    1:0
2809_134     cormhap_d+ FAST  local 16 FAILED    1:0
2809_134.ba+ batch            local 16 FAILED    1:0
2809_135     cormhap_d+ FAST  local 16 FAILED    1:0
2809_135.ba+ batch            local 16 FAILED    1:0
2809_136     cormhap_d+ FAST  local 16 FAILED    1:0
2809_136.ba+ batch            local 16 FAILED    1:0
2809_137     cormhap_d+ FAST  local 16 FAILED    1:0
2809_137.ba+ batch            local 16 FAILED    1:0
2809_138     cormhap_d+ FAST  local 16 FAILED    1:0
2809_138.ba+ batch            local 16 FAILED    1:0

brianwalenz commented 6 years ago

It says your problem right in the logs:

 safeWrite()-- Write failure on ovFile::writeBuffer::sb: No space left on device
icruz1989 commented 6 years ago

Therefore, the problem is with the cluster, right? How can I fix this?

skoren commented 6 years ago

You're out of disk space. From your logs you started with about 500 GB, which is likely not enough for a 1.5 Gb genome. I'd expect you need 1-2 TB, perhaps more if you have a very repetitive genome. So the fix would be to run on a cluster/partition with more available space.
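A quick generic check (not Canu-specific) before restarting is to look at the free space on the filesystem that holds the assembly directory, for example:

$df -h /home/icruz/ensamble_tic23_pacbio

and only rerun once that reports on the order of 1-2 TB available.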

skoren commented 6 years ago

Issue has drifted a bit from the initial error (which was a partition request) so if you encounter errors running on a larger disk partition, open a new issue.