Your cluster is complaining that the meryl job cannot be scheduled on the queue. I'd guess the configuration set it up to use a large-memory node that is not available in the FAST partition you requested. The Canu configuration message will have the exact information on what resources it configured. You can either add whatever Slurm flags are necessary using gridOptionsMeryl=<flags>, or restrict merylMemory=<max available on FAST>.
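For illustration only, the first route would look roughly like this; the assembly prefix, run directory, read file, and the "bigmem" partition name are placeholders, not taken from this thread:

# send only the meryl job to the queue that has the large-memory node
canu -p asm -d run genomeSize=1.5g \
     gridOptionsMeryl="--partition=bigmem" \
     -pacbio-raw reads.fastq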
Hello skoren, thank you for the rapid answer. This is how Canu configured the cluster:
CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from '/share/apps/installers/java/jdk1.8.0_151/bin/java') with -d64 support.
-- Detected 32 CPUs and 63 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
--
-- Found 7 hosts with 32 cores and 62 GB memory under Slurm control.
-- Found 1 host with 48 cores and 252 GB memory under Slurm control.
-- Found 1 host with 24 cores and 62 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--      -------  --------  --------  -----------------------------
-- Grid: meryl 252 GB 32 CPUs (k-mer counting)
-- Grid: cormhap 31 GB 16 CPUs (overlap detection with mhap)
-- Grid: obtovl 16 GB 16 CPUs (overlap detection)
-- Grid: utgovl 16 GB 16 CPUs (overlap detection)
-- Grid: ovb 4 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32 GB 1 CPU (overlap store sorting)
-- Grid: red 12 GB 8 CPUs (read error detection)
-- Grid: oea 4 GB 1 CPU (overlap error adjustment)
-- Grid: bat 252 GB 16 CPUs (contig construction with bogart)
-- Grid: gfa 16 GB 16 CPUs (GFA alignment and processing)
--
-- Found PacBio uncorrected reads in the input files.
--
-- Generating assembly 'datura_tic23' in '/home/icruz/ensamble_tic23_pacbio'
--
-- Parameters:
--
-- genomeSize 1500000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
--
--
-- BEGIN CORRECTION
--
----------------------------------------
-- Starting command on Thu Jun 21 00:10:08 2018 with 498.408 GB free disk space
You can see both meryl and bogart were configured to use the 252 GB machine; I assume this isn't in FAST? The fix I suggested will work: either request the appropriate queue for both with gridOptionsMeryl and gridOptionsBat, or set merylMemory=62 and batMemory=62.
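For example, a sketch of the memory-cap route, with the assembly name and directory taken from your log (the read file and any other options are assumptions on my part):

canu -p datura_tic23 -d /home/icruz/ensamble_tic23_pacbio \
     genomeSize=1.5g \
     merylMemory=62 batMemory=62 \
     -pacbio-raw reads.fastq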
I think Canu was configuring for another partition of the cluster that has the 252 GB node. I have set merylMemory=62 and batMemory=62 and it is working now. I hope the jobs finish well.
Thank you so much
Hello, I ran into another issue with Canu: the program stopped in the Mhap precompute step. Can you help me?
-- Mhap precompute jobs failed, tried 2 times, giving up.
-- job correction/1-overlapper/blocks/000001.dat FAILED.
-- job correction/1-overlapper/blocks/000002.dat FAILED.
-- job correction/1-overlapper/blocks/000003.dat FAILED.
-- job correction/1-overlapper/blocks/000004.dat FAILED.
ABORT:
ABORT: Canu 1.7
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
I opened the precompute.2508_99.out file:
Running job 99 based on SLURM_ARRAY_TASK_ID=99 and offset=0.
Dumping reads from 4557001 to 4603500 (inclusive).
Starting mhap precompute.
Error occurred during initialization of VM
Could not allocate metaspace: 1073741824 bytes
Mhap failed.
Canu configuration
----------------------------------------------------
-- Grid: utgovl 16 GB 16 CPUs (overlap detection)
-- Grid: ovb 4 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32 GB 1 CPU (overlap store sorting)
-- Grid: red 12 GB 8 CPUs (read error detection)
-- Grid: oea 4 GB 1 CPU (overlap error adjustment)
-- Grid: bat 62 GB 16 CPUs (contig construction with bogart)
-- Grid: gfa 16 GB 16 CPUs (GFA alignment and processing)
--
-- In 'datura_tic23.seqStore', found PacBio reads:
-- Raw: 6868895
-- Corrected: 0
-- Trimmed: 0
--
-- Generating assembly 'datura_tic23' in '/home/icruz/ensamble_tic23_pacbio'
--
-- Parameters:
--
-- genomeSize 1500000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
See issue #940 or #298; your cluster's Java configuration is reserving extra VM space. Because of this you should request more overhead for the JVM using gridOptionscormhap; gridOptionscormhap="--mem=40g" should probably be enough.
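A sketch of how that could be added to the command; only the gridOptionscormhap setting is the actual suggestion here, the rest of the invocation is assumed from earlier in the thread:

# --mem=40g asks Slurm for 40 GB per cormhap job: the 31 GB Canu configured for mhap
# plus headroom for the extra VM space the cluster's Java setup reserves
canu -p datura_tic23 -d /home/icruz/ensamble_tic23_pacbio \
     genomeSize=1.5g \
     merylMemory=62 batMemory=62 \
     gridOptionscormhap="--mem=40g" \
     -pacbio-raw reads.fastq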
Hello again, and thank you for your help!
Canu stopped again in the mhap process. Inside the "1-overlapper" folder not all of the mhap*.out jobs crashed, but many of them failed, and indeed many of the mhap.out files are empty. These are the last lines of one of the files that failed:
Processed 46500 to sequences.
Time (s) to score, hash to-file, and output: 120.49914677100001
Total scoring time (s): 4515.710164072
Total time (s): 4616.731637541
MinHash search time (s): 1593.3438271910002
Total matches found: 226304528
Average number of matches per lookup: 131.53416332461495
Average number of table elements processed per lookup: 1274.823940133682
Average number of table elements processed per match: 9.691960688475486
Average % of hashed sequences hit per lookup: 0.9495405392905912
Average % of hashed sequences hit that are matches: 14.895054857340043
Average % of hashed sequences fully compared that are matches: 62.4967384908283
safeWrite()-- Write failure on ovFile::writeBuffer::sb: No space left on device
safeWrite()-- Wanted to write 874417 objects (size=1), wrote 31595.
mhapConvert: AS_UTL/AS_UTL_fileIO.C:107: void AS_UTL_safeWrite(FILE*, const void*, const char*, size_t, size_t): Assertion `(*__errno_location ()) == 0' failed.
/var/spool/slurmd/job02869/slurm_script: line 2310:  2763 Aborted    $bin/mhapConvert -S ../../datura_tic23.seqStore -o ./results/$qry.mhap.ovb.WORKING ./results/$qry.mhap
This is how it looks when I use the sacct command to check the status of the jobs:
2809_116.ba+       batch             local  16  COMPLETED  0:0
2809_117      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_117.ba+       batch             local  16  COMPLETED  0:0
2809_118      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_118.ba+       batch             local  16  COMPLETED  0:0
2809_119      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_119.ba+       batch             local  16  COMPLETED  0:0
2809_120      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_120.ba+       batch             local  16  COMPLETED  0:0
2809_121      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_121.ba+       batch             local  16  COMPLETED  0:0
2809_122      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_122.ba+       batch             local  16  COMPLETED  0:0
2809_123      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_123.ba+       batch             local  16  COMPLETED  0:0
2809_124      cormhap_d+    FAST     local  16  FAILED     1:0
2809_124.ba+       batch             local  16  FAILED     1:0
2809_125      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_125.ba+       batch             local  16  COMPLETED  0:0
2809_126      cormhap_d+    FAST     local  16  FAILED     1:0
2809_126.ba+       batch             local  16  FAILED     1:0
2809_127      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_127.ba+       batch             local  16  COMPLETED  0:0
2809_128      cormhap_d+    FAST     local  16  FAILED     1:0
2809_128.ba+       batch             local  16  FAILED     1:0
2809_129      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_129.ba+       batch             local  16  COMPLETED  0:0
2809_130      cormhap_d+    FAST     local  16  FAILED     1:0
2809_130.ba+       batch             local  16  FAILED     1:0
2809_131      cormhap_d+    FAST     local  16  COMPLETED  0:0
2809_131.ba+       batch             local  16  COMPLETED  0:0
2809_132      cormhap_d+    FAST     local  16  FAILED     1:0
2809_132.ba+       batch             local  16  FAILED     1:0
2809_133      cormhap_d+    FAST     local  16  FAILED     1:0
2809_133.ba+       batch             local  16  FAILED     1:0
2809_134      cormhap_d+    FAST     local  16  FAILED     1:0
2809_134.ba+       batch             local  16  FAILED     1:0
2809_135      cormhap_d+    FAST     local  16  FAILED     1:0
2809_135.ba+       batch             local  16  FAILED     1:0
2809_136      cormhap_d+    FAST     local  16  FAILED     1:0
2809_136.ba+       batch             local  16  FAILED     1:0
2809_137      cormhap_d+    FAST     local  16  FAILED     1:0
2809_137.ba+       batch             local  16  FAILED     1:0
2809_138      cormhap_d+    FAST     local  16  FAILED     1:0
2809_138.ba+       batch             local  16  FAILED     1:0
Your problem is stated right in the logs:
safeWrite()-- Write failure on ovFile::writeBuffer::sb: No space left on device
Therefore, the problem is with the cluster, right? How can I fix this?
You're out of disk space. From your logs you started with about 500 GB, which is likely not enough for a 1.5 Gbp genome. I'd expect you need 1-2 TB, perhaps more if you have a very repetitive genome. So the fix is to run on a cluster/partition with more available disk space.
This issue has drifted a bit from the initial error (which was a partition request), so if you encounter errors running on a larger disk partition, please open a new issue.
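If it helps before resubmitting, free and used space where Canu is writing can be checked with standard tools (the path is the run directory from the log above):

df -h  /home/icruz/ensamble_tic23_pacbio     # free space on the filesystem holding the run
du -sh /home/icruz/ensamble_tic23_pacbio     # how much the run has consumed so far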
Hello, I am trying to use Canu to assemble a plant genome of 1.5 Gbp; the command that I am using is
Do you know what is happening?