marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Ran Canu Assembler and no assembly file was found. It says "Mhap overlap jobs failed" #2112

Closed. KarimAI7 closed this issue 2 years ago.

KarimAI7 commented 2 years ago

Here is the canu.out info:

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '14.0.2' (from '/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/java/14.0.2/bin/java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 8   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 1 CPUs and 5120 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /opt/software/slurm/bin/sinfo.
-- Detected Slurm with task IDs up to 9999 allowed.
--
-- Slurm support detected.  Resources available:
--     33 hosts with  64 cores and 2008 GB memory.
--    159 hosts with  48 cores and  497 GB memory.
--    1109 hosts with  64 cores and  248 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   13.000 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       15.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB    8 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found Nanopore reads in 'myassembly.seqStore':
--   Libraries:
--     Nanopore:              1701
--   Reads:
--     Raw:                   1978436816
--
--
-- Generating assembly 'myassembly' in '/lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly':
--   genomeSize:
--     55000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.3200 ( 32.00%)
--     obtOvlErrorRate 0.1200 ( 12.00%)
--     utgOvlErrorRate 0.1200 ( 12.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.3000 ( 30.00%)
--     obtErrorRate    0.1200 ( 12.00%)
--     utgErrorRate    0.1200 ( 12.00%)
--     cnsErrorRate    0.2000 ( 20.00%)
--
--   Stages to run:
--     correct raw reads.
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
--
-- BEGIN CORRECTION
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
--
-- Mhap overlap jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/results/000001.ovb FAILED.
--   job correction/1-overlapper/results/000002.ovb FAILED.
--   job correction/1-overlapper/results/000003.ovb FAILED.
--   job correction/1-overlapper/results/000004.ovb FAILED.
--   job correction/1-overlapper/results/000005.ovb FAILED.
--   job correction/1-overlapper/results/000006.ovb FAILED.
--   job correction/1-overlapper/results/000007.ovb FAILED.
--   job correction/1-overlapper/results/000008.ovb FAILED.
--   job correction/1-overlapper/results/000009.ovb FAILED.
--   job correction/1-overlapper/results/000010.ovb FAILED.
--   job correction/1-overlapper/results/000011.ovb FAILED.
--   job correction/1-overlapper/results/000012.ovb FAILED.
--   job correction/1-overlapper/results/000013.ovb FAILED.
--   job correction/1-overlapper/results/000014.ovb FAILED.
--   job correction/1-overlapper/results/000015.ovb FAILED.
--   job correction/1-overlapper/results/000016.ovb FAILED.
--   job correction/1-overlapper/results/000017.ovb FAILED.
--   job correction/1-overlapper/results/000018.ovb FAILED.
--   job correction/1-overlapper/results/000019.ovb FAILED.
--   job correction/1-overlapper/results/000020.ovb FAILED.
--   job correction/1-overlapper/results/000021.ovb FAILED.
--   job correction/1-overlapper/results/000022.ovb FAILED.
--   job correction/1-overlapper/results/000023.ovb FAILED.
--   job correction/1-overlapper/results/000024.ovb FAILED.
--   job correction/1-overlapper/results/000025.ovb FAILED.
--   job correction/1-overlapper/results/000026.ovb FAILED.
--   job correction/1-overlapper/results/000027.ovb FAILED.
--   job correction/1-overlapper/results/000028.ovb FAILED.
--   job correction/1-overlapper/results/000029.ovb FAILED.
--   job correction/1-overlapper/results/000030.ovb FAILED.
--   job correction/1-overlapper/results/000031.ovb FAILED.
--   job correction/1-overlapper/results/000032.ovb FAILED.
--   job correction/1-overlapper/results/000033.ovb FAILED.
--   job correction/1-overlapper/results/000034.ovb FAILED.
--   job correction/1-overlapper/results/000035.ovb FAILED.
--   job correction/1-overlapper/results/000037.ovb FAILED.
--   job correction/1-overlapper/results/000038.ovb FAILED.
--   job correction/1-overlapper/results/000039.ovb FAILED.
--   job correction/1-overlapper/results/000040.ovb FAILED.
--   job correction/1-overlapper/results/000041.ovb FAILED.
--   job correction/1-overlapper/results/000042.ovb FAILED.
--   job correction/1-overlapper/results/000043.ovb FAILED.
--   job correction/1-overlapper/results/000044.ovb FAILED.
--   job correction/1-overlapper/results/000045.ovb FAILED.
--   job correction/1-overlapper/results/000046.ovb FAILED.
--   job correction/1-overlapper/results/000047.ovb FAILED.
--   job correction/1-overlapper/results/000048.ovb FAILED.
--   job correction/1-overlapper/results/000049.ovb FAILED.
--   job correction/1-overlapper/results/000050.ovb FAILED.
--   job correction/1-overlapper/results/000051.ovb FAILED.
--   job correction/1-overlapper/results/000052.ovb FAILED.
--   job correction/1-overlapper/results/000053.ovb FAILED.
--   job correction/1-overlapper/results/000054.ovb FAILED.
--   job correction/1-overlapper/results/000055.ovb FAILED.
--   job correction/1-overlapper/results/000056.ovb FAILED.
--   job correction/1-overlapper/results/000057.ovb FAILED.
--   job correction/1-overlapper/results/000058.ovb FAILED.
--   job correction/1-overlapper/results/000059.ovb FAILED.
--   job correction/1-overlapper/results/000060.ovb FAILED.
--   job correction/1-overlapper/results/000061.ovb FAILED.
--   job correction/1-overlapper/results/000062.ovb FAILED.
--   job correction/1-overlapper/results/000063.ovb FAILED.
--   job correction/1-overlapper/results/000064.ovb FAILED.
--   job correction/1-overlapper/results/000065.ovb FAILED.
--   job correction/1-overlapper/results/000066.ovb FAILED.
--   job correction/1-overlapper/results/000067.ovb FAILED.
--   job correction/1-overlapper/results/000068.ovb FAILED.
--   job correction/1-overlapper/results/000070.ovb FAILED.
--   job correction/1-overlapper/results/000071.ovb FAILED.
--   job correction/1-overlapper/results/000072.ovb FAILED.
--   job correction/1-overlapper/results/000073.ovb FAILED.
--   job correction/1-overlapper/results/000074.ovb FAILED.
--   job correction/1-overlapper/results/000075.ovb FAILED.
--   job correction/1-overlapper/results/000076.ovb FAILED.
--   job correction/1-overlapper/results/000077.ovb FAILED.
--   job correction/1-overlapper/results/000078.ovb FAILED.
--   job correction/1-overlapper/results/000079.ovb FAILED.
--   job correction/1-overlapper/results/000080.ovb FAILED.
--   job correction/1-overlapper/results/000081.ovb FAILED.
--   job correction/1-overlapper/results/000082.ovb FAILED.
--   job correction/1-overlapper/results/000083.ovb FAILED.
--   job correction/1-overlapper/results/000084.ovb FAILED.
--   job correction/1-overlapper/results/000085.ovb FAILED.
--   job correction/1-overlapper/results/000086.ovb FAILED.
--   job correction/1-overlapper/results/000087.ovb FAILED.
--   job correction/1-overlapper/results/000088.ovb FAILED.
--   job correction/1-overlapper/results/000089.ovb FAILED.
--   job correction/1-overlapper/results/000090.ovb FAILED.
--   job correction/1-overlapper/results/000092.ovb FAILED.
--   job correction/1-overlapper/results/000093.ovb FAILED.
--   job correction/1-overlapper/results/000094.ovb FAILED.
--   job correction/1-overlapper/results/000095.ovb FAILED.
--   job correction/1-overlapper/results/000096.ovb FAILED.
--   job correction/1-overlapper/results/000097.ovb FAILED.
--   job correction/1-overlapper/results/000098.ovb FAILED.
--   job correction/1-overlapper/results/000099.ovb FAILED.
--   job correction/1-overlapper/results/000100.ovb FAILED.
--   job correction/1-overlapper/results/000101.ovb FAILED.
--   job correction/1-overlapper/results/000102.ovb FAILED.
--

ABORT:
skoren commented 2 years ago

As I mentioned in #2111, when MHAP jobs fail it is most likely a JVM issue. Post the log from one of the failed jobs (correction/1-overlapper/mhap.*.out).

KarimAI7 commented 2 years ago
Here is the log:
Found perl:
   /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/perl/5.30.2/bin/perl
   This is perl 5, version 30, subversion 2 (v5.30.2) built for x86_64-linux-thread-multi

Found java:
   /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/java/14.0.2/bin/java
   Picked up JAVA_TOOL_OPTIONS: -Xmx2g

Found canu:
   /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/canu/2.2/bin/canu
   canu 2.2

Running job 53 based on SLURM_ARRAY_TASK_ID=53 and offset=0.
Fetch blocks/000027.dat
Fetch blocks/000028.dat
Fetch blocks/000029.dat
Fetch blocks/000030.dat
Fetch blocks/000031.dat
Fetch blocks/000032.dat
Fetch blocks/000033.dat
Fetch blocks/000034.dat
Fetch blocks/000035.dat
Fetch blocks/000036.dat
Fetch blocks/000037.dat

Running block 000015 in query 000053

Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = true
--no-tf = false
--num-hashes = 512
--num-min-matches = 3
--num-threads = 16
--ordered-kmer-size = 14
--ordered-sketch-size = 1000
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.78
--version = false
-f =
-h = false
-k = 16
-p =
-q = queries/000053
-s = ./blocks/000015.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Current # sequences stored: 30000...
Current # sequences stored: 35000...
Current # sequences stored: 40000...
Current # sequences stored: 45000...
Current # sequences stored: 50000...
Current # sequences stored: 55000...
Current # sequences stored: 60000...
Current # sequences stored: 65000...
Current # sequences stored: 70000...
Stored 70200 sequences in the index.
Processed 70200 unique sequences (fwd and rev).
Time (s) to read and hash from file: 13.204459519
Opened fasta file /lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly/correction/1-overlapper/blocks/000027.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
slurmstepd: error: *** JOB 4495565 ON nc11023 CANCELLED AT 2022-04-10T10:37:36 DUE TO TIME LIMIT ***

At the bottom it says "DUE TO TIME LIMIT", which is odd, since I had an hour to spare on my salloc command on Compute Canada. Could it be that I initially ran it with less time, and when that ran out I re-ran it with more time? I assumed the jobs would pick up from where they left off, since that is what I read about Canu.

skoren commented 2 years ago

The time you specify for the initial job is not inherited by any of the jobs Canu submits. As the FAQ says, you need to explicitly specify time/partition limits using gridOptions, something like gridOptions="--time 36:00:00".
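A minimal sketch of such an invocation (the assembly name, read files, and the 36-hour value are only placeholders; adjust them to your data and cluster):

    canu -p myassembly -d canu_assembly \
         genomeSize=55m -nanopore *.fastq \
         useGrid=true \
         gridOptions="--time 36:00:00"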

KarimAI7 commented 2 years ago

I saw your comments from #1331

I ended up running the following:

    canu -p myassembly -d canu_assembly genomeSize=55m -nanopore *.fastq useGrid=true gridOptions="--time 8:00:00"

skoren commented 2 years ago

Did the mhap step run with that option?

KarimAI7 commented 2 years ago

Mhap failed its first attempt, and is trying again:

canu.out:

--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
--
-- Mhap overlap jobs failed, retry.
--   job correction/1-overlapper/results/000001.ovb FAILED.
--   job correction/1-overlapper/results/000002.ovb FAILED.
--   job correction/1-overlapper/results/000003.ovb FAILED.
--   job correction/1-overlapper/results/000004.ovb FAILED.
--   job correction/1-overlapper/results/000005.ovb FAILED.
--   job correction/1-overlapper/results/000006.ovb FAILED.
--   job correction/1-overlapper/results/000007.ovb FAILED.
--   job correction/1-overlapper/results/000008.ovb FAILED.
--   job correction/1-overlapper/results/000009.ovb FAILED.
--   job correction/1-overlapper/results/000010.ovb FAILED.
--   job correction/1-overlapper/results/000011.ovb FAILED.
--   job correction/1-overlapper/results/000012.ovb FAILED.
--   job correction/1-overlapper/results/000013.ovb FAILED.
--   job correction/1-overlapper/results/000014.ovb FAILED.
--   job correction/1-overlapper/results/000015.ovb FAILED.
--   job correction/1-overlapper/results/000016.ovb FAILED.
--   job correction/1-overlapper/results/000017.ovb FAILED.
--   job correction/1-overlapper/results/000018.ovb FAILED.
--   job correction/1-overlapper/results/000019.ovb FAILED.
--   job correction/1-overlapper/results/000020.ovb FAILED.
--   job correction/1-overlapper/results/000021.ovb FAILED.
--   job correction/1-overlapper/results/000022.ovb FAILED.
--   job correction/1-overlapper/results/000023.ovb FAILED.
--   job correction/1-overlapper/results/000024.ovb FAILED.
--   job correction/1-overlapper/results/000025.ovb FAILED.
--   job correction/1-overlapper/results/000026.ovb FAILED.
--   job correction/1-overlapper/results/000027.ovb FAILED.
--   job correction/1-overlapper/results/000028.ovb FAILED.
--   job correction/1-overlapper/results/000029.ovb FAILED.
--   job correction/1-overlapper/results/000030.ovb FAILED.
--   job correction/1-overlapper/results/000031.ovb FAILED.
--   job correction/1-overlapper/results/000033.ovb FAILED.
--   job correction/1-overlapper/results/000034.ovb FAILED.
--   job correction/1-overlapper/results/000035.ovb FAILED.
--   job correction/1-overlapper/results/000037.ovb FAILED.
--   job correction/1-overlapper/results/000038.ovb FAILED.
--   job correction/1-overlapper/results/000039.ovb FAILED.
--   job correction/1-overlapper/results/000040.ovb FAILED.
--   job correction/1-overlapper/results/000041.ovb FAILED.
--   job correction/1-overlapper/results/000042.ovb FAILED.
--   job correction/1-overlapper/results/000043.ovb FAILED.
--   job correction/1-overlapper/results/000044.ovb FAILED.
--   job correction/1-overlapper/results/000045.ovb FAILED.
--   job correction/1-overlapper/results/000046.ovb FAILED.
--   job correction/1-overlapper/results/000047.ovb FAILED.
--   job correction/1-overlapper/results/000048.ovb FAILED.
--   job correction/1-overlapper/results/000049.ovb FAILED.
--   job correction/1-overlapper/results/000050.ovb FAILED.
--   job correction/1-overlapper/results/000051.ovb FAILED.
--   job correction/1-overlapper/results/000052.ovb FAILED.
--   job correction/1-overlapper/results/000053.ovb FAILED.
--   job correction/1-overlapper/results/000054.ovb FAILED.
--   job correction/1-overlapper/results/000055.ovb FAILED.
--   job correction/1-overlapper/results/000056.ovb FAILED.
--   job correction/1-overlapper/results/000057.ovb FAILED.
--   job correction/1-overlapper/results/000058.ovb FAILED.
--   job correction/1-overlapper/results/000059.ovb FAILED.
--   job correction/1-overlapper/results/000060.ovb FAILED.
--   job correction/1-overlapper/results/000061.ovb FAILED.
--   job correction/1-overlapper/results/000062.ovb FAILED.
--   job correction/1-overlapper/results/000063.ovb FAILED.
--   job correction/1-overlapper/results/000064.ovb FAILED.
--   job correction/1-overlapper/results/000065.ovb FAILED.
--   job correction/1-overlapper/results/000067.ovb FAILED.
--   job correction/1-overlapper/results/000068.ovb FAILED.
--   job correction/1-overlapper/results/000070.ovb FAILED.
--   job correction/1-overlapper/results/000071.ovb FAILED.
--   job correction/1-overlapper/results/000072.ovb FAILED.
--   job correction/1-overlapper/results/000073.ovb FAILED.
--   job correction/1-overlapper/results/000074.ovb FAILED.
--   job correction/1-overlapper/results/000075.ovb FAILED.
--   job correction/1-overlapper/results/000076.ovb FAILED.
--   job correction/1-overlapper/results/000077.ovb FAILED.
--   job correction/1-overlapper/results/000078.ovb FAILED.
--   job correction/1-overlapper/results/000079.ovb FAILED.
--   job correction/1-overlapper/results/000080.ovb FAILED.
--   job correction/1-overlapper/results/000081.ovb FAILED.
--   job correction/1-overlapper/results/000082.ovb FAILED.
--   job correction/1-overlapper/results/000083.ovb FAILED.
--   job correction/1-overlapper/results/000084.ovb FAILED.
--   job correction/1-overlapper/results/000085.ovb FAILED.
--   job correction/1-overlapper/results/000086.ovb FAILED.
--   job correction/1-overlapper/results/000087.ovb FAILED.
--   job correction/1-overlapper/results/000088.ovb FAILED.
--   job correction/1-overlapper/results/000089.ovb FAILED.
--   job correction/1-overlapper/results/000090.ovb FAILED.
--   job correction/1-overlapper/results/000092.ovb FAILED.
--   job correction/1-overlapper/results/000093.ovb FAILED.
--   job correction/1-overlapper/results/000094.ovb FAILED.
--   job correction/1-overlapper/results/000095.ovb FAILED.
--   job correction/1-overlapper/results/000096.ovb FAILED.
--   job correction/1-overlapper/results/000097.ovb FAILED.
--   job correction/1-overlapper/results/000098.ovb FAILED.
--   job correction/1-overlapper/results/000099.ovb FAILED.
--   job correction/1-overlapper/results/000100.ovb FAILED.
--   job correction/1-overlapper/results/000101.ovb FAILED.
--   job correction/1-overlapper/results/000102.ovb FAILED.
--
--
-- Running jobs.  Second attempt out of 2.
--
-- 'mhap.jobSubmit-01.sh' -> job 4582946 tasks 1-31.
-- 'mhap.jobSubmit-02.sh' -> job 4582947 tasks 33-35.
-- 'mhap.jobSubmit-03.sh' -> job 4582948 tasks 37-65.
-- 'mhap.jobSubmit-04.sh' -> job 4582949 tasks 67-68.
-- 'mhap.jobSubmit-05.sh' -> job 4582950 tasks 70-90.
-- 'mhap.jobSubmit-06.sh' -> job 4582951 tasks 92-102.
--
----------------------------------------
-- Starting command on Tue Apr 12 12:33:50 2022 with 0 GB free disk space

    cd /lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly
    sbatch \
      --depend=afterany:4582946:4582947:4582948:4582949:4582950:4582951 \
      --cpus-per-task=1 \
      --mem-per-cpu=5g \
      --time 8:00:00  \
      -D `pwd` \
      -J 'canu_myassembly' \
      -o canu-scripts/canu.06.out  canu-scripts/canu.06.sh
Submitted batch job 4582952

-- Finished on Tue Apr 12 12:33:52 2022 (2 seconds) with 0 GB free disk space  !!! WARNING !!!

I noticed it says 0 GB free disk space; could that be the problem?

skoren commented 2 years ago

Yes, if you are out of space the jobs will definitely fail; you should see something like disk write errors in the logs of the failed jobs. Most of the jobs failed, which means you need quite a bit more space than you have now.
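Before rerunning, it is worth checking how much space and quota you actually have. A rough sketch for a Lustre filesystem (the mount point and group name are placeholders for your allocation; Compute Canada may also provide its own reporting tools):

    df -h /lustre06                                # overall usage on the filesystem
    lfs quota -u $USER /lustre06                   # per-user Lustre quota
    lfs quota -g <your_project_group> /lustre06    # per-group/project quota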

I also noticed in your previous log:

Found java:
   /cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/java/14.0.2/bin/java
   Picked up JAVA_TOOL_OPTIONS: -Xmx2g

Your environment forces the JVM to always use 2 GB. This is incorrect and should be disabled, because Canu knows how much memory each job needs (in your case 13 GB/job), and the forced limit will make jobs very slow or fail when the JVM cannot allocate enough memory.
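If that variable is exported by your shell profile or a loaded module, one thing to try (only a sketch; it will not help if the java module re-exports the variable inside each compute job) is clearing it before launching Canu:

    # drop the forced 2g cap so Canu can pass its own -Xmx sized to each job's memory request
    unset JAVA_TOOL_OPTIONS
    canu -p myassembly -d canu_assembly genomeSize=55m -nanopore *.fastq \
         useGrid=true gridOptions="--time 36:00:00"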

KarimAI7 commented 2 years ago

Here is a log file:

Running job 46 based on SLURM_ARRAY_TASK_ID=46 and offset=0.
Fetch blocks/000014.dat
Fetch blocks/000015.dat
Fetch blocks/000016.dat
Fetch blocks/000017.dat
Fetch blocks/000018.dat
Fetch blocks/000019.dat
Fetch blocks/000020.dat
Fetch blocks/000021.dat
Fetch blocks/000022.dat
Fetch blocks/000023.dat
Fetch blocks/000024.dat

Running block 000013 in query 000046

Picked up JAVA_TOOL_OPTIONS: -Xmx2g
Running with these settings:
--filter-threshold = 1.0E-7
--help = false
--max-shift = 0.2
--min-olap-length = 500
--min-store-length = 0
--no-rc = false
--no-self = false
--no-tf = false
--num-hashes = 512
--num-min-matches = 3
--num-threads = 16
--ordered-kmer-size = 14
--ordered-sketch-size = 1000
--repeat-idf-scale = 10.0
--repeat-weight = 0.9
--settings = 0
--store-full-id = true
--supress-noise = 0
--threshold = 0.78
--version = false
-f =
-h = false
-k = 16
-p =
-q = queries/000046
-s = ./blocks/000013.dat

Processing files for storage in reverse index...
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences stored: 5000...
Current # sequences stored: 10000...
Current # sequences stored: 15000...
Current # sequences stored: 20000...
Current # sequences stored: 25000...
Current # sequences stored: 30000...
Current # sequences stored: 35000...
Current # sequences stored: 40000...
Current # sequences stored: 45000...
Current # sequences stored: 50000...
Current # sequences stored: 55000...
Current # sequences stored: 60000...
Current # sequences stored: 65000...
Current # sequences stored: 70000...
Stored 70200 sequences in the index.
Processed 70200 unique sequences (fwd and rev).
Time (s) to read and hash from file: 12.878285513000002
Time (s) to score and output to self: 5732.143230646
Opened fasta file /lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly/correction/1-overlapper/blocks/000014.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
writeToFile()-- After writing 270113 out of 451701 'ovFile::writeBuffer::sb' objects (1 bytes each): Disk quota exceeded
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Processed 35100 to sequences.
Time (s) to score, hash to-file, and output: 10258.222837360001
Opened fasta file /lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly/correction/1-overlapper/blocks/000015.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Processed 35100 to sequences.
Time (s) to score, hash to-file, and output: 9337.448781627001
Opened fasta file /lustre06/project/6058390/kitani/T_cruzi_data/file_all/canu_assembly/correction/1-overlapper/blocks/000016.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
slurmstepd: error: *** JOB 4571535 ON nc31130 CANCELLED AT 2022-04-12T15:50:32 DUE TO TIME LIMIT ***

It says due to time limit again.

Also, regarding the JVM: that is what gets picked up from the Compute Canada environment. I will try to change it.

skoren commented 2 years ago

There are both out-of-disk errors and a timeout:

writeToFile()-- After writing 270113 out of 451701 'ovFile::writeBuffer::sb' objects (1 bytes each): Disk quota exceeded

It's also running very slowly, I expect because of the JVM memory cap. You can estimate how long this job would take at the current speed by counting how many files are in 1-overlapper/queries/000046: it finished about 2.25 files before being killed, so scale that up by the number of files there to estimate the total time.
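A rough way to get that estimate from the shell (the arithmetic is only illustrative):

    # count how many query files this job has to process in total
    ls correction/1-overlapper/queries/000046 | wc -l
    # the log shows ~2.25 files finished within the 8-hour limit, so approximately:
    #   estimated total hours = (file count / 2.25) * 8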

Either way, you may as well kill this run and try to get more space; remove the run folder and re-start from scratch when you do. If you can't disable the JVM option, you can also add mhapMemory=2, which will create more jobs and require restarting from scratch in a new folder, but will fit within the fixed JVM limit.

KarimAI7 commented 2 years ago

I killed the run and tried running it on a much smaller file. I did not get any space issues, but I still got: slurmstepd: error: *** JOB 4571535 ON nc31130 CANCELLED AT 2022-04-12T15:50:32 DUE TO TIME LIMIT ***

I am rerunning with the smaller file, with the gridOptions time increased from 8 hours to 36. I also added mhapMemory=2.

Is there a way to run Canu on Compute Canada without specifying a time limit, i.e. just letting it run to completion regardless of how long it takes?

skoren commented 2 years ago

I don't know what the Compute Canada grid allows; if it is Slurm, I would guess a runtime has to be specified. You can increase it beyond 36 hours, up to the maximum time limit of your partition. You can also look at the FAQ for options that can speed up this step: https://canu.readthedocs.io/en/latest/faq.html#my-assembly-is-running-out-of-space-is-too-slow

KarimAI7 commented 2 years ago

Alright, I will cancel my current run and add those parameters. I noticed the FAQ says to add mhapMemory=60g, which is significantly more than the 2g my JVM is limited to. Should I copy all the other parameters as-is and keep mhapMemory=2g?

skoren commented 2 years ago

Yes, keep the memory at 2g because the JVM on your system is hard-coded to that.
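So, after clearing out the old run directory, the restart could look roughly like this (the directory name and time limit are only examples; the other options are as in your earlier command):

    canu -p myassembly -d canu_assembly_new \
         genomeSize=55m -nanopore *.fastq \
         useGrid=true \
         gridOptions="--time 36:00:00" \
         mhapMemory=2g    # matches the hard-coded 2g JVM cap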

KarimAI7 commented 2 years ago

I tried running it with the above parameters. When I checked on the run a while later, I noticed it had stopped on its own. I tried running again and the same thing happened.

I checked the canu.out file and here is what I got:

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '14.0.2' (from '/cvmfs/soft.computecanada.ca/easybuild/software/2020/Core/java/14.0.2/bin/java') without -d64 support.
-- Detected gnuplot version '5.2 patchlevel 8   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 1 CPUs and 4096 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /opt/software/slurm/bin/sinfo.
-- Detected Slurm with task IDs up to 9999 allowed.
--
-- Slurm support detected.  Resources available:
--     33 hosts with  64 cores and 2008 GB memory.
--    159 hosts with  48 cores and  497 GB memory.
--    1109 hosts with  64 cores and  248 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     12.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap        8.000 GB    4 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap    2.000 GB   16 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl     8.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       15.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB    8 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found Nanopore reads in 'myassembly.seqStore':
--   Libraries:
--     Nanopore:              421
--   Reads:
--     Raw:                   1480740698
--
--
-- Generating assembly 'myassembly' in '/lustre06/project/6058390/kitani/T_cruzi_data/file_n59-b6/canu_assembly':
--   genomeSize:
--     55000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.3200 ( 32.00%)
--     obtOvlErrorRate 0.1200 ( 12.00%)
--     utgOvlErrorRate 0.1200 ( 12.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.3000 ( 30.00%)
--     obtErrorRate    0.1200 ( 12.00%)
--     utgErrorRate    0.1200 ( 12.00%)
--     cnsErrorRate    0.2000 ( 20.00%)
--
--   Stages to run:
--     correct raw reads.
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
--
-- BEGIN CORRECTION
-- Meryl finished successfully.  Kmer frequency histogram:
--
--  16-mers                                                                                           Fraction
--    Occurrences   NumMers                                                                         Unique Total
--       1-     1         0                                                                        0.0000 0.0000
--       2-     2  44065704 ********************************************************************** 0.4595 0.0710
--       3-     4  20511037 ********************************                                       0.6073 0.1053
--       5-     7   7609633 ************                                                           0.7103 0.1400
--       8-    11   4933827 *******                                                                0.7676 0.1702
--      12-    16   5030163 *******                                                                0.8152 0.2086
--      17-    22   4907756 *******                                                                0.8662 0.2675
--      23-    29   3541136 *****                                                                  0.9146 0.3434
--      30-    37   1864259 **                                                                     0.9482 0.4126
--      38-    46    879492 *                                                                      0.9656 0.4582
--      47-    56    521535                                                                        0.9740 0.4858
--      57-    67    366198                                                                        0.9792 0.5067
--      68-    79    275288                                                                        0.9829 0.5245
--      80-    92    212476                                                                        0.9856 0.5404
--      93-   106    171813                                                                        0.9878 0.5549
--     107-   121    137708                                                                        0.9896 0.5685
--     122-   137    116397                                                                        0.9910 0.5810
--     138-   154     98762                                                                        0.9922 0.5931
--     155-   172     83303                                                                        0.9932 0.6046
--     173-   191     70271                                                                        0.9940 0.6155
--     192-   211     59596                                                                        0.9948 0.6257
--     212-   232     50506                                                                        0.9954 0.6353
--     233-   254     43728                                                                        0.9959 0.6443
--     255-   277     37635                                                                        0.9964 0.6528
--     278-   301     31990                                                                        0.9967 0.6608
--     302-   326     27450                                                                        0.9971 0.6682
--     327-   352     23901                                                                        0.9974 0.6751
--     353-   379     20720                                                                        0.9976 0.6816
--     380-   407     18340                                                                        0.9978 0.6877
--     408-   436     16135                                                                        0.9980 0.6935
--     437-   466     14365                                                                        0.9982 0.6990
--     467-   497     13150                                                                        0.9983 0.7042
--     498-   529     12120                                                                        0.9985 0.7093
--     530-   562     10952                                                                        0.9986 0.7143
--     563-   596     10143                                                                        0.9987 0.7191
--     597-   631      8979                                                                        0.9988 0.7238
--     632-   667      7881                                                                        0.9989 0.7282
--     668-   704      6963                                                                        0.9990 0.7323
--     705-   742      6373                                                                        0.9991 0.7362
--     743-   781      5774                                                                        0.9991 0.7399
--     782-   821      5501                                                                        0.9992 0.7434
--
--
--           0 (max occurrences)
--  1241074229 (total mers, non-unique)
--    95901390 (distinct mers, non-unique)
--           0 (unique mers)
-- Finished stage 'meryl-process', reset canuIteration.
--
-- Removing meryl database 'correction/0-mercounts/myassembly.ms16'.
--
-- OVERLAPPER (mhap) (correction)
--
-- Set corMhapSensitivity=high based on read coverage of 26.92.
--
-- PARAMETERS: hashes=768, minMatches=2, threshold=0.73
--
-- Given 1.8 GB, can fit 450 reads per block.
-- For 2474 blocks, set stride to 618 blocks.
-- Logging partitioning to 'correction/1-overlapper/partitioning.log'.
mkdir correction/1-overlapper/queries/000795: Disk quota exceeded at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/canu/2.2/bin/../lib/site_

It says disk quota exceeded, so I think the issue is now on the Compute Canada side?

I tried checking a log file under correction/1-overlapper/mhap.*.out, but in the 1-overlapper directory I only found partitioning.log and queries.

I am not sure why the run keeps stopping on its own, though.

brianwalenz commented 2 years ago

It ran out of disk space, couldn't make a directory, and failed because of that.

I'm not at all familiar with Compute Canada, but docs at https://docs.computecanada.ca/wiki/Compute_Canada_Documentation, specifically https://docs.computecanada.ca/wiki/Scratch_purging_policy, hint there is a scratch space you could possibly use to generate the assembly, then copy the result back to your project space when it is done.
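A sketch of that workflow, assuming the cluster exposes scratch through a $SCRATCH environment variable (the destination path is a placeholder):

    # run the assembly in scratch, where the quota is larger (note that files there are purged periodically)
    cd $SCRATCH
    canu -p myassembly -d canu_assembly \
         genomeSize=55m -nanopore *.fastq \
         useGrid=true gridOptions="--time 36:00:00" mhapMemory=2g
    # when the run finishes, copy the results back to project space
    rsync -av $SCRATCH/canu_assembly/ /project/<your_allocation>/canu_assembly/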

KarimAI7 commented 2 years ago

Hello,

I ran the assembly on scratch and Canu seemed to run well, but I got the following error:

-- Running jobs.  First attempt out of 2.
--
-- Failed to submit compute jobs.  Delay 10 seconds and try again.

CRASH:
CRASH: canu 2.2
CRASH: Please panic, this is abnormal.
CRASH:
CRASH:   Failed to submit compute jobs.
CRASH:
CRASH: Failed at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/canu/2.2/bin/../lib/site_perl/canu/Execution.pm line 1259.
CRASH:  canu::Execution::submitOrRunParallelJob("myassembly", "ovS", "correction/myassembly.ovlStore.BUILDING", "scripts/2-sort", 1, 2, 3, 4, ...) called at >
CRASH:  canu::OverlapStore::overlapStoreSorterCheck("correction", "myassembly", "cor", 157, 4181) called at /cvmfs/soft.computecanada.ca/easybuild/software/2>
CRASH:  canu::OverlapStore::createOverlapStore("myassembly", "cor") called at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/canu/2.2/bin/can>
CRASH:  main::overlap("myassembly", "cor") called at /cvmfs/soft.computecanada.ca/easybuild/software/2020/avx2/Core/canu/2.2/bin/canu line 1079
CRASH:
CRASH: Last 50 lines of the relevant log file (correction/myassembly.ovlStore.BUILDING/scripts/2-sort.jobSubmit-01.out):
CRASH:
CRASH: sbatch: error: AssocMaxSubmitJobLimit
CRASH: sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)
CRASH:

I contacted Compute Canada, who informed me that there is a limit of 1000 jobs on their cluster.
I am not sure how to adjust the parameters to accommodate this.

brianwalenz commented 2 years ago

There's discussion of this in issue #1883.

skoren commented 2 years ago

Idle; the original issues with runtime and space limits are resolved. The job-limit workaround is described in the linked issue.