marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

exceeding memory - cannot trace back #624

dominik-handler closed this issue 6 years ago

dominik-handler commented 7 years ago

Hi,

I am running into some memory issues with Canu. The log file canu.03.out looks like the following:

-- Canu snapshot v1.6 +25 changes (r8437 c8f31513fb9bf28abc902ba2a621349e4ec02bd6)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošic M, Šikic M.
--   Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
--   Li H.
--   Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences.
--   Bioinformatics. 2016 Jul 15;32(14):2103-10.
--   http://doi.org/10.1093/bioinformatics/btw152
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_121' (from '/software/170828/software/java/1.8.0_121/bin/java').
-- Detected 56 CPUs and 189 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /cm/shared/apps/slurm/17.02.7/bin/sinfo.
-- Detected Slurm with 'MaxArraySize' limited to 1000 jobs.
-- 
-- Found   1 host  with  28 cores and  188 GB memory under Slurm control.
-- Found   3 hosts with   1 core  and    0 GB memory under Slurm control.
-- Found   5 hosts with  28 cores and  188 GB memory under Slurm control.
-- Found   4 hosts with  28 cores and  755 GB memory under Slurm control.
-- Found   1 host  with  28 cores and    0 GB memory under Slurm control.
-- Found  13 hosts with  28 cores and  125 GB memory under Slurm control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl     62 GB   14 CPUs  (k-mer counting)
-- Grid:  cormhap   30 GB   14 CPUs  (overlap detection with mhap)
-- Grid:  obtmhap   20 GB   14 CPUs  (overlap detection with mhap)
-- Grid:  utgmhap   20 GB   14 CPUs  (overlap detection with mhap)
-- Grid:  cor       30 GB    4 CPUs  (read correction)
-- Grid:  ovb       20 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       20 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30 GB    6 CPUs  (read error detection)
-- Grid:  oea       20 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       31 GB    7 CPUs  (contig construction)
-- Grid:  cns       31 GB    7 CPUs  (consensus)
-- Grid:  gfa       20 GB    7 CPUs  (GFA alignment and processing)
--
-- Found Nanopore uncorrected reads in 'correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.gkpStore'.
--
-- Generating assembly 'V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes' in 'ALBACORE_2.0.2_V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes/canu-V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes'
--
-- Parameters:
--
--  genomeSize        150000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.3200 ( 32.00%)
--    obtOvlErrorRate 0.2250 ( 22.50%)
--    utgOvlErrorRate 0.2250 ( 22.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.5000 ( 50.00%)
--    obtErrorRate    0.2250 ( 22.50%)
--    utgErrorRate    0.2250 ( 22.50%)
--    cnsErrorRate    0.2250 ( 22.50%)
--
--
-- BEGIN CORRECTION
--
-- Found 3 mhap overlap output files.
-- Finished stage 'cor-mhapCheck', reset canuIteration.
----------------------------------------
-- Starting command on Mon Sep 11 15:38:31 2017 with 575858.947 GB free disk space

    cd correction
    canu/Linux-amd64/bin/ovStoreBuild \
     -O ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore.BUILDING \
     -G ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.gkpStore \
     -M 20-20 \
     -L ./1-overlapper/ovljob.files \
     > ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore.err 2>&1

-- Finished on Mon Sep 11 15:42:37 2017 (246 seconds) with 575813.374 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Mon Sep 11 15:42:37 2017 with 575813.364 GB free disk space

    cd correction
    canu/Linux-amd64/bin/ovStoreDump \
     -G ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.gkpStore \
     -O ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore \
     -d -counts \
     > ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore/counts.dat 2> ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore/counts.err

-- Finished on Mon Sep 11 15:42:37 2017 (lickety-split) with 575813.327 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore' successfully constructed.
-- Found 432481698 overlaps for 223330 reads; 774 reads have no overlaps.
--
FILE correction/1-overlapper/results/000001.ovb
FILE correction/1-overlapper/results/000002.ovb
FILE correction/1-overlapper/results/000003.ovb
FILE correction/1-overlapper/results/000001.counts
FILE correction/1-overlapper/results/000002.counts
FILE correction/1-overlapper/results/000003.counts
DIRECTORY correction/1-overlapper/queries
DIRECTORY correction/1-overlapper/queries/000002
FILE correction/1-overlapper/queries/000002/000003.dat
DIRECTORY correction/1-overlapper/queries/000003
DIRECTORY correction/1-overlapper/queries/000001
FILE correction/1-overlapper/queries/000001/000003.dat
FILE correction/1-overlapper/queries/000001/000002.dat
FILE correction/1-overlapper/blocks/000001.dat
FILE correction/1-overlapper/blocks/000002.dat
FILE correction/1-overlapper/blocks/000003.dat
REMOVE TREE correction/1-overlapper/queries/000003
REMOVE TREE correction/1-overlapper/queries
REMOVE TREE correction/1-overlapper/queries/000002
REMOVE TREE correction/1-overlapper/queries/000001
--
-- Purged 11.466 GB in 12 overlap output files.
-- Overlap store 'correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore' statistics not available (skipped in correction and trimming stages).
--
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Computing global filter scores 'correction/2-correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.globalScores'.
----------------------------------------
-- Starting command on Mon Sep 11 15:42:38 2017 with 575821.126 GB free disk space

    cd correction/2-correction
    canu/Linux-amd64/bin/filterCorrectionOverlaps \
      -estimate -nolog \
      -G ../V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.gkpStore \
      -O ../V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore \
      -S ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.globalScores.WORKING \
      -c 40 \
      -l 0 \
      -e 0.22 \
    > ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.globalScores.err 2>&1

-- Finished on Mon Sep 11 15:42:38 2017 (lickety-split) with 575821.104 GB free disk space
----------------------------------------
--  PARAMETERS:
--  ----------
--  
--       40 (expected coverage)
--        0 (don't use overlaps shorter than this)
--    0.000 (don't use overlaps with erate less than this)
--    0.220 (don't use overlaps with erate more than this)
--  
--  OVERLAPS:
--  --------
--  
--  IGNORED:
--  
--             0 (< 0.0000 fraction error)
--             0 (> 0.2200 fraction error)
--             0 (< 0 bases long)
--             0 (> 2097151 bases long)
--  
--  FILTERED: (estimated)
--  
--     423636556 (too many overlaps, discard these shortest ones)
--  
--  EVIDENCE: (estimated)
--  
--       8845142 (longest overlaps)
--  
--  TOTAL:
--  
--     432481698 (all overlaps)
--  
--  READS: (estimated)
--  -----
--  
--           775 (no overlaps)
--          4447 (no overlaps filtered)
--         26239 (<=  50% overlaps filtered)
--        120018 (<=  80% overlaps filtered)
--        142655 (<=  95% overlaps filtered)
--        223330 (<= 100% overlaps filtered)
--  
-- Computing expected corrected read lengths 'correction/2-correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.estimate.log'.
----------------------------------------
-- Starting command on Mon Sep 11 15:42:39 2017 with 575821.041 GB free disk space

    cd correction/2-correction
    canu/Linux-amd64/bin/generateCorrectionLayouts \
      -G ../V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.gkpStore \
      -O ../V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.ovlStore \
      -S ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.globalScores \
      -eE 0.22 \
      -ec 0 \
      -eC 80 \
      -p ./V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.estimate.WORKING

-- Finished on Mon Sep 11 15:43:09 2017 (30 seconds) with 575817.877 GB free disk space
----------------------------------------
-- Sorting reads by expected corrected length.
-- Sorting reads by uncorrected length.
-- Loading expected corrected read lengths.
-- Picking longest corrected reads.
-- Writing longest corrected reads to 'correction/2-correction/V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes.readsToCorrect'.
-- Summarizing filter.
--
-- Reads to be corrected:
--   83531 reads longer than 47329 bp
--   6047427698 bp
-- Expected corrected reads:
--   83531 reads
--   6000013346 bp
--   45809 bp minimum length
--   71830 bp mean length
--   272251 bp n50 length
--
-- WARNING: gnuplot failed; no plots will appear in HTML output.
--
----------------------------------------
-- Using overlaps no worse than 0.5 fraction error for correcting reads (from corErrorRate parameter).
-- Finished stage 'cor-buildCorrectionLayouts', reset canuIteration.
--
-- Running jobs.  First attempt out of 2.
--
-- 'correctReads.jobSubmit-01.sh' -> job 12278 tasks 1-23.
--
----------------------------------------
-- Starting command on Mon Sep 11 15:43:13 2017 with 575816.947 GB free disk space

    cd ALBACORE_2.0.2_V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes/canu-V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes
    sbatch \
      --depend=afterany:12278 \
      --mem-per-cpu=20g \
      --cpus-per-task=1   \
      -D `pwd` \
      -J 'canu_V4_FiltLong-lengthBIAS-5_canu-smash-haplotypes' \
      -o canu-scripts/canu.04.out canu-scripts/canu.04.sh
Submitted batch job 12279

-- Finished on Mon Sep 11 15:43:13 2017 (lickety-split) with 575816.943 GB free disk space
----------------------------------------
slurmstepd: error: Exceeded step memory limit at some point.
slurmstepd: error: Exceeded job memory limit at some point.

I am not able to trace back the point where it exceeded the memory limit. According to the Slurm history, it never came close to the limits. Also, there is no error message from any of the compute jobs.
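For what it's worth, Slurm's accounting database can show per-step peak memory, which is the usual way to trace which step tripped the limit. A sketch, assuming `sacct` is available on the cluster; the job IDs are the ones from the log above, and the 18 GB threshold in the `awk` filter is purely illustrative:

```shell
# List per-step peak memory (MaxRSS) next to the requested memory (ReqMem)
# for the jobs canu submitted; --units=G reports sizes in gigabytes.
sacct -j 12278,12279 --units=G \
      --format=JobID,JobName%30,ReqMem,MaxRSS,State

# Or flag only the steps that peaked above ~18 GB of a 20 GB request
# (the 18 GB threshold is illustrative, not from the log):
sacct -j 12278,12279 --noheader --parsable2 --units=G \
      --format=JobID,MaxRSS,State |
  awk -F'|' '$2+0 > 18 {print $1, "peaked at", $2}'
```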

As you can see, I already tried bumping minMemory up to 20G and still get the same issue. Do you have any suggestions?

Thank you, Dominik

brianwalenz commented 7 years ago

The log looks like it finished successfully, but warned that it exceeded a memory limit at some point. It might be the ovStoreBuild command, which was told to use up to 20 GB of memory - but that command finished successfully.

Did canu stop running? It submitted the next compute job (12278) and the next step in the pipeline (12279).
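If the overlap-store build does keep hitting its cap, its memory allowance can be raised on the canu command line. A sketch, assuming the `ovsMemory` option is what produces the `-M 20-20` seen in the `ovStoreBuild` command above; the prefix, directory, and read-file names here are placeholders, not the ones from this run:

```shell
# Placeholder names (asm, run-dir, reads.fastq); genomeSize matches the
# 150000000 reported in the log.  ovsMemory raises the overlap-store
# memory allowance from the default to 32 GB (an illustrative value).
canu -p asm -d run-dir \
     genomeSize=150m \
     ovsMemory=32 \
     -nanopore-raw reads.fastq
```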

dominik-handler commented 7 years ago

It continues, but last time (before I set the general 20G minMemory) one of the subsequent array jobs failed. This attempt is still running at the moment. I will wait for the result and come back to you if it fails again.

Dominik

skoren commented 6 years ago

Inactive; the last comment indicates it was running. Closing.