marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Memory allocation issue #1693

Closed psur9757 closed 4 years ago

psur9757 commented 4 years ago

My job keeps using 4 GB of RAM and getting killed, even though I specify minMemory=16. gnuplot also fails every time.

PBS Script

#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00

module load canu/2.0
module load gnuplot/5.2.8
module load libgd/2.2.1

cd $outdir
canu -p CR -d canu1 genomeSize=800m corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 minMemory=16 corConcurrency=4 -pacbio-raw pacbio_subreads.fq

Canu.out

Found perl:
   /usr/bin/perl

Found java:
   /usr/bin/java
   openjdk version "1.8.0_151"

Found canu:
   /usr/local/canu/2.0/bin/canu
   Canu 2.0

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 48 CPUs and 188 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
-- 
-- Found  49 hosts with  48 cores and  185 GB memory under PBSPro control.
-- Found  27 hosts with  36 cores and  187 GB memory under PBSPro control.
-- Found  80 hosts with  32 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  502 GB memory under PBSPro control.
-- Found  58 hosts with  24 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  250 GB memory under PBSPro control.
-- Found   3 hosts with  64 cores and 6057 GB memory under PBSPro control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     24.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB    8 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       16.000 GB    4 CPUs  (read correction)
-- Grid:  ovb       16.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea       16.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns       16.000 GB    8 CPUs  (consensus)
-- Grid:  gfa       32.000 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        800000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
-- Meryl finished successfully.  Kmer frequency histogram:
--
-- WARNING: gnuplot failed.
--
----------------------------------------
/var/spool/PBS/mom_priv/jobs/4210659.pbsserver.SC: line 51: 201646 Killed                  /usr/bin/env perl $bin/canu -p 'OSR' 'genomeSize=800m' 'corOutCoverage=200' 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' 'gridOptions=-P OSR -l walltime=156:00:00' 'gridOptionsJobName=canu1' 'minMemory=16' 'corConcurrency=4' -pacbio-raw '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/1.data/pacbio_subreads.fq' canuIteration=1

Any help would be much appreciated.

skoren commented 4 years ago

I don't see 4 GB anywhere in the Canu configuration or in your command; where did that value come from? I would not specify corConcurrency or even minMemory: your genome is large enough that Canu will use more memory than that anyway, and concurrency doesn't matter on the grid.

The job is likely getting killed by the faulty gnuplot (which is what it was running right before being killed). Since the plots are optional, I would simply not load gnuplot at all; leave the command unavailable and Canu will continue without plotting. You can just re-run your last command to resume the run.
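
If you want to first confirm that gnuplot is genuinely broken outside of Canu, a quick manual check (a sketch, using the module names from your script) is:

module load gnuplot/5.2.8 libgd/2.2.1
which gnuplot
gnuplot --version     # a working install prints its version and exits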

psur9757 commented 4 years ago

I tried your suggestion and it still dies, and I still get the gnuplot error. I know this script worked with older versions (1.6-1.8).

PBS script

#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00

module load canu/2.0

cd canu1
canu -p CR -d canu1 genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 -pacbio-raw pacbio_subreads.fq

Canu.out

Found perl:
   /usr/bin/perl

Found java:
   /usr/bin/java
   openjdk version "1.8.0_151"

Found canu:
   /usr/local/canu/2.0/bin/canu
   Canu 2.0

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 48 CPUs and 188 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
-- 
-- Found  49 hosts with  48 cores and  185 GB memory under PBSPro control.
-- Found  27 hosts with  36 cores and  187 GB memory under PBSPro control.
-- Found  80 hosts with  32 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  502 GB memory under PBSPro control.
-- Found  58 hosts with  24 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  250 GB memory under PBSPro control.
-- Found   3 hosts with  64 cores and 6057 GB memory under PBSPro control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     30.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
-- Grid:  gfa       32.000 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        1600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
-- Meryl finished successfully.  Kmer frequency histogram:
--
-- WARNING: gnuplot failed.
--
----------------------------------------
/var/spool/PBS/mom_priv/jobs/4220843.pbsserver.SC: line 51: 16739 Killed                  /usr/bin/env perl $bin/canu -p 'OSR' 'genomeSize=1.6g' 'corOutCoverage=200' 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' 'gridOptions=-P OSR -l walltime=156:00:00' 'gridOptionsJobName=canu1' -pacbio-raw '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/1.data/pacbio_subreads.fq' canuIteration=1

I also checked with our HPC team about one of the failures from the first script, and here is what they said:

You are not asking for enough memory.
The error on the node is:
4/22/2020 13:37:23;0008;pbs_python;Job;4204384.pbsserver;Cgroup memory limit exceeded: Killed process 401565, UID 596213, (perl) total-vm:5152684kB, anon-rss:4193824kB, file-rss:2088kB

Looks like you only asked for 4gb.

Associated command: canu -p CR -d canu1 genomeSize=800m corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 corConcurrency=4 -pacbio-raw pacbio_subreads.fq
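
(If it helps, the request can be confirmed from PBSPro's job history, assuming history is enabled on the server, with something like the following, using the job id from the cgroup message above:

qstat -fx 4204384 | grep -i Resource_List.mem
)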

skoren commented 4 years ago

Ah, right; I didn't think about the executive itself requesting less memory than 16 GB. The minimum memory is only applied to the steps in the pipeline, not to the executive itself, which we should fix. You want executiveMemory=16 (or 32), which will also increase the grid request to 16 GB.
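
Sticking with your first command, that would look something like this (a sketch; only executiveMemory is added, and minMemory/corConcurrency are dropped as suggested above):

canu -p CR -d canu1 genomeSize=800m corOutCoverage=200 \
  "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" \
  gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 \
  executiveMemory=32 \
  -pacbio-raw pacbio_subreads.fq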

psur9757 commented 4 years ago

So I updated the script and ran into another error.

PBS Script:

#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00

module load canu/2.0

canu -p OSR -d $outdir genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 executiveMemory=32 -pacbio-raw $input

Canu.out

Found perl:
   /usr/bin/perl

Found java:
   /usr/bin/java
   openjdk version "1.8.0_151"

Found canu:
   /usr/local/canu/2.0/bin/canu
   Canu 2.0

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
-- 
-- Found  49 hosts with  48 cores and  185 GB memory under PBSPro control.
-- Found  27 hosts with  36 cores and  187 GB memory under PBSPro control.
-- Found  80 hosts with  32 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  502 GB memory under PBSPro control.
-- Found  58 hosts with  24 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  250 GB memory under PBSPro control.
-- Found   3 hosts with  64 cores and 6057 GB memory under PBSPro control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     30.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
-- Grid:  gfa       32.000 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        1600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
--
-- Mhap overlap jobs failed, tried 2 times, giving up.
--   job correction/1-overlapper/results/000042.ovb FAILED.
--   job correction/1-overlapper/results/000052.ovb FAILED.
--   job correction/1-overlapper/results/000054.ovb FAILED.
--   job correction/1-overlapper/results/000055.ovb FAILED.
--   job correction/1-overlapper/results/000056.ovb FAILED.
--   job correction/1-overlapper/results/000060.ovb FAILED.
--   job correction/1-overlapper/results/000062.ovb FAILED.
--   job correction/1-overlapper/results/000063.ovb FAILED.
--   job correction/1-overlapper/results/000064.ovb FAILED.
--   job correction/1-overlapper/results/000065.ovb FAILED.
--   job correction/1-overlapper/results/000066.ovb FAILED.
--   job correction/1-overlapper/results/000067.ovb FAILED.
--   job correction/1-overlapper/results/000068.ovb FAILED.
--   job correction/1-overlapper/results/000069.ovb FAILED.
--   job correction/1-overlapper/results/000070.ovb FAILED.
--   job correction/1-overlapper/results/000071.ovb FAILED.
--   job correction/1-overlapper/results/000072.ovb FAILED.
--   job correction/1-overlapper/results/000073.ovb FAILED.
--   job correction/1-overlapper/results/000074.ovb FAILED.
--   job correction/1-overlapper/results/000075.ovb FAILED.
--   job correction/1-overlapper/results/000076.ovb FAILED.
--   job correction/1-overlapper/results/000090.ovb FAILED.
--   job correction/1-overlapper/results/000092.ovb FAILED.
--   job correction/1-overlapper/results/000098.ovb FAILED.
--

ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

What parameters do I need to change? Thank you!!

skoren commented 4 years ago

This is usually an issue with the JVM configuration on your cluster; it can be configured to have extra overhead that isn't present on our test clusters. You could explicitly increase the memory for the JVM job (gridOptionsMhap="-l select=1:ncpus=8:mem=64GB"), but it would be good to investigate why it's using more memory first. What is the error from one of the failing files (correction/1-overlapper/*98*.out for example)?
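
For example, something like this from the assembly (-d) directory should show the relevant tail (a sketch; exact log file names depend on how the grid jobs were submitted):

tail -50 correction/1-overlapper/*98*.out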

psur9757 commented 4 years ago

Is this the correct file?

$ tail -50 correction/1-overlapper/*.98

Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences loaded and processed from file: 75000...
Current # sequences loaded and processed from file: 80000...
Processed 81000 to sequences.
Time (s) to score, hash to-file, and output: 465.81968653200005
Opened fasta file /scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/blocks/000069.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences loaded and processed from file: 75000...
Current # sequences loaded and processed from file: 80000...
Processed 81000 to sequences.
Time (s) to score, hash to-file, and output: 727.9392699240001
Total scoring time (s): 9608.808452318
Total time (s): 9624.379266296
MinHash search time (s): 6142.355619701
Total matches found: 864643020
Average number of matches per lookup: 485.2093265993266
Average number of table elements processed per lookup: 13733.510815375983
Average number of table elements processed per match: 28.304300973828482
Average % of hashed sequences hit per lookup: 6.272848686106608
Average % of hashed sequences hit that are matches: 4.774735450943637
Average % of hashed sequences fully compared that are matches: 72.42336803473859
sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
mhapConvert: mhap/mhapConvert.C:122: int main(int, char**): Assertion `W.toint32(10) <= W.toint32(11)' failed.
/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/mhap.sh: line 1485: 360055 Aborted                 (core dumped) $bin/mhapConvert -S ../../OSR.seqStore -o ./results/$qry.mhap.ovb.WORKING $outPath/$qry.mhap

skoren commented 4 years ago

That looks like a disk corruption or disk space issue. I expect the input file (/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/results/*98*mhap) is truncated or corrupted. Did this job get killed by your scheduler, or was there some other error on the system while it was running? Are you over your disk quota? Running out of space is the likeliest explanation.

If you confirm the above aren't an issue (not out of space, not getting killed), you can remove the intermediate bad files (*.mhap and *.WORKING in the correction/1-overlapper/results folder) and re-run the original command, as sketched below.
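
Concretely, from the assembly directory, something like this (a sketch) before resubmitting the same PBS script:

# remove the partial mhap outputs so the failed jobs start clean
rm -f correction/1-overlapper/results/*.mhap
rm -f correction/1-overlapper/results/*.WORKING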

psur9757 commented 4 years ago

The HPC team confirmed that scratch had become low on space, so I deleted the intermediate bad files and restarted the command.

psur9757 commented 4 years ago

The updated script ran into another error, and I cannot figure it out:

Canu.out

Found perl:
   /usr/bin/perl

Found java:
   /usr/bin/java
   openjdk version "1.8.0_151"

Found canu:
   /usr/local/canu/2.0/bin/canu
   Canu 2.0

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
-- 
-- Found  49 hosts with  48 cores and  185 GB memory under PBSPro control.
-- Found  27 hosts with  36 cores and  187 GB memory under PBSPro control.
-- Found  80 hosts with  32 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  502 GB memory under PBSPro control.
-- Found  58 hosts with  24 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  250 GB memory under PBSPro control.
-- Found   3 hosts with  64 cores and 6057 GB memory under PBSPro control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     30.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
-- Grid:  gfa       32.000 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        1600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- Creating overlap store correction/OSR.ovlStore using:
--    226 buckets
--    238 slices
--        using at most 29 GB memory each
-- Overlap store sorter finished.
-- No change in report.
-- Finished stage 'cor-overlapStoreSorterCheck', reset canuIteration.
----------------------------------------
-- Starting command on Sat May  9 11:34:19 2020 with 7755.833 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/ovStoreIndexer \
      -O  ./OSR.ovlStore.BUILDING \
      -S ../OSR.seqStore \
      -C  ./OSR.ovlStore.config \
      -delete \
    > ./OSR.ovlStore.BUILDING.index.err 2>&1

-- Finished on Sat May  9 11:39:36 2020 (317 seconds) with 7748.516 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Sat May  9 11:39:36 2020 with 7748.516 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/ovStoreDump \
     -S ../OSR.seqStore \
     -O  ./OSR.ovlStore \
     -counts \
     > ./OSR.ovlStore/counts.dat 2> ./OSR.ovlStore/counts.err

-- Finished on Sat May  9 11:39:41 2020 (5 seconds) with 7769.224 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/OSR.ovlStore' successfully constructed.
-- Found 293773456141 overlaps for 7117369 reads; 95252 reads have no overlaps.
--
--
-- Purged 2908.812 GB in 542 overlap output files.
-- No change in report.
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Set corMinCoverage=4 based on read coverage of 103.
-- Global filter scores will be estimated.
-- Computing correction layouts.
----------------------------------------
-- Starting command on Sat May  9 11:39:56 2020 with 7918.803 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/generateCorrectionLayouts \
      -S ../OSR.seqStore \
      -O  ./OSR.ovlStore \
      -C  ./OSR.corStore.WORKING \
      -eC 400 \
    > ./OSR.corStore.err 2>&1
sh: line 5: 51781 Aborted                 (core dumped) /usr/local/canu/2.0/bin/generateCorrectionLayouts -S ../OSR.seqStore -O ./OSR.ovlStore -C ./OSR.corStore.WORKING -eC 400 > ./OSR.corStore.err 2>&1

-- Finished on Sat May  9 23:55:52 2020 (44156 seconds) with 9260.643 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 134.  (rc=34304)
ERROR:

ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to generate correction layouts.
ABORT:
ABORT: Disk space available:  9260.643 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/OSR.corStore.err):
ABORT:
ABORT:   sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
ABORT:   ERROR: bogus overlap '   5489659     216378  I       0       0  11399  4294951323  18432  0.187000'
ABORT:   generateCorrectionLayouts: correction/generateCorrectionLayouts.C:114: void generateLayout(tgTig*, uint16*, uint32, double, double, ovOverlap*, uint32, FILE*): Assertion `ovlLength < (((uint32)1 << 21) - 1)' failed.
ABORT:   
ABORT:   Failed with 'Aborted'; backtrace (libbacktrace):
ABORT:   utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   correction/generateCorrectionLayouts.C::114 in generateLayout()
ABORT:   correction/generateCorrectionLayouts.C::351 in main()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:

Canu.pbs

#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00

module load canu/2.0

canu -p OSR -d $outdir genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsMhap="-l select=1:ncpus=8:mem=64GB" gridOptionsJobName=canu1 executiveMemory=32 -pacbio-raw $input

skoren commented 4 years ago

I think this is still the result of disk corruption. The store is checked right after building, which implies it was OK then and was corrupted at some later point. I'd suggest asking your IT group whether your filesystem was experiencing heavy load (due to Canu or not), and whether they have any suggestions to help mitigate these errors in the future. My concern is that you'll keep hitting these types of errors throughout your run.

Unfortunately, it isn't easy to fix disk corruption errors like this. The easiest option for you, if you're able to, is to modify the code to skip the bad overlaps and continue rather than failing. Hopefully not too many overlaps are corrupt in this way, so it would have minimal effect on the run. You'd have to change (I can share a binary with these modifications if you prefer):

110     if (ovlLength > AS_MAX_READLEN) {
111       char ovlString[1024];
112       fprintf(stderr, "ERROR: bogus overlap '%s'\n", ovl[oo].toString(ovlString, ovOverlapAsCoords, false));
113     }

to

110     if (ovlLength > AS_MAX_READLEN) {
111       char ovlString[1024];
112       fprintf(stderr, "ERROR: bogus overlap '%s'\n", ovl[oo].toString(ovlString, ovOverlapAsCoords, false));
+          continue;
113     }
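
If you patch this yourself, the rebuild is just a make in the source tree (a sketch; assumes a Canu 2.0 source checkout, and the checkout path is hypothetical):

cd canu-2.0/src        # your source checkout (hypothetical path)
make -j 8
# on Linux the new binaries land in ../Linux-amd64/bin; run that canu
# binary (or point your PATH at it) instead of the installed module's copy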

psur9757 commented 4 years ago

I talked to the HPC team, and they asked me to start a fresh job once they had sorted out the space issue. They insist there is no disk corruption. I restarted and got an error again. Is it the same issue? It looks the same to me.

Error:

Found perl:
   /usr/bin/perl

Found java:
   /usr/bin/java
   openjdk version "1.8.0_151"

Found canu:
   /usr/local/canu/2.0/bin/canu
   Canu 2.0

-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
-- 
-- Found  49 hosts with  48 cores and  185 GB memory under PBSPro control.
-- Found  27 hosts with  36 cores and  187 GB memory under PBSPro control.
-- Found  80 hosts with  32 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  502 GB memory under PBSPro control.
-- Found  58 hosts with  24 cores and  123 GB memory under PBSPro control.
-- Found   2 hosts with  24 cores and  250 GB memory under PBSPro control.
-- Found   3 hosts with  64 cores and 6057 GB memory under PBSPro control.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     30.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   30.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor       24.000 GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       30.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
-- Grid:  gfa       32.000 GB   16 CPUs  (GFA alignment and processing)
--
-- In 'CR.seqStore', found PacBio CLR reads:
--   PacBio CLR:               1
--
--   Raw:                      1
--
-- Generating assembly 'CR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu2':
--    - correct raw reads.
--    - trim corrected reads.
--    - assemble corrected and trimmed reads.
--
-- Parameters:
--
--  genomeSize        1600000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- Creating overlap store correction/CR.ovlStore using:
--    226 buckets
--    238 slices
--        using at most 29 GB memory each
-- Overlap store sorter finished.
-- No change in report.
-- Finished stage 'cor-overlapStoreSorterCheck', reset canuIteration.
----------------------------------------
-- Starting command on Wed May 20 23:11:56 2020 with 10652.791 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/ovStoreIndexer \
      -O  ./CR.ovlStore.BUILDING \
      -S ../CR.seqStore \
      -C  ./CR.ovlStore.config \
      -delete \
    > ./CR.ovlStore.BUILDING.index.err 2>&1

-- Finished on Wed May 20 23:20:41 2020 (525 seconds) with 10591.13 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Wed May 20 23:20:41 2020 with 10591.13 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/ovStoreDump \
     -S ../CR.seqStore \
     -O  ./CR.ovlStore \
     -counts \
     > ./CR.ovlStore/counts.dat 2> ./CR.ovlStore/counts.err

-- Finished on Wed May 20 23:20:43 2020 (2 seconds) with 10597.453 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/CR.ovlStore' successfully constructed.
-- Found 294084131436 overlaps for 7117440 reads; 95181 reads have no overlaps.
--
--
-- Purged 2908.811 GB in 542 overlap output files.
-- No change in report.
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Set corMinCoverage=4 based on read coverage of 103.
-- Global filter scores will be estimated.
-- Computing correction layouts.
----------------------------------------
-- Starting command on Wed May 20 23:20:54 2020 with 10624.02 GB free disk space

    cd correction
    /usr/local/canu/2.0/bin/generateCorrectionLayouts \
      -S ../CR.seqStore \
      -O  ./CR.ovlStore \
      -C  ./CR.corStore.WORKING \
      -eC 400 \
    > ./CR.corStore.err 2>&1
sh: line 5: 70144 Aborted                 (core dumped) /usr/local/canu/2.0/bin/generateCorrectionLayouts -S ../CR.seqStore -O ./CR.ovlStore -C ./CR.corStore.WORKING -eC 400 > ./CR.corStore.err 2>&1

-- Finished on Thu May 21 06:37:48 2020 (26214 seconds) with 12579.825 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 134.  (rc=34304)
ERROR:

ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to generate correction layouts.
ABORT:
ABORT: Disk space available:  12579.825 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/CR.corStore.err):
ABORT:
ABORT:   sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
ABORT:   generateCorrectionLayouts: stores/sqStore.H:399: uint32 sqStore::sqStore_getReadLength(uint32, sqRead_which): Assertion `id > 0' failed.
ABORT:   
ABORT:   Failed with 'Aborted'; backtrace (libbacktrace):
ABORT:   utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:   stores/sqStore.H::399 in _ZN7sqStore21sqStore_getReadLengthEjj()
ABORT:   stores/ovOverlap.H::205 in _ZNK9ovOverlap5b_lenEj()
ABORT:   correction/generateCorrectionLayouts.C::107 in generateLayout()
ABORT:   correction/generateCorrectionLayouts.C::351 in main()
ABORT:   (null)::0 in (null)()
ABORT:   (null)::0 in (null)()
ABORT:
skoren commented 4 years ago

This still looks like disk corruption (or at least zero bytes where Canu didn't expect zeros). The easiest option, now that the space issue is resolved, is to erase the overlap store and let Canu re-compute the overlaps. If it took a very long time to get to this point in the assembly, the code workaround I suggested earlier may be faster.
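
In practice that means something like this from the assembly directory (a sketch; the store name is taken from your log above):

# remove the suspect overlap store so Canu rebuilds it on the next run
rm -rf correction/CR.ovlStore
# then resubmit the same PBS script; Canu will recompute the overlaps it needs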

skoren commented 4 years ago

Idle; seemed to be caused by running out of space/disk corruption.