marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

canu 1.6 Overlap store sorting jobs failed, retry #650

Closed dibyendukumar closed 6 years ago

dibyendukumar commented 6 years ago

Hi, I am trying to run Canu 1.6 on a large genome (2.5 Gb) with ~58x Sequel data. I am facing a problem at the correction stage and could not figure out why it keeps failing after several retries. Canu failed at 'Sequel.ovlStore.BUILDING/1001', but there is no log file for '1001': nothing in canu-logs, canu-scripts, /correction/Sequel.ovlStore.BUILDING/logs, or scripts.

In the Sequel.ovlStore.BUILDING step, it creates six array jobs (ovS_), each with 1000 jobs. The program fails right after finishing the first set of 1000.

The Sequel.ovlStore.BUILDING folder has the following files/folders: 1 to 1000 (evalueLen, index, info), 1-bucketize.success, 737 buckets, config, config.err, logs, and scripts.

Sequel.ovlStore.BUILDING/logs has two types of files: 1-bucketize.*.out-* (737 files) and 2-sort.*.out-* (1000 files).

Let me know if you need anything else...

Please help... Thanks, Dibyendu

=======================

Most recent CANU out (canu.20.out)

Canu 1.6
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_141' (from 'java').
-- Detected gnuplot version '4.2 patchlevel 6 ' (from 'gnuplot') and image format 'png'.
-- Detected 24 CPUs and 47 gigabytes of memory.
-- Detected PBS/Torque '' with 'pbsnodes' binary in /usr/bin/pbsnodes.
-- Detecting PBS/Torque resources.
-- 
-- Found   2 hosts with  24 cores and   47 GB memory under PBS/Torque control.
-- Found   3 hosts with  24 cores and  126 GB memory under PBS/Torque control.
-- Found   4 hosts with  64 cores and  504 GB memory under PBS/Torque control.
--
--                     (tag)Threads
--            (tag)Memory         |
--        (tag)         |         |  algorithm
--        -------  ------  --------  -----------------------------
-- Grid:  meryl    126 GB   21 CPUs  (k-mer counting)
-- Grid:  cormhap   23 GB   12 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    15 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    15 GB    8 CPUs  (overlap detection)
-- Grid:  cor       11 GB    4 CPUs  (read correction)
-- Grid:  ovb        2 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs      300 GB    1 CPU   (overlap store sorting)
-- Grid:  red       15 GB    8 CPUs  (read error detection)
-- Grid:  oea        4 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      252 GB   32 CPUs  (contig construction)
-- Grid:  cns       72 GB    8 CPUs  (consensus)
-- Grid:  gfa       23 GB   12 CPUs  (GFA alignment and processing)
--
-- Found PacBio uncorrected reads in 'correction/Sequel.gkpStore'.
--
-- Generating assembly 'Sequel' in '/ingens/genomics_work/dk_genomics/w22/canu/W22'
--
-- Parameters:
--
--  genomeSize        2300000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
--
-- BEGIN CORRECTION
--
--
-- Overlap store sorting jobs failed, tried 2 times, giving up.
--   job correction/Sequel.ovlStore.BUILDING/1001 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/1002 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/1003 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/1004 FAILED.
--
--
--
--   job correction/Sequel.ovlStore.BUILDING/8066 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/8067 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/8068 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/8069 FAILED.
--   job correction/Sequel.ovlStore.BUILDING/8070 FAILED.
--

ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:

=======================

Most recent canu command (canu.20.sh)

#  On the off chance that there is a pathMap, and the host we
#  eventually get scheduled on doesn't see other hosts, we decide
#  at run time where the binary is.

syst=`uname -s`
arch=`uname -m`
name=`uname -n`

if [ "$arch" = "x86_64" ] ; then
  arch="amd64"
fi
if [ "$arch" = "Power Macintosh" ] ; then
  arch="ppc"
fi

bin="/ingens/apps/canu-1.6/$syst-$arch/bin"

if [ ! -d "$bin" ] ; then
  bin="/ingens/apps/canu-1.6"
fi

rm -f canu.out
ln -s canu-scripts/canu.20.out canu.out

/usr/bin/env perl \
$bin/canu -p 'Sequel' 'gridOptionsGFA=-l mem=300gb' 'gridOptionsovb=-l mem=80gb' 'gridOptionsovs=-l mem=80gb' 'ovsMemory=10g-300g' 'genomeSize=2.3g' 'corMaxEvidenceErate=0.15' -pacbio-raw '/ingens/genomics_work/dk_genomics/w22/canu/reads.gz' canuIteration=2

===============================

Most recent canu log (1506444832_vtnode07_9142_canu)

###
###  Reading options from '/ingens/apps/canu-1.6/Linux-amd64/bin/canu.defaults'
###

# Add site specific options (for setting up Grid or limiting memory/threads) here.

###
###  Reading options from the command line.
###

gridOptionsGFA=-l mem=300gb
gridOptionsovb=-l mem=80gb
gridOptionsovs=-l mem=80gb
ovsMemory=10g-300g
genomeSize=2.3g
corMaxEvidenceErate=0.15
canuIteration=2

=================================

Config.err file

Attempting to increase maximum allowed processes and open files.
  Max processes per user limited to 4194304, no increase possible.
  Max open files limited to 1048576, no increase possible.

Found 2648540121076 (2648540.12 million) overlaps.
Configuring for 10.00 GB to 300.00 GB memory and 1048560 open files.
Will sort using 8096 files; 327155712 (327.16 million) overlaps per bucket; 10.00 GB memory per bucket
  bucket    1 has 327667988 olaps.
  bucket    2 has 327729014 olaps.
--
--
--
  bucket 8068 has 327829725 olaps.
  bucket 8069 has 327767753 olaps.
  bucket 8070 has 273065652 olaps.
Will sort 327.668 million overlaps per bucket, using 8070 buckets 10.02 GB per bucket.

-  Saved configuration to './Sequel.ovlStore.BUILDING/config'.

=====================================

Sequel.ovlStore.BUILDING/logs file

1-bucketize.361.out-361

Running job 361 based on PBS_ARRAYID=361 and offset=0.

Attempting to increase maximum allowed processes and open files.
  Max processes per user limited to 4194304, no increase possible.
  Max open files limited to 1048576, no increase possible.

maxError fraction: 1.000 percent: 100.000 encoded: 4095
Bucketizing ../1-overlapper/results/000361.ovb
Success!

====================================

Sequel.ovlStore.BUILDING/logs file

2-sort.9.out-9

Running job 9 based on PBS_ARRAYID=9 and offset=0.

Attempting to increase maximum allowed processes and open files.
  Max processes per user limited to 4194304, no increase possible.
  Max open files limited to 1048576, no increase possible.

Job 9 is finished (remove './0009' or -force to try again).

=========================================

Most recent Sequel.ovlStore.BUILDING/scripts 'sh' file

2-sort.jobSubmit-08.sh 

#!/bin/sh

qsub -j oe -d `pwd` \
  -l mem=12g -l nodes=1:ppn=1 -l mem=80gb -o logs/2-sort.\$PBS_ARRAYID.out \
  -N "ovS_Sequel" \
  -t 1-70 \
  ./scripts/2-sort.sh  \
> ./scripts/2-sort.jobSubmit-08.out 2>&1

===========================================

Most recent Sequel.ovlStore.BUILDING/scripts 'out' file

2-sort.jobSubmit-08.out

634295[].nucleus.vitality

===========================================

skoren commented 6 years ago

My guess is this is related to pull request #639: it is supposed to pass an offset, but I don't see that offset in your job submit script.
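
For reference, the log line above ("Running job 361 based on PBS_ARRAYID=361 and offset=0") shows how the sort script derives its job number. A minimal sketch of how an offset-based split could cover all 8070 sort jobs with array chunks of at most 1000, assuming the offset is handed to the job script through qsub's -v option; this is only an illustration, not the actual change in the pull request:

#!/bin/sh
#  Illustration only: submit 8070 sort jobs in array chunks of at most
#  1000, passing an offset so the job script can compute its real job
#  number as PBS_ARRAYID + offset (as seen in the logs above).

total=8070
chunk=1000
offset=0

while [ $offset -lt $total ] ; do
  n=$(( total - offset ))
  if [ $n -gt $chunk ] ; then
    n=$chunk
  fi

  qsub -j oe -d `pwd` \
    -l mem=12g -l nodes=1:ppn=1 \
    -o logs/2-sort.\$PBS_ARRAYID.out \
    -N "ovS_Sequel" \
    -t 1-$n \
    -v offset=$offset \
    ./scripts/2-sort.sh

  offset=$(( offset + chunk ))
done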

The easiest solution is to increase the memory for each sort batch to reduce the number of jobs. It is currently using 10g, so increasing it to 100 (`ovsMemory=100`) should get you under 1000 jobs. You would need to remove the asm.ovlStore.BUILDING folder and re-run the canu command with the added ovsMemory parameter.
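
For example, a sketch of the restart, assuming canu 1.6 is on the PATH, the run is restarted from the assembly directory, and any other grid options from the original command are kept; ovsMemory is the only added parameter:

cd /ingens/genomics_work/dk_genomics/w22/canu/W22
rm -rf correction/Sequel.ovlStore.BUILDING

canu -p Sequel \
  'gridOptionsovb=-l mem=80gb' 'gridOptionsovs=-l mem=80gb' \
  genomeSize=2.3g corMaxEvidenceErate=0.15 \
  ovsMemory=100 \
  -pacbio-raw /ingens/genomics_work/dk_genomics/w22/canu/reads.gz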

dibyendukumar commented 6 years ago

Thanks for the prompt response. We really appreciate it.

In my failed attempt, I increased the memory allocation to 80 GB for 'ovs' but didn't change ovsMemory=10g-300g. I will increase it to 'ovsMemory=100g-300g'.

The ovlStore.BUILDING folder is huge, with over 50 TB of data.

skoren commented 6 years ago

Since you've already computed the overlaps, and assuming you're not out of space, you can let this run finish. For future runs: you probably have lots of repeat overlaps, so you can likely increase the minimum overlap size from 500 and decrease the space used.
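
The parameter that controls this is minOverlapLength (default 500). A sketch for a future run, with the value and the output directory name (W22-minovl) chosen only for illustration:

canu -p Sequel -d W22-minovl \
  genomeSize=2.3g \
  minOverlapLength=1000 \
  ovsMemory=100 \
  -pacbio-raw /ingens/genomics_work/dk_genomics/w22/canu/reads.gz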

dibyendukumar commented 6 years ago

Thanks for your help. Changing ovsMemory worked; Canu moved to the next stage. I will change the minimum overlap size in my next attempt. You are right, the genome has over 85% repeat content.

dibyendukumar commented 6 years ago

Hi,

I reopened this thread because I see a similar problem (the number of jobs exceeding 1000 kills Canu) at the read correction stage. I am posting three questions below. Please suggest...

Question 1: How do I get past the read correction stage?

Last *.out file in 'canu-scripts'
Read correction jobs failed, tried 2 times, giving up.
--   job correction/2-correction/correction_outputs/1001.fasta FAILED.
--   job correction/2-correction/correction_outputs/1002.fasta FAILED.
--   job correction/2-correction/correction_outputs/1003.fasta FAILED.
--
--   job correction/2-correction/correction_outputs/1023.fasta FAILED.
--   job correction/2-correction/correction_outputs/1024.fasta FAILED.
--
ABORT:
ABORT: Canu 1.6
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.

Last log file in the 'canu-logs' folder

gridOptions=-c enabled
gridOptionsGFA=-l mem=300gb
gridOptionsovb=-l mem=120gb
gridOptionsovs=-l mem=120gb
gridOptionscorovl=-l mem=30gb
gridOptionscor=-l mem=30gb
ovsMemory=120g-300g
corMemory=30g
genomeSize=2.3g
corMaxEvidenceErate=0.15
canuIteration=2

Last *.sh file in 'canu-scripts'

$bin/canu -p 'Sequel' 'gridOptions=-c enabled' 'gridOptionsGFA=-l mem=300gb' 'gridOptionsovb=-l mem=120gb' 'gridOptionsovs=-l mem=120gb' 'gridOptionscorovl=-l mem=30gb' 'gridOptionscor=-l mem=30gb' 'ovsMemory=120g-300g' 'corMemory=30g' 'genomeSize=2.3g' 'corMaxEvidenceErate=0.15' -pacbio-raw '/ingens/genomics_work/dk_genomics/w22/canu/reads.gz' canuIteration=2

Question 2: How do I improve the corrected read output? It appears that I will lose almost 75% of reads at this correction stage, most probably due to 'corMaxEvidenceErate=0.15', which is recommended for plant genomes. Should I run another assembly under default conditions?

Sequel.readsToCorrect.summary

Corrected read length filter:

  nReads  20,394,594
  nBases  136,447,949,229 (input bases)
  nBases  33,010,356,662 (corrected bases)
  Mean    1,619
  N50     20,593

Raw read length filter:

  nReads  7,793,676
  nBases  92,000,002,472 (input bases)
  nBases  21,729,541,136 (corrected bases)
  Mean    2,788
  N50     20,593

TN         0 reads             0 raw bases (     0 ave)             0 corrected bases (     0 ave)
FN  12,600,918 reads   44,447,946,757 raw bases (  3527 ave)   11,280,815,526 corrected bases (   895 ave)
FP         0 reads             0 raw bases (     0 ave)             0 corrected bases (     0 ave)
TP   7,793,676 reads   92,000,002,472 raw bases ( 11,804 ave)   21,729,541,136 corrected bases (  2,788 ave)

globalScores.stats

PARAMETERS:
----------
     40 (expected coverage)
      0 (don't use overlaps shorter than this)
  0.000 (don't use overlaps with erate less than this)
  0.150 (don't use overlaps with erate more than this)
OVERLAPS:
--------
IGNORED:
           0 (< 0.0000 fraction error)
2,247,827,803,014 (> 0.1500 fraction error)
           0 (< 0 bases long)
           0 (> 2097151 bases long)

FILTERED:
399,952,062,522 (too many overlaps, discard these shortest ones)
EVIDENCE:
760,255,540 (longest overlaps)
TOTAL:
2,648,540,121,076 (all overlaps)
READS:
-----
       24,040 (no overlaps)
     2,899,774 (no overlaps filtered)
     1,514,323 (<  50% overlaps filtered)
     3,942,668 (<  80% overlaps filtered)
     8,682,529 (<  95% overlaps filtered)
    17,494,820 (< 100% overlaps filtered)

Report File

[CORRECTION/CORRECTIONS]
--
-- Reads to be corrected:
--   20,394,594 reads longer than 6832 bp
--   136,447,949,229 bp
-- Expected corrected reads:
--   20,394,594 reads
--   33,010,356,662 bp
--   0 bp minimum length
--   1,619 bp mean length
--   20,593 bp n50 length

Question 3: Where can I find information about TN, TP, FN, and FP in the *.readsToCorrect.summary file, and what do they stand for?

Thanks a lot...

skoren commented 6 years ago
  1. Did you check the pull request I referenced in the original response? If you update your code as described there and recompile, it should properly provide an offset and complete the jobs with IDs > 1000.

  2. That filter is removing a lot of your overlaps, so running without it will likely help. You can remove the 2-correction folder and Canu will re-generate the reads to correct without the filter. You can also set corPartition=1000 at the same time to work around the 1000-job array offset limit, but you'll probably keep hitting it in other parts of the code unless the code is patched (see the sketch after this list).

  3. Those are going away in the future because they're not very useful. The TP are reads Canu expected to be long, which were corrected to be long and used. FP are reads that should have corrected well but didn't.
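
A sketch of the re-run described in item 2 above, assuming Canu is restarted from the assembly directory, corMaxEvidenceErate is simply dropped to disable the filter, and the other grid options from the earlier command are kept:

cd /ingens/genomics_work/dk_genomics/w22/canu/W22
rm -rf correction/2-correction

canu -p Sequel \
  genomeSize=2.3g \
  corPartition=1000 \
  corMemory=30g 'gridOptionscor=-l mem=30gb' \
  -pacbio-raw /ingens/genomics_work/dk_genomics/w22/canu/reads.gz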

dibyendukumar commented 6 years ago

Thanks, we haven't updated the code yet; I will ask our IT group to do it soon.