psur9757 closed this issue 4 years ago.
I don't see 4GB in the Canu configuration or your command anywhere; where did that value come from? I would not specify corConcurrency or even minMemory: your genome is large enough that it will use more memory anyway, and the concurrency doesn't matter on the grid.
The job is likely being killed by the faulty gnuplot (which is what it was running right before being killed) while generating plots. Since the plots are optional, I would just not load gnuplot at all; let the command be unknown and Canu will continue without plotting. You can re-run your last command to resume the run.
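If it helps to confirm that the cluster's gnuplot is the culprit, a quick check like the sketch below tells you whether the gnuplot on your PATH can actually execute (the `gnuplot_usable` helper name is ours, not part of Canu):

```shell
# gnuplot_usable: succeed (exit 0) only if a gnuplot binary is on PATH
# and can actually run. Helper name is ours, not a Canu tool.
gnuplot_usable() {
  command -v gnuplot >/dev/null 2>&1 && gnuplot --version >/dev/null 2>&1
}

if gnuplot_usable; then
  echo "gnuplot works; Canu plotting should be fine"
else
  echo "gnuplot unusable; skip 'module load gnuplot' and let Canu run without plots"
fi
```

Since Canu only uses gnuplot for optional plots, the second branch is harmless for the assembly itself.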
I tried your suggestion and it still dies; I still get the gnuplot error. I know this script worked with older versions (1.6-1.8).
PBS script
#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00
module load canu/2.0
cd canu1
canu -p CR -d canu1 genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 -pacbio-raw pacbio_subreads.fq
Canu.out
Found perl:
/usr/bin/perl
Found java:
/usr/bin/java
openjdk version "1.8.0_151"
Found canu:
/usr/local/canu/2.0/bin/canu
Canu 2.0
-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING: Failed to run gnuplot using command 'gnuplot'.
-- WARNING: Plots will be disabled.
-- WARNING:
--
-- Detected 48 CPUs and 188 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
--
-- Found 49 hosts with 48 cores and 185 GB memory under PBSPro control.
-- Found 27 hosts with 36 cores and 187 GB memory under PBSPro control.
-- Found 80 hosts with 32 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 502 GB memory under PBSPro control.
-- Found 58 hosts with 24 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 250 GB memory under PBSPro control.
-- Found 3 hosts with 64 cores and 6057 GB memory under PBSPro control.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 30.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 16 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 30.000 GB 8 CPUs (overlap detection with mhap)
-- Grid: obtovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: utgovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: cor 24.000 GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 30.000 GB 8 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 256.000 GB 16 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
-- Grid: gfa 32.000 GB 16 CPUs (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
-- PacBio CLR: 1
--
-- Raw: 1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
-- - correct raw reads.
-- - trim corrected reads.
-- - assemble corrected and trimmed reads.
--
-- Parameters:
--
-- genomeSize 1600000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
-- Meryl finished successfully. Kmer frequency histogram:
--
-- WARNING: gnuplot failed.
--
----------------------------------------
/var/spool/PBS/mom_priv/jobs/4220843.pbsserver.SC: line 51: 16739 Killed /usr/bin/env perl $bin/canu -p 'OSR' 'genomeSize=1.6g' 'corOutCoverage=200' 'batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50' 'gridOptions=-P OSR -l walltime=156:00:00' 'gridOptionsJobName=canu1' -pacbio-raw '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/1.data/pacbio_subreads.fq' canuIteration=1
I also checked with our HPC team regarding one of the failures with the first script, and here is what they said:
You are not asking for enough memory.
The error on the node is:
4/22/2020 13:37:23;0008;pbs_python;Job;4204384.pbsserver;Cgroup memory limit exceeded: Killed process 401565, UID 596213, (perl) total-vm:5152684kB, anon-rss:4193824kB, file-rss:2088kB
Looks like you only asked for 4gb.
Associated command: canu -p CR -d canu1 genomeSize=800m corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 corConcurrency=4 -pacbio-raw pacbio_subreads.fq
Ah right, I didn't think about the executive itself requesting less memory than 16GB. The minimum memory is only applied to the steps in the pipeline, not to the executive itself, which we should fix. You want executiveMemory=16 (or 32), which will also increase the grid request to 16GB.
So I updated the script and ran into another error.
PBS Script:
#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00
module load canu/2.0
canu -p OSR -d $outdir genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsJobName=canu1 executiveMemory=32 -pacbio-raw $input
Canu.out
Found perl:
/usr/bin/perl
Found java:
/usr/bin/java
openjdk version "1.8.0_151"
Found canu:
/usr/local/canu/2.0/bin/canu
Canu 2.0
-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING: Failed to run gnuplot using command 'gnuplot'.
-- WARNING: Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
--
-- Found 49 hosts with 48 cores and 185 GB memory under PBSPro control.
-- Found 27 hosts with 36 cores and 187 GB memory under PBSPro control.
-- Found 80 hosts with 32 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 502 GB memory under PBSPro control.
-- Found 58 hosts with 24 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 250 GB memory under PBSPro control.
-- Found 3 hosts with 64 cores and 6057 GB memory under PBSPro control.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 30.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 16 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 30.000 GB 8 CPUs (overlap detection with mhap)
-- Grid: obtovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: utgovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: cor 24.000 GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 30.000 GB 8 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 256.000 GB 16 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
-- Grid: gfa 32.000 GB 16 CPUs (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
-- PacBio CLR: 1
--
-- Raw: 1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
-- - correct raw reads.
-- - trim corrected reads.
-- - assemble corrected and trimmed reads.
--
-- Parameters:
--
-- genomeSize 1600000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- OVERLAPPER (mhap) (correction) complete, not rewriting scripts.
--
--
-- Mhap overlap jobs failed, tried 2 times, giving up.
-- job correction/1-overlapper/results/000042.ovb FAILED.
-- job correction/1-overlapper/results/000052.ovb FAILED.
-- job correction/1-overlapper/results/000054.ovb FAILED.
-- job correction/1-overlapper/results/000055.ovb FAILED.
-- job correction/1-overlapper/results/000056.ovb FAILED.
-- job correction/1-overlapper/results/000060.ovb FAILED.
-- job correction/1-overlapper/results/000062.ovb FAILED.
-- job correction/1-overlapper/results/000063.ovb FAILED.
-- job correction/1-overlapper/results/000064.ovb FAILED.
-- job correction/1-overlapper/results/000065.ovb FAILED.
-- job correction/1-overlapper/results/000066.ovb FAILED.
-- job correction/1-overlapper/results/000067.ovb FAILED.
-- job correction/1-overlapper/results/000068.ovb FAILED.
-- job correction/1-overlapper/results/000069.ovb FAILED.
-- job correction/1-overlapper/results/000070.ovb FAILED.
-- job correction/1-overlapper/results/000071.ovb FAILED.
-- job correction/1-overlapper/results/000072.ovb FAILED.
-- job correction/1-overlapper/results/000073.ovb FAILED.
-- job correction/1-overlapper/results/000074.ovb FAILED.
-- job correction/1-overlapper/results/000075.ovb FAILED.
-- job correction/1-overlapper/results/000076.ovb FAILED.
-- job correction/1-overlapper/results/000090.ovb FAILED.
-- job correction/1-overlapper/results/000092.ovb FAILED.
-- job correction/1-overlapper/results/000098.ovb FAILED.
--
ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
What parameters do I need to change? Thank you!!
This is usually an issue with the JVM configuration on your cluster; it can be configured to have extra overhead that isn't present on our test clusters. You could explicitly increase the memory for the JVM job (gridOptionsMhap="-l select=1:ncpus=8:mem=64GB"), but it would be good to investigate why it's using more memory first. What is the error from one of the failing files (correction/1-overlapper/*98*.out, for example)?
Is this the correct file?
$ tail -50 correction/1-overlapper/*.98
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences loaded and processed from file: 75000...
Current # sequences loaded and processed from file: 80000...
Processed 81000 to sequences.
Time (s) to score, hash to-file, and output: 465.81968653200005
Opened fasta file /scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/blocks/000069.dat.
Current # sequences loaded and processed from file: 5000...
Current # sequences loaded and processed from file: 10000...
Current # sequences loaded and processed from file: 15000...
Current # sequences loaded and processed from file: 20000...
Current # sequences loaded and processed from file: 25000...
Current # sequences loaded and processed from file: 30000...
Current # sequences loaded and processed from file: 35000...
Current # sequences loaded and processed from file: 40000...
Current # sequences loaded and processed from file: 45000...
Current # sequences loaded and processed from file: 50000...
Current # sequences loaded and processed from file: 55000...
Current # sequences loaded and processed from file: 60000...
Current # sequences loaded and processed from file: 65000...
Current # sequences loaded and processed from file: 70000...
Current # sequences loaded and processed from file: 75000...
Current # sequences loaded and processed from file: 80000...
Processed 81000 to sequences.
Time (s) to score, hash to-file, and output: 727.9392699240001
Total scoring time (s): 9608.808452318
Total time (s): 9624.379266296
MinHash search time (s): 6142.355619701
Total matches found: 864643020
Average number of matches per lookup: 485.2093265993266
Average number of table elements processed per lookup: 13733.510815375983
Average number of table elements processed per match: 28.304300973828482
Average % of hashed sequences hit per lookup: 6.272848686106608
Average % of hashed sequences hit that are matches: 4.774735450943637
Average % of hashed sequences fully compared that are matches: 72.42336803473859
sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
mhapConvert: mhap/mhapConvert.C:122: int main(int, char**): Assertion `W.toint32(10) <= W.toint32(11)' failed.
/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/mhap.sh: line 1485: 360055 Aborted (core dumped) $bin/mhapConvert -S ../../OSR.seqStore -o ./results/$qry.mhap.ovb.WORKING $outPath/$qry.mhap
That looks like a disk corruption/space issue. I expect the input file (/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1/correction/1-overlapper/results/*98*mhap) is truncated or corrupted. Did this job get killed by your scheduler, or was there some other error on the system while it was running? Are you out of disk quota? Running out of space is the likeliest explanation.
If you confirm the above aren't the issue (not out of space, not getting killed), you can remove the intermediate bad files (*.mhap and *.WORKING in the correction/1-overlapper/results folder) and re-run the original command.
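Before re-running, it can help to confirm which result files were damaged by the full disk. A small sketch (the `find_truncated` helper name is ours; the paths are from this thread):

```shell
# find_truncated DIR: print zero-byte *.mhap / *.WORKING files under DIR.
# Zero-byte outputs are the usual sign a job died mid-write when the disk
# filled up; non-empty files may still be silently truncated, so when in
# doubt remove all *.mhap and *.WORKING and let Canu recompute those jobs.
find_truncated() {
  find "$1" -maxdepth 1 \( -name '*.mhap' -o -name '*.WORKING' \) -size 0 -print 2>/dev/null
}

# e.g. on this assembly:
# find_truncated correction/1-overlapper/results
# rm -f correction/1-overlapper/results/*.mhap correction/1-overlapper/results/*.WORKING
```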
Our HPC team confirmed that scratch had become low on space, so I deleted the intermediate bad files and restarted the command.
The updated script ran into another error that I cannot figure out:
Canu.out
Found perl:
/usr/bin/perl
Found java:
/usr/bin/java
openjdk version "1.8.0_151"
Found canu:
/usr/local/canu/2.0/bin/canu
Canu 2.0
-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING: Failed to run gnuplot using command 'gnuplot'.
-- WARNING: Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
--
-- Found 49 hosts with 48 cores and 185 GB memory under PBSPro control.
-- Found 27 hosts with 36 cores and 187 GB memory under PBSPro control.
-- Found 80 hosts with 32 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 502 GB memory under PBSPro control.
-- Found 58 hosts with 24 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 250 GB memory under PBSPro control.
-- Found 3 hosts with 64 cores and 6057 GB memory under PBSPro control.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 30.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 16 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 30.000 GB 8 CPUs (overlap detection with mhap)
-- Grid: obtovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: utgovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: cor 24.000 GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 30.000 GB 8 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 256.000 GB 16 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
-- Grid: gfa 32.000 GB 16 CPUs (GFA alignment and processing)
--
-- In 'OSR.seqStore', found PacBio CLR reads:
-- PacBio CLR: 1
--
-- Raw: 1
--
-- Generating assembly 'OSR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu1':
-- - correct raw reads.
-- - trim corrected reads.
-- - assemble corrected and trimmed reads.
--
-- Parameters:
--
-- genomeSize 1600000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- Creating overlap store correction/OSR.ovlStore using:
-- 226 buckets
-- 238 slices
-- using at most 29 GB memory each
-- Overlap store sorter finished.
-- No change in report.
-- Finished stage 'cor-overlapStoreSorterCheck', reset canuIteration.
----------------------------------------
-- Starting command on Sat May 9 11:34:19 2020 with 7755.833 GB free disk space
cd correction
/usr/local/canu/2.0/bin/ovStoreIndexer \
-O ./OSR.ovlStore.BUILDING \
-S ../OSR.seqStore \
-C ./OSR.ovlStore.config \
-delete \
> ./OSR.ovlStore.BUILDING.index.err 2>&1
-- Finished on Sat May 9 11:39:36 2020 (317 seconds) with 7748.516 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Sat May 9 11:39:36 2020 with 7748.516 GB free disk space
cd correction
/usr/local/canu/2.0/bin/ovStoreDump \
-S ../OSR.seqStore \
-O ./OSR.ovlStore \
-counts \
> ./OSR.ovlStore/counts.dat 2> ./OSR.ovlStore/counts.err
-- Finished on Sat May 9 11:39:41 2020 (5 seconds) with 7769.224 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/OSR.ovlStore' successfully constructed.
-- Found 293773456141 overlaps for 7117369 reads; 95252 reads have no overlaps.
--
--
-- Purged 2908.812 GB in 542 overlap output files.
-- No change in report.
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Set corMinCoverage=4 based on read coverage of 103.
-- Global filter scores will be estimated.
-- Computing correction layouts.
----------------------------------------
-- Starting command on Sat May 9 11:39:56 2020 with 7918.803 GB free disk space
cd correction
/usr/local/canu/2.0/bin/generateCorrectionLayouts \
-S ../OSR.seqStore \
-O ./OSR.ovlStore \
-C ./OSR.corStore.WORKING \
-eC 400 \
> ./OSR.corStore.err 2>&1
sh: line 5: 51781 Aborted (core dumped) /usr/local/canu/2.0/bin/generateCorrectionLayouts -S ../OSR.seqStore -O ./OSR.ovlStore -C ./OSR.corStore.WORKING -eC 400 > ./OSR.corStore.err 2>&1
-- Finished on Sat May 9 23:55:52 2020 (44156 seconds) with 9260.643 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 134. (rc=34304)
ERROR:
ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: failed to generate correction layouts.
ABORT:
ABORT: Disk space available: 9260.643 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/OSR.corStore.err):
ABORT:
ABORT: sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
ABORT: ERROR: bogus overlap ' 5489659 216378 I 0 0 11399 4294951323 18432 0.187000'
ABORT: generateCorrectionLayouts: correction/generateCorrectionLayouts.C:114: void generateLayout(tgTig*, uint16*, uint32, double, double, ovOverlap*, uint32, FILE*): Assertion `ovlLength < (((uint32)1 << 21) - 1)' failed.
ABORT:
ABORT: Failed with 'Aborted'; backtrace (libbacktrace):
ABORT: utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: correction/generateCorrectionLayouts.C::114 in generateLayout()
ABORT: correction/generateCorrectionLayouts.C::351 in main()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT:
Canu.pbs
#!/bin/bash
#PBS -P OSR
#PBS -N canu1
#PBS -l select=1:ncpus=8:mem=64GB
#PBS -l walltime=36:00:00
module load canu/2.0
canu -p OSR -d $outdir genomeSize=1.6g corOutCoverage=200 "batOptions=-dg 3 -db 3 -dr 1 -ca 500 -cp 50" gridOptions="-P OSR -l walltime=156:00:00" gridOptionsMhap="-l select=1:ncpus=8:mem=64GB" gridOptionsJobName=canu1 executiveMemory=32 -pacbio-raw $input
I think this is still the result of disk corruption. The store gets checked right after building, which implies it was OK and was then corrupted at some later point. I'd suggest asking your IT group whether your file system was experiencing heavy load (due to Canu or not) and whether they have any suggestions to help mitigate these errors in the future. My concern is that you'll keep hitting these types of errors throughout your run.
Unfortunately, disk corruption errors like this aren't easy to fix. The easiest option for you, if you're able to, is to modify the code to ignore the bad overlaps and continue rather than failing. Hopefully not too many overlaps are corrupt in this way, so it would have minimal effect on the run. You'd have to change the following (I can share a binary with these modifications if you prefer):
110 if (ovlLength > AS_MAX_READLEN) {
111 char ovlString[1024];
112 fprintf(stderr, "ERROR: bogus overlap '%s'\n", ovl[oo].toString(ovlString, ovOverlapAsCoords, false));
113 }
to
110 if (ovlLength > AS_MAX_READLEN) {
111 char ovlString[1024];
112 fprintf(stderr, "ERROR: bogus overlap '%s'\n", ovl[oo].toString(ovlString, ovOverlapAsCoords, false));
+ continue;
113 }
I talked to the HPC team, and they asked me to start a fresh job once they had sorted out the space issue. They insist there is no disk corruption. I restarted and got an error again. Is it the same issue? It looks the same to me.
Error:
Found perl:
/usr/bin/perl
Found java:
/usr/bin/java
openjdk version "1.8.0_151"
Found canu:
/usr/local/canu/2.0/bin/canu
Canu 2.0
-- Canu 2.0
--
-- Detected Java(TM) Runtime Environment '1.8.0_151' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING: Failed to run gnuplot using command 'gnuplot'.
-- WARNING: Plots will be disabled.
-- WARNING:
--
-- Detected 64 CPUs and 6059 gigabytes of memory.
-- Detected PBSPro 'PBSPro_13.1.0.160576' with 'pbsnodes' binary in /usr/local/pbs/default/bin/pbsnodes.
-- Detecting PBSPro resources.
--
-- Found 49 hosts with 48 cores and 185 GB memory under PBSPro control.
-- Found 27 hosts with 36 cores and 187 GB memory under PBSPro control.
-- Found 80 hosts with 32 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 502 GB memory under PBSPro control.
-- Found 58 hosts with 24 cores and 123 GB memory under PBSPro control.
-- Found 2 hosts with 24 cores and 250 GB memory under PBSPro control.
-- Found 3 hosts with 64 cores and 6057 GB memory under PBSPro control.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 30.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 16 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 30.000 GB 8 CPUs (overlap detection with mhap)
-- Grid: obtovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: utgovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: cor 24.000 GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 30.000 GB 8 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 256.000 GB 16 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
-- Grid: gfa 32.000 GB 16 CPUs (GFA alignment and processing)
--
-- In 'CR.seqStore', found PacBio CLR reads:
-- PacBio CLR: 1
--
-- Raw: 1
--
-- Generating assembly 'CR' in '/scratch/RDS-FAE-OSR-RW/coffee_rust/1.assembly/2.results/canu2':
-- - correct raw reads.
-- - trim corrected reads.
-- - assemble corrected and trimmed reads.
--
-- Parameters:
--
-- genomeSize 1600000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.2400 ( 24.00%)
-- obtOvlErrorRate 0.0450 ( 4.50%)
-- utgOvlErrorRate 0.0450 ( 4.50%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.3000 ( 30.00%)
-- obtErrorRate 0.0450 ( 4.50%)
-- utgErrorRate 0.0450 ( 4.50%)
-- cnsErrorRate 0.0750 ( 7.50%)
--
--
-- BEGIN CORRECTION
--
-- No change in report.
--
-- Creating overlap store correction/CR.ovlStore using:
-- 226 buckets
-- 238 slices
-- using at most 29 GB memory each
-- Overlap store sorter finished.
-- No change in report.
-- Finished stage 'cor-overlapStoreSorterCheck', reset canuIteration.
----------------------------------------
-- Starting command on Wed May 20 23:11:56 2020 with 10652.791 GB free disk space
cd correction
/usr/local/canu/2.0/bin/ovStoreIndexer \
-O ./CR.ovlStore.BUILDING \
-S ../CR.seqStore \
-C ./CR.ovlStore.config \
-delete \
> ./CR.ovlStore.BUILDING.index.err 2>&1
-- Finished on Wed May 20 23:20:41 2020 (525 seconds) with 10591.13 GB free disk space
----------------------------------------
-- Checking store.
----------------------------------------
-- Starting command on Wed May 20 23:20:41 2020 with 10591.13 GB free disk space
cd correction
/usr/local/canu/2.0/bin/ovStoreDump \
-S ../CR.seqStore \
-O ./CR.ovlStore \
-counts \
> ./CR.ovlStore/counts.dat 2> ./CR.ovlStore/counts.err
-- Finished on Wed May 20 23:20:43 2020 (2 seconds) with 10597.453 GB free disk space
----------------------------------------
--
-- Overlap store 'correction/CR.ovlStore' successfully constructed.
-- Found 294084131436 overlaps for 7117440 reads; 95181 reads have no overlaps.
--
--
-- Purged 2908.811 GB in 542 overlap output files.
-- No change in report.
-- Finished stage 'cor-createOverlapStore', reset canuIteration.
-- Set corMinCoverage=4 based on read coverage of 103.
-- Global filter scores will be estimated.
-- Computing correction layouts.
----------------------------------------
-- Starting command on Wed May 20 23:20:54 2020 with 10624.02 GB free disk space
cd correction
/usr/local/canu/2.0/bin/generateCorrectionLayouts \
-S ../CR.seqStore \
-O ./CR.ovlStore \
-C ./CR.corStore.WORKING \
-eC 400 \
> ./CR.corStore.err 2>&1
sh: line 5: 70144 Aborted (core dumped) /usr/local/canu/2.0/bin/generateCorrectionLayouts -S ../CR.seqStore -O ./CR.ovlStore -C ./CR.corStore.WORKING -eC 400 > ./CR.corStore.err 2>&1
-- Finished on Thu May 21 06:37:48 2020 (26214 seconds) with 12579.825 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 134. (rc=34304)
ERROR:
ABORT:
ABORT: Canu 2.0
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: failed to generate correction layouts.
ABORT:
ABORT: Disk space available: 12579.825 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (correction/CR.corStore.err):
ABORT:
ABORT: sqStore_loadMetadata()-- Using 'raw' 0x01 reads.
ABORT: generateCorrectionLayouts: stores/sqStore.H:399: uint32 sqStore::sqStore_getReadLength(uint32, sqRead_which): Assertion `id > 0' failed.
ABORT:
ABORT: Failed with 'Aborted'; backtrace (libbacktrace):
ABORT: utility/system-stackTrace.C::89 in _Z17AS_UTL_catchCrashiP7siginfoPv()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT: stores/sqStore.H::399 in _ZN7sqStore21sqStore_getReadLengthEjj()
ABORT: stores/ovOverlap.H::205 in _ZNK9ovOverlap5b_lenEj()
ABORT: correction/generateCorrectionLayouts.C::107 in generateLayout()
ABORT: correction/generateCorrectionLayouts.C::351 in main()
ABORT: (null)::0 in (null)()
ABORT: (null)::0 in (null)()
ABORT:
This still looks like disk corruption (or at least zero bytes where Canu didn't expect them). The easiest option, now that the space issue is resolved, is to erase the overlap store and let Canu re-compute the overlaps. If it took a very long time to get to this point in the assembly, the code workaround I suggested may be faster.
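Erasing the store so Canu rebuilds it on resume can be done along these lines (a sketch: `purge_store` is our helper name, not a Canu tool, and the CR.* paths are taken from this run's logs):

```shell
# purge_store PATH...: remove each given store directory (or file) if it
# exists, echoing what was removed. On the next Canu invocation, the
# missing store causes Canu to redo the work needed to rebuild it.
purge_store() {
  for d in "$@"; do
    if [ -e "$d" ]; then
      rm -rf "$d"
      echo "removed $d"
    fi
  done
}

# e.g. for this run, then resubmit the same canu command:
# purge_store correction/CR.ovlStore correction/CR.corStore.WORKING
```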
Idle; this appears to have been caused by running out of space/disk corruption.
My job keeps using 4GB RAM and getting killed even though I specify minMemory=16. Also, gnuplot fails every time.
PBS Script
Canu.out
Any help would be much appreciated.