What are the stats for the failed job (160360817)? How much memory did it request and how much memory did it use when it was killed? What's the canu report look like up to this point?
Here is the sacct information:
JobID JobIDRaw JobName Partition MaxVMSize MaxVMSizeNode MaxVMSizeTask AveVMSize MaxRSS MaxRSSNode MaxRSSTask AveRSS MaxPages MaxPagesNode MaxPagesTask AvePages MinCPU MinCPUNode MinCPUTask AveCPU NTasks AllocCPUS Elapsed State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov ReqMem ConsumedEnergy MaxDiskRead MaxDiskReadNode MaxDiskReadTask AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask AveDiskWrite ReqTRES AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- ---------------
160360817 160360817 canu3.sc ram1t 8 00:04:09 OUT_OF_ME+ 0:125 Unknown Unknown Unknown 1000M 0 cpu=8,mem+ cpu=8,mem+
160360817.b+ 160360817.b+ batch 507000K cn2015 0 507000K 428896K cn2015 0 428896K 1 cn2015 0 1 00:03:53 cn2015 0 00:03:53 1 8 00:04:09 OUT_OF_ME+ 0:125 16.18M 0 0 0 0 516.96M cn2015 0 516.96M 0.07M cn2015 0 0.07M cpu=8,mem+ cpu=00:03:53,+ cpu=00:03:53,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:03:53,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:03:53,+ energy=0,fs/di+ energy=cn2015,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+
160360817.e+ 160360817.e+ extern 146820K cn2015 0 146820K 1168K cn2015 0 1168K 0 cn2015 0 0 00:00:00 cn2015 0 00:00:00 1 8 00:04:09 OUT_OF_ME+ 0:125 2.76G 0 0 0 0 0.00M cn2015 0 0.00M 0.00M cn2015 0 0.00M cpu=8,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=cn2015,fs/d+ fs/disk=0 energy=0,fs/di+ energy=0,fs/di+
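For reference, the per-step numbers above come from sacct; a minimal query along these lines (the field list here is an illustrative subset, not the exact one used) pulls the request and peak usage for the job and its steps:

# Hedged sketch: requested vs. peak memory for the failed job.
sacct -j 160360817 --format=JobID,JobName,ReqMem,MaxRSS,AveRSS,Elapsed,State,ExitCode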
Here is the entire outfile:
-- canu 2.2
--
-- CITATIONS
--
-- For assemblies of PacBio HiFi reads:
-- Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S.
-- HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
-- bioRxiv. 2020.
-- https://doi.org/10.1101/2020.03.14.992248
--
-- Read and contig alignments during correction and consensus use:
-- Šošić M, Šikić M.
-- Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
-- Bioinformatics. 2017 May 1;33(9):1394-1395.
-- http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
-- Berlin K, et al.
-- Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
-- Nat Biotechnol. 2015 Jun;33(6):623-30.
-- http://doi.org/10.1038/nbt.3238
--
-- Myers EW, et al.
-- A Whole-Genome Assembly of Drosophila.
-- Science. 2000 Mar 24;287(5461):2196-204.
-- http://doi.org/10.1126/science.287.5461.2196
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
-- Chin CS, et al.
-- Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
-- Nat Methods. 2013 Jun;10(6):563-9
-- http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2 ' (from 'gnuplot') and image format 'png'.
--
-- Detected 8 CPUs and 1000 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 100000 allowed.
--
-- Slurm support detected. Resources available:
-- 16 hosts with 32 cores and 1007 GB memory.
-- 35 hosts with 128 cores and 503 GB memory.
-- 40 hosts with 24 cores and 125 GB memory.
-- 1 host with 24 cores and 754 GB memory.
-- 90 hosts with 128 cores and 251 GB memory.
-- 10 hosts with 24 cores and 61 GB memory.
-- 11 hosts with 24 cores and 376 GB memory.
-- 9 hosts with 128 cores and 2003 GB memory.
-- 31 hosts with 24 cores and 251 GB memory.
-- 583 hosts with 24 cores and 62 GB memory.
--
-- (tag)Threads
-- (tag)Memory |
-- (tag) | | algorithm
-- ------- ---------- -------- -----------------------------
-- Grid: meryl 25.000 GB 8 CPUs (k-mer counting)
-- Grid: hap 16.000 GB 24 CPUs (read-to-haplotype assignment)
-- Grid: cormhap 16.000 GB 8 CPUs (overlap detection with mhap)
-- Grid: obtovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: utgovl 16.000 GB 8 CPUs (overlap detection)
-- Grid: cor -.--- GB 4 CPUs (read correction)
-- Grid: ovb 4.000 GB 1 CPU (overlap store bucketizer)
-- Grid: ovs 32.000 GB 1 CPU (overlap store sorting)
-- Grid: red 16.000 GB 8 CPUs (read error detection)
-- Grid: oea 8.000 GB 1 CPU (overlap error adjustment)
-- Grid: bat 256.000 GB 16 CPUs (contig construction with bogart)
-- Grid: cns -.--- GB 8 CPUs (consensus)
--
-- Found PacBio HiFi reads in 'DumCar.seqStore':
-- Libraries:
-- PacBio HiFi: 1
-- Reads:
-- Corrected: 30074477818
-- Corrected and Trimmed: 30074477818
--
--
-- Generating assembly 'DumCar' in '/scratch.global/barke042/DumCar_v3':
-- genomeSize:
-- 1200000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.0000 ( 0.00%)
-- obtOvlErrorRate 0.0250 ( 2.50%)
-- utgOvlErrorRate 0.0100 ( 1.00%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.0000 ( 0.00%)
-- obtErrorRate 0.0250 ( 2.50%)
-- utgErrorRate 0.0003 ( 0.03%)
-- cnsErrorRate 0.0500 ( 5.00%)
--
-- Stages to run:
-- assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--
-- BEGIN ASSEMBLY
----------------------------------------
-- Starting command on Mon Nov 27 10:14:44 2023 with 232430.446 GB free disk space
cd unitigging
/panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/utgcns \
-S ../DumCar.seqStore \
-T ./DumCar.ctgStore 1 \
-partition 0.8 1.5 0.1 \
> ./DumCar.ctgStore/partitioning.log 2>&1
sh: line 4: 2947357 Killed /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/utgcns -S ../DumCar.seqStore -T ./DumCar.ctgStore 1 -partition 0.8 1.5 0.1 > ./DumCar.ctgStore/partitioning.log 2>&1
-- Finished on Mon Nov 27 10:18:45 2023 (241 seconds) with 232398.504 GB free disk space
----------------------------------------
ERROR:
ERROR: Failed with exit code 137. (rc=35072)
ERROR:
ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting. If that doesn't work, ask for help.
ABORT:
ABORT: failed to partition the reads.
ABORT:
ABORT: Disk space available: 232398.504 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/DumCar.ctgStore/partitioning.log):
ABORT:
ABORT: -- Opening seqStore '../DumCar.seqStore'.
ABORT: -- Using corrected-trimmed reads.
ABORT: -- Opening tigStore './DumCar.ctgStore' version 1.
ABORT:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=160360817.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
The job was resubmitted after the first crash as:
sbatch -p ram1t --cpus-per-task=8 --mem-per-cpu=125 canu3.sc
Where canu3.sc looks like:
#!/bin/bash -l
#SBATCH --time=96:00:00
#SBATCH --mail-type=ALL
#SBATCH --mail-user=my_email@blah.edu
cd /scratch.global/my_directory
/home/barkerfk/shared/canu-2.2/build/bin/canu -p DumCar -d DumCar_v3 genomesize=1.2g \
gridOptions="--time=96:00:00" \
gridOptionsMERYL="-p small --cpus-per-task=8 --mem-per-cpu=5g" \
gridOptionsHAP="-p ram1t --cpus-per-task=24 --mem-per-cpu=2g" \
gridOptionsCORMHAP="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsOBTOVL="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsUTGOVL="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsCOR="-p ram1t --cpus-per-task=8 --mem-per-cpu=16g" \
gridOptionsOVB="-p small --cpus-per-task=1 --mem-per-cpu=4g" \
gridOptionsOVS="-p ram1t --cpus-per-task=1 --mem-per-cpu=32g" \
gridOptionsRED="-p small --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsOEA="-p small --cpus-per-task=1 --mem-per-cpu=8g" \
gridOptionsBAT="-p ram1t --cpus-per-task=16 --mem-per-cpu=16g" \
gridOptionsCNS="-p ram1t --cpus-per-task=8 --mem-per-cpu=125g" \
-pacbio-hifi hifi_reads.fastq.gz
The sacct logs indicate it's being killed after using only about 500 MB. I think because you're missing the 'g' on 125, it's requesting 8*125 MB, not 8*125 GB (Slurm reads a bare --mem-per-cpu value as megabytes). Can you try re-launching with sbatch -p ram1t --cpus-per-task=8 --mem-per-cpu=125g canu3.sc?
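For what it's worth, a minimal sketch of pinning the corrected request inside canu3.sc itself, so the unit can't be dropped on the command line again (the extra #SBATCH directives just repeat what was passed to sbatch above):

#!/bin/bash -l
#SBATCH --time=96:00:00
#SBATCH -p ram1t
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=125g   # a bare "125" here would be read by Slurm as 125 MB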
OMG I contacted the developer over a typo.
Not the first time a typo has wreaked havoc....
We don't encourage setting cpus or memory explicitly in the gridOptions, since the algorithms don't know about these limits and can configure themselves to exceed them. Instead of gridOptionsMERYL="-p small --cpus-per-task=8 --mem-per-cpu=5g", use merylMemory=40g merylThreads=8 to tell the algorithms their limits; canu will supply the correct --cpus-per-task and --mem-per-cpu when jobs of that type are submitted.
For the canu process itself, use executiveMemory (default is 4 GB) and executiveThreads. Canu resubmits itself as it runs, forgetting about the limits set in the initial sbatch command.
Finally, if you stick all this boring stuff in ~/.canu, the options will be set on any future runs.
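As a rough illustration (the specific values below are placeholders rather than recommendations from this thread, and the '#' comment assumes canu's spec-file comment syntax), a ~/.canu along these lines would set the limits once for all future runs:

# ~/.canu -- per-stage limits canu will honor when submitting its own jobs
merylMemory=40g
merylThreads=8
batMemory=256g
batThreads=16
executiveMemory=16g
executiveThreads=4
gridOptions=--time=96:00:00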
Thanks @brianwalenz for that information; it is very useful. I can then restrict the gridOptions entries to the appropriate partitions that have the required resources.
On other fronts, we seem to have finished generating the ctgStore (a total of 30494 files), but now we are crashing at the next step:
-- BEGIN ASSEMBLY
-- Using slow alignment for consensus (iteration '0').
-- Configured 30494 consensus jobs.
--
-- Grid: cns 0.500 GB 8 CPUs (consensus)
--
--
-- Running jobs. First attempt out of 2.
--
-- Failed to submit compute jobs. Delay 10 seconds and try again.
CRASH:
CRASH: canu 2.2
CRASH: Please panic, this is abnormal.
CRASH:
CRASH: Failed to submit compute jobs.
CRASH:
CRASH: Failed at /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/../lib/site_perl/canu/Execution.pm line 1259.
CRASH: canu::Execution::submitOrRunParallelJob('DumCar', 'cns', 'unitigging/5-consensus', 'consensus', 1, 2, 3, 4, 5, ...) called at /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/../lib/site_perl/canu/Consensus.pm line 352
CRASH: canu::Consensus::consensusCheck('DumCar') called at /home/barkerfk/shared/canu-2.2/build/bin/canu line 1135
CRASH:
CRASH: Last 50 lines of the relevant log file (unitigging/5-consensus/consensus.jobSubmit-01.out):
CRASH:
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Batch job submission failed: Resource temporarily unavailable
CRASH:
This is because the job submission script is submitting one job per partition file in ctgStore, but our slurm setup here only allows up to 2000 jobs to be submitted at a time.
Should I rewrite the job submission script "consensus.jobSubmit-01.sh" to split up the sbatch commands into groups of 2000, then submit a slurm job using that script? I assume I would then need a final call to canu to finish things up?
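Purely as a sketch of that idea (the consensus.sh name and the resource flags are assumptions; the real submit line lives in consensus.jobSubmit-01.sh), hand-splitting the 30494 tasks into scheduler-sized chunks might look like:

# Hypothetical manual split of the consensus array into chunks of at most 2000 tasks.
for start in $(seq 1 2000 30494); do
    end=$(( start + 1999 ))
    (( end > 30494 )) && end=30494
    sbatch -p ram1t --cpus-per-task=8 --array=${start}-${end} ./consensus.sh
done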
Yes, in those cases you can use useGrid=remote to manually submit the jobs. You can also specify the max size of each array job using gridEngineArrayMaxJobs; this is normally auto-detected from the slurm config, though. If there is no array limit but there is a job limit, you might run into an issue like this.
That said, this looks like a pretty bad assembly; over 30k contigs is unusual. I'm guessing this is because of your low coverage. The *.report file in the assembly folder should give more details on what the data/assembly look like. Unless you really want this low-coverage assembly, I wouldn't worry about completing it; you're unlikely to end up with over 2000 consensus jobs in a real run.
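A hedged sketch of applying those options on a restart (the 2000 value is taken from the job cap described above and assumes that cap behaves like an array-size limit; canu resumes from the existing -d directory):

# Cap each submitted array, or hand submission over entirely with useGrid=remote.
/home/barkerfk/shared/canu-2.2/build/bin/canu -p DumCar -d DumCar_v3 genomesize=1.2g \
    gridEngineArrayMaxJobs=2000 \
    -pacbio-hifi hifi_reads.fastq.gz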
Uffda. I wonder if it is because of the heterozygosity of the individual sequenced. Based on another assembly, we are estimating ~0.003 heterozygosity. Our first assembly with Flye came out almost completely as haplotigs (twice the estimated genome size with half the expected coverage). Maybe it would be better to start over with a higher expected error rate?
The 0.003 heterozygosity seems low; have you looked at the k-mer spectrum from the HiFi data using genomescope?
There's really no way to increase the error rate for structural differences, they will still lead to the assembly splitting the haplotypes. Increasing the error rate would also hurt your repeat resolution since you can't tell the difference between near repeats if you're ignoring those types of differences between haplotypes.
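In case it's useful, a rough sketch of building that k-mer spectrum from the HiFi reads (k=21 and the GenomeScope2 invocation are assumptions, not something prescribed in this thread):

# Count 21-mers with meryl (ships with canu), dump the histogram, fit with GenomeScope2.
meryl count k=21 output hifi_k21.meryl hifi_reads.fastq.gz
meryl histogram hifi_k21.meryl > hifi_k21.hist
genomescope2 -i hifi_k21.hist -k 21 -o genomescope_out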
Hi, I'm running canu 2.2 on a partitioned slurm grid. I have gotten to what I believe is the consensus step, but I'm consistently hitting an issue there. The script calls 'utgcns', which starts to work on creating the ctgStore but then crashes with an OOM error:
I am submitting this stage to a partition with 8 processors and have tried up to 125g memory per processor with the same result. The data are 23 GB of PacBio HiFi reads representing ~25X coverage of an ~1.2 Gb genome. I know that this is at the low end for coverage, but I'm giving it a try before generating more data.
I noticed that the utgcns call did not include a '-threads' argument, so I also tried running the step directly with that added, but got the same result. I am guessing something may have gone awry at an earlier stage, but I'm not sure how to troubleshoot.
Please let me know what additional outputs (if any) you'd like to see.