Closed by fjruizruano 3 years ago
The default HiFi settings will downsample to 50x coverage. It's hard to say how long it should take without more info on what kind of system you're using. Post the log from the Canu run which will list how many jobs are running at a time and how many cores are in use.
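If you want to control the downsampling explicitly rather than rely on the default, recent Canu releases expose a coverage cap as a command-line parameter. A sketch (assuming `maxInputCoverage` is available in your Canu version; output names and the reads file are placeholders):

```shell
# Cap the reads used for assembly at ~50x of the stated genome size;
# input beyond that coverage is randomly discarded.
# (maxInputCoverage is assumed to exist in your Canu version.)
canu -p asm -d asm genomeSize=1500m maxInputCoverage=50 \
     -pacbio-hifi reads_hifi.fastq
```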
Thanks for your answer. This is the log I got:
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_292' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2 ' (from 'gnuplot') and image format 'png'.
-- Detected 20 CPUs and 125 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Grid engine and staging disabled per useGrid=false option.
--
-- (tag)Concurrency
-- (tag)Threads |
-- (tag)Memory | |
-- (tag) | | | total usage algorithm
-- ------- ---------- -------- -------- -------------------- -----------------------------
-- Local: meryl 31.000 GB 5 CPUs x 4 jobs 124.000 GB 20 CPUs (k-mer counting)
-- Local: hap 16.000 GB 20 CPUs x 1 job 16.000 GB 20 CPUs (read-to-haplotype assignment)
-- Local: cormhap 32.000 GB 10 CPUs x 2 jobs 64.000 GB 20 CPUs (overlap detection with mhap)
-- Local: obtovl 16.000 GB 10 CPUs x 2 jobs 32.000 GB 20 CPUs (overlap detection)
-- Local: utgovl 16.000 GB 10 CPUs x 2 jobs 32.000 GB 20 CPUs (overlap detection)
-- Local: cor 24.000 GB 4 CPUs x 5 jobs 120.000 GB 20 CPUs (read correction)
-- Local: ovb 4.000 GB 1 CPU x 20 jobs 80.000 GB 20 CPUs (overlap store bucketizer)
-- Local: ovs 32.000 GB 1 CPU x 3 jobs 96.000 GB 3 CPUs (overlap store sorting)
-- Local: red 31.000 GB 5 CPUs x 4 jobs 124.000 GB 20 CPUs (read error detection)
-- Local: oea 8.000 GB 1 CPU x 15 jobs 120.000 GB 15 CPUs (overlap error adjustment)
-- Local: bat 125.000 GB 16 CPUs x 1 job 125.000 GB 16 CPUs (contig construction with bogart)
-- Local: cns -.--- GB 8 CPUs x - jobs -.--- GB - CPUs (consensus)
--
-- In 'asm_si.seqStore', found PacBio HiFi reads:
-- PacBio HiFi: 1
--
-- Corrected: 1
-- Corrected and Trimmed: 1
--
-- Generating assembly 'asm_si' in '/crex/proj/sllstore2017073/private/GRC/Paco/sumk/canu/asm_si':
-- - assemble HiFi reads.
--
-- Parameters:
--
-- genomeSize 1500000000
--
-- Overlap Generation Limits:
-- corOvlErrorRate 0.0000 ( 0.00%)
-- obtOvlErrorRate 0.0250 ( 2.50%)
-- utgOvlErrorRate 0.0100 ( 1.00%)
--
-- Overlap Processing Limits:
-- corErrorRate 0.0000 ( 0.00%)
-- obtErrorRate 0.0250 ( 2.50%)
-- utgErrorRate 0.0100 ( 1.00%)
-- cnsErrorRate 0.0500 ( 5.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Running jobs. First attempt out of 2.
----------------------------------------
-- Starting 'utgovl' concurrent execution on Mon Jul 19 22:57:59 2021 with 498058.86 GB free disk space (334 processes; 2 concurrently)
cd unitigging/1-overlapper
./overlap.sh 47 > ./overlap.000047.out 2>&1
./overlap.sh 49 > ./overlap.000049.out 2>&1
./overlap.sh 50 > ./overlap.000050.out 2>&1
./overlap.sh 51 > ./overlap.000051.out 2>&1
./overlap.sh 52 > ./overlap.000052.out 2>&1
./overlap.sh 53 > ./overlap.000053.out 2>&1
./overlap.sh 54 > ./overlap.000054.out 2>&1
./overlap.sh 55 > ./overlap.000055.out 2>&1
./overlap.sh 56 > ./overlap.000056.out 2>&1
./overlap.sh 57 > ./overlap.000057.out 2>&1
./overlap.sh 58 > ./overlap.000058.out 2>&1
./overlap.sh 59 > ./overlap.000059.out 2>&1
./overlap.sh 60 > ./overlap.000060.out 2>&1
./overlap.sh 61 > ./overlap.000061.out 2>&1
./overlap.sh 62 > ./overlap.000062.out 2>&1
./overlap.sh 63 > ./overlap.000063.out 2>&1
Using only 20 cores isn't ideal. A typical human assembly takes about 2k CPU hours, which would be about a week on 20 cores; it can take longer on large or repetitive genomes. Since you have access to a cluster, you can let Canu submit jobs itself, which allows more than 2 jobs to run concurrently and will speed up the run. You'd have to remove the 1-overlapper/overlap.sh file and restart after removing useGrid=false.
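Concretely, the restart might look something like this (a sketch, reusing the directory layout from the log and the original command; Slurm was already detected, so no extra grid options should be needed):

```shell
# Remove the overlap script so Canu regenerates it with grid-aware settings
rm asm_si/unitigging/1-overlapper/overlap.sh

# Re-run the same command, dropping useGrid=false so Canu submits
# the overlap jobs to Slurm itself
canu -p asm_si -d asm_si genomeSize=1500m \
     saveReadCorrections=true saveReadHaplotypes=true saveReads=true \
     -pacbio-hifi reads_hifi.fastq
```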
Another option would be to check the repeat k-mer threshold. You can find the current threshold by looking for the smallest value in unitigging/0-mercounts/*.dump; sorting by the second column and piping to head will work. If it's over about 250-500, you can lower it using either an explicit value (utgOvlMerThreshold=300) or a fractional one (utgOvlMerDistinct=0.97). Doing this would require you to start the assembly from scratch, but individual jobs should run faster.
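The threshold check described above can be done like this (a sketch; the mock file here only stands in for the real dump files under unitigging/0-mercounts/, whose exact names depend on your run):

```shell
# Mock of a *.dump file (k-mer, count) to illustrate the check;
# in a real run you would point sort at unitigging/0-mercounts/*.dump
printf 'AAAA\t812\nACGT\t305\nTTTT\t1250\n' > mock.dump

# Sort numerically by the count column and take the smallest value:
# that smallest count is the current repeat threshold
sort -k2,2n mock.dump | head -n 1
# → ACGT    305
```

If the printed count is above roughly 250-500, lowering it (e.g. utgOvlMerThreshold=300) is worth trying.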
Thanks a lot for your help! I will follow your suggestions. Best.
Hi!
I am running HiCanu on a library with around 80x coverage for a genome of around 1.5 Gb. However, it has been stuck on the "utgovl" step (folder unitigging/1-overlapper) for weeks, so I wonder if I should use a lower coverage. This is the command I ran:
canu -p asm_si -d asm_si useGrid=false genomeSize=1500m saveReadCorrections=true saveReadHaplotypes=true saveReads=true -pacbio-hifi reads_hifi.fastq
Thanks in advance!