marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Recommended coverage for HiCanu #2000

Closed: fjruizruano closed this issue 3 years ago

fjruizruano commented 3 years ago

Hi!

I am running HiCanu on a library with around 80x coverage for a genome of around 1.5 Gb. However, it has been running the "utgovl" step (folder unitigging/1-overlapper) for weeks, so I wonder if I should use lower coverage. This is the command I run:

    canu -p asm_si -d asm_si useGrid=false genomeSize=1500m saveReadCorrections=true saveReadHaplotypes=true saveReads=true -pacbio-hifi reads_hifi.fastq

Thanks in advance!

skoren commented 3 years ago

The default HiFi settings will downsample to 50x coverage. It's hard to say how long the run should take without more information on the kind of system you're using. Post the log from the Canu run; it lists how many jobs run at a time and how many cores are in use.
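For reference, a quick way to sanity-check input coverage before Canu downsamples it (a minimal sketch; it assumes an uncompressed 4-line-per-record FASTQ and the filename from the command above):

    # Estimate coverage as total read bases / genome size.
    # Use zcat instead if the FASTQ is gzipped.
    total_bases=$(awk 'NR % 4 == 2 { bases += length($0) } END { print bases }' reads_hifi.fastq)
    echo "coverage: $(( total_bases / 1500000000 ))x"

Since Canu's maxInputCoverage option (default 50) already handles the downsampling, there should be no need to subsample the library manually.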

fjruizruano commented 3 years ago

Thanks for your answer. This is the log I got:

-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_292' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
-- Detected 20 CPUs and 125 gigabytes of memory.
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Grid engine and staging disabled per useGrid=false option.
--
--                                (tag)Concurrency
--                         (tag)Threads          |
--                (tag)Memory         |          |
--        (tag)             |         |          |       total usage      algorithm
--        -------  ----------  --------   --------  --------------------  -----------------------------
-- Local: meryl     31.000 GB    5 CPUs x   4 jobs   124.000 GB  20 CPUs  (k-mer counting)
-- Local: hap       16.000 GB   20 CPUs x   1 job     16.000 GB  20 CPUs  (read-to-haplotype assignment)
-- Local: cormhap   32.000 GB   10 CPUs x   2 jobs    64.000 GB  20 CPUs  (overlap detection with mhap)
-- Local: obtovl    16.000 GB   10 CPUs x   2 jobs    32.000 GB  20 CPUs  (overlap detection)
-- Local: utgovl    16.000 GB   10 CPUs x   2 jobs    32.000 GB  20 CPUs  (overlap detection)
-- Local: cor       24.000 GB    4 CPUs x   5 jobs   120.000 GB  20 CPUs  (read correction)
-- Local: ovb        4.000 GB    1 CPU  x  20 jobs    80.000 GB  20 CPUs  (overlap store bucketizer)
-- Local: ovs       32.000 GB    1 CPU  x   3 jobs    96.000 GB   3 CPUs  (overlap store sorting)
-- Local: red       31.000 GB    5 CPUs x   4 jobs   124.000 GB  20 CPUs  (read error detection)
-- Local: oea        8.000 GB    1 CPU  x  15 jobs   120.000 GB  15 CPUs  (overlap error adjustment)
-- Local: bat      125.000 GB   16 CPUs x   1 job    125.000 GB  16 CPUs  (contig construction with bogart)
-- Local: cns        -.--- GB    8 CPUs x   - jobs     -.--- GB   - CPUs  (consensus)
--
-- In 'asm_si.seqStore', found PacBio HiFi reads:
--   PacBio HiFi:              1
--
--   Corrected:                1
--   Corrected and Trimmed:    1
--
-- Generating assembly 'asm_si' in '/crex/proj/sllstore2017073/private/GRC/Paco/sumk/canu/asm_si':
--    - assemble HiFi reads.
--
-- Parameters:
--
--  genomeSize        1500000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.0000 (  0.00%)
--    obtOvlErrorRate 0.0250 (  2.50%)
--    utgOvlErrorRate 0.0100 (  1.00%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.0000 (  0.00%)
--    obtErrorRate    0.0250 (  2.50%)
--    utgErrorRate    0.0100 (  1.00%)
--    cnsErrorRate    0.0500 (  5.00%)
--
--
-- BEGIN ASSEMBLY
--
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'utgovl' concurrent execution on Mon Jul 19 22:57:59 2021 with 498058.86 GB free disk space (334 processes; 2 concurrently)

    cd unitigging/1-overlapper
    ./overlap.sh 47 > ./overlap.000047.out 2>&1
    ./overlap.sh 49 > ./overlap.000049.out 2>&1
    ./overlap.sh 50 > ./overlap.000050.out 2>&1
    ./overlap.sh 51 > ./overlap.000051.out 2>&1
    ./overlap.sh 52 > ./overlap.000052.out 2>&1
    ./overlap.sh 53 > ./overlap.000053.out 2>&1
    ./overlap.sh 54 > ./overlap.000054.out 2>&1
    ./overlap.sh 55 > ./overlap.000055.out 2>&1
    ./overlap.sh 56 > ./overlap.000056.out 2>&1
    ./overlap.sh 57 > ./overlap.000057.out 2>&1
    ./overlap.sh 58 > ./overlap.000058.out 2>&1
    ./overlap.sh 59 > ./overlap.000059.out 2>&1
    ./overlap.sh 60 > ./overlap.000060.out 2>&1
    ./overlap.sh 61 > ./overlap.000061.out 2>&1
    ./overlap.sh 62 > ./overlap.000062.out 2>&1
    ./overlap.sh 63 > ./overlap.000063.out 2>&1
skoren commented 3 years ago

Using only 20 cores isn't ideal. A typical human assembly takes about 2k CPU hours, which would be about a week on 20 cores, and it can take longer on large or repetitive genomes. Since you have access to a Slurm cluster, you can let Canu submit jobs itself, which will allow more than 2 jobs to run concurrently and speed up the run. You'd have to remove the 1-overlapper/overlap.sh file and restart without useGrid=false.
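Concretely, that restart might look like the following (a sketch based on the command posted above; Canu resumes from the existing -d directory, so everything except the removed useGrid=false stays the same):

    # Remove the generated overlap script so it is rebuilt with grid settings,
    # then rerun the original command without useGrid=false.
    rm asm_si/unitigging/1-overlapper/overlap.sh
    canu -p asm_si -d asm_si genomeSize=1500m saveReadCorrections=true \
         saveReadHaplotypes=true saveReads=true -pacbio-hifi reads_hifi.fastq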

Another option would be to check the repeat k-mer threshold. You can find the current threshold by looking for the smallest count in unitigging/0-mercounts/*.dump; sorting on the second column and piping to head will show it. If it's over about 250-500 you can lower it, either with an explicit value (utgOvlMerThreshold=300) or a fractional one (utgOvlMerDistinct=0.97). Doing this would require you to start the assembly from scratch, but individual jobs should run faster.
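For example (a sketch; the exact dump filename under 0-mercounts/ may vary between Canu versions):

    # Sort the k-mer dump numerically on its count column (column 2);
    # the smallest count shown is the current repeat threshold.
    sort -k2,2n unitigging/0-mercounts/*.dump | head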

fjruizruano commented 3 years ago

Thanks a lot for your help! I will follow your suggestions. Best.