marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Decreasing the job size during trimming step (1-overlapper) to avoid exceeding time limit on cluster #2277

Closed AlWa1 closed 7 months ago

AlWa1 commented 7 months ago

Dear Canu team,

I am currently using canu to assemble a highly repetitive fungal genome (>2/3 repetitive elements, total genome size roughly 127 Mbp). Since our cluster (SLURM) does not support automatic resubmission from the compute nodes, I am running the assembly in useGrid=remote mode. Everything ran fine so far, but in the trimming step (1-overlapper) some of the individual batch jobs now run longer than the available wall time of 12 hours (free access): 22 jobs finished in time, while 42 jobs hit the wall limit. I am providing 113 cores (a full node with Intel Sapphire Rapids) per job, manually adjusting the thread number in the submission script and in overlap.sh.

To avoid exceeding the 12-hour limit on our cluster, is there a way to decrease the size of the trimming/overlapper jobs so that each job finishes in time?

Many thanks in advance, Alan

Command: canu -p "scaffold" -d PATH/20231103_canu_onestep_test genomeSize=127m -raw -nanopore PATH/merged_fastq_pass_S2.fastq gridOptions="-A SL3-CPU -p sapphire -t 12:00:00 --mail-type=ALL" useGrid=remote
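
Concretely, the manual thread adjustment amounts to something like the sketch below; the exact sbatch flag names inside the generated submit scripts are an assumption and may differ, so treat this as an illustration rather than the actual file contents:

  # Illustrative only: raise the CPU request in one generated submit script
  # from canu's default of 4 to the full 113-core node (and likewise edit
  # the thread count passed inside overlap.sh). The --cpus-per-task flag
  # name is an assumption about what canu writes into the script.
  sed -i 's/--cpus-per-task=4/--cpus-per-task=113/' overlap.jobSubmit-01.sh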


-- canu 2.2
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
-- 

-- Read and contig alignments during correction and consensus use:
--   Šošić M, Šikić M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_112' (from '/rds/user/aw932/hpc-work/software/miniconda3_V4/bin/java') with -d64 support.
-- Detected gnuplot version '5.4 patchlevel 5   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 1 CPUs and 187 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/local/software/slurm/current/bin/sinfo.
-- Detected Slurm with task IDs up to 9999 allowed.
-- 
-- Slurm support detected.  Resources available:
--      1 host  with  64 cores and 1005 GB memory.
--     19 hosts with   4 cores and    5 GB memory.
--     90 hosts with 128 cores and  999 GB memory.
--     56 hosts with  56 cores and  373 GB memory.
--    112 hosts with 112 cores and  500 GB memory.
--      1 host  with 104 cores and  502 GB memory.
--    484 hosts with  76 cores and  249 GB memory.
--    612 hosts with  56 cores and  186 GB memory.
--      1 host  with 112 cores and  999 GB memory.
--    136 hosts with  76 cores and  501 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     13.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap       12.000 GB    8 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   13.000 GB    4 CPUs  (overlap detection with mhap)
-- Grid:  obtovl     8.000 GB    4 CPUs  (overlap detection)
-- Grid:  utgovl     8.000 GB    4 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs        8.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       13.000 GB    4 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat       64.000 GB    8 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found Nanopore reads in 'scaffold.seqStore':
--   Libraries:
--     Nanopore:              1
--   Reads:
--     Raw:                   22024410204
--     Corrected:             7688548639
--
--
-- Generating assembly 'scaffold' in '/rds/project/ss2123/rds-ss2123-team_seb_storage/projects/20231103_canu_onestep_test':
--   genomeSize:
--     127000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.3200 ( 32.00%)
--     obtOvlErrorRate 0.1200 ( 12.00%)
--     utgOvlErrorRate 0.1200 ( 12.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.3000 ( 30.00%)
--     obtErrorRate    0.1200 ( 12.00%)
--     utgErrorRate    0.1200 ( 12.00%)
--     cnsErrorRate    0.2000 ( 20.00%)
--
--   Stages to run:
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
--
-- Correction skipped; not enabled.
--
-- BEGIN TRIMMING
--
-- Running jobs.  First attempt out of 2.

Please run the following commands to submit tasks to the grid for execution.
Each task will use 8 gigabytes memory and 4 threads.

  cd /rds/project/ss2123/rds-ss2123-team_seb_storage/projects/20231103_canu_onestep_test/trimming/1-overlapper
  ./overlap.jobSubmit-01.sh
  ./overlap.jobSubmit-02.sh
  ./overlap.jobSubmit-03.sh
  ./overlap.jobSubmit-04.sh
  ./overlap.jobSubmit-05.sh
  ./overlap.jobSubmit-06.sh
  ./overlap.jobSubmit-07.sh
  ./overlap.jobSubmit-08.sh
  ./overlap.jobSubmit-09.sh
  ./overlap.jobSubmit-10.sh
  ./overlap.jobSubmit-11.sh

When all tasks are finished, restart canu as before.  The output of the grid
submit commands will be in *jobSubmit*out.
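
In practice, the useGrid=remote loop above reduces to a short shell sketch like the one below, assuming the generated jobSubmit scripts simply wrap the sbatch calls as canu prints them:

  # Submit every trimming/overlap batch canu prepared; once all SLURM jobs
  # have finished, rerun the original canu command unchanged so it detects
  # the completed jobs and continues with the next stage.
  cd /rds/project/ss2123/rds-ss2123-team_seb_storage/projects/20231103_canu_onestep_test/trimming/1-overlapper
  for s in ./overlap.jobSubmit-*.sh; do "$s"; done
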
skoren commented 7 months ago

Yes, though it's not very intuitive: you have to adjust either the size of the index each job builds or the number of reads it streams against that index (described here: https://canu.readthedocs.io/en/latest/parameter-reference.html#overlapper-configuration). The relevant parameters are obtOvlRefBlockLength and utgOvlRefBlockLength, which default to 5000000000 for your genome size. Halving the value roughly doubles the number of jobs, so you could try going down to 1000000000 (roughly five times as many, correspondingly smaller jobs) and see how many jobs you end up with and how long they take. You can also set ovlThreads=113 so you don't have to adjust the thread count manually. Lastly, you could try the -fast option, which will speed this step up but might give you a slightly less contiguous assembly.
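
Putting that together, a hedged example of what the rerun could look like; the 1000000000 value is only the starting point floated above, and the ovlThreads spelling is taken directly from the suggestion, so adjust the values and check the resulting job count before committing a full wall-time allocation:

  # Same command as before, with smaller overlap index blocks and the
  # overlapper thread count set on the command line instead of by hand.
  # Optionally add -fast to speed this step up, possibly at a small cost
  # in assembly contiguity.
  canu -p "scaffold" -d PATH/20231103_canu_onestep_test genomeSize=127m \
       -raw -nanopore PATH/merged_fastq_pass_S2.fastq \
       useGrid=remote \
       gridOptions="-A SL3-CPU -p sapphire -t 12:00:00 --mail-type=ALL" \
       obtOvlRefBlockLength=1000000000 \
       utgOvlRefBlockLength=1000000000 \
       ovlThreads=113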


AlWa1 commented 7 months ago

Amazing, that worked perfectly - many thanks!