marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

How many resources should be set in the Slurm script? #2085

Closed: ZexuanZhao closed this issue 2 years ago

ZexuanZhao commented 2 years ago

Hi!

I'm trying to run Canu on an HPC cluster. Since Canu can "query the system for grid support, configure itself for the machines available in the grid, then submit itself to the grid for execution", I'm wondering how many resources should be requested in the Slurm script that launches Canu.

I set the walltime to 15 minutes just as a test, but when the time was up jobs were still running; it seemed that Canu was out of my control. I killed all the processes, and here's the log:

Found perl:
   /usr/bin/perl
   This is perl 5, version 26, subversion 3 (v5.26.3) built for x86_64-linux-thread-multi

Found java:
   /usr/bin/java
   openjdk version "1.8.0_282"

Found canu:
   /homes/zzhao127/packages/canu-2.2/bin/canu
   canu 2.2

-- canu 2.2
--
-- CITATIONS
--
-- For 'standard' assemblies of PacBio or Nanopore reads:
--   Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
--   Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
--   Genome Res. 2017 May;27(5):722-736.
--   http://doi.org/10.1101/gr.215087.116
--
-- Read and contig alignments during correction and consensus use:
--   Šošić M, Šikić M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
--
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
--
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
--
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
--
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
--
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_282' (from 'java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
--
-- Detected 1 CPUs and 1024 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 1000 allowed.
--
-- Slurm support detected.  Resources available:
--    487 hosts with  20 cores and  124 GB memory.
--      5 hosts with  40 cores and  995 GB memory.
--      1 host  with  16 cores and   62 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     15.000 GB    4 CPUs  (k-mer counting)
-- Grid:  hap       12.000 GB   10 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   20.000 GB    5 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    5 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    5 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       16.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       20.000 GB    5 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found PacBio CLR reads in 'all_pacbio.seqStore':
--   Libraries:
--     PacBio CLR:            1
--   Reads:
--     Raw:                   43741444067
--
--
-- Generating assembly 'all_pacbio' in '/homes/zzhao127/results/canu/all_pacbio':
--   genomeSize:
--     530000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.2400 ( 24.00%)
--     obtOvlErrorRate 0.0450 (  4.50%)
--     utgOvlErrorRate 0.0450 (  4.50%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.2500 ( 25.00%)
--     obtErrorRate    0.0450 (  4.50%)
--     utgErrorRate    0.0450 (  4.50%)
--     cnsErrorRate    0.0750 (  7.50%)
--
--   Stages to run:
--     correct raw reads.
--     trim corrected reads.
--     assemble corrected and trimmed reads.
--
--
-- BEGIN CORRECTION
--
-- Kmer counting (meryl-count) jobs failed, retry.
--   job all_pacbio.01.meryl FAILED.
--   job all_pacbio.02.meryl FAILED.
--   job all_pacbio.03.meryl FAILED.
--   job all_pacbio.04.meryl FAILED.
--   job all_pacbio.05.meryl FAILED.
--   job all_pacbio.06.meryl FAILED.
--
--
-- Running jobs.  Second attempt out of 2.
--
-- 'meryl-count.jobSubmit-01.sh' -> job 18163383 tasks 1-6.
--
----------------------------------------
-- Starting command on Mon Feb 14 14:51:32 2022 with 495.22 GB free disk space

    cd /homes/zzhao127/results/canu/all_pacbio
    sbatch \
      --depend=afterany:18163383 \
      --cpus-per-task=1 \
      --mem-per-cpu=5g   \
      -D `pwd` \
      -J 'canu_all_pacbio' \
      -o canu-scripts/canu.02.out  canu-scripts/canu.02.sh
Submitted batch job 18163384

-- Finished on Mon Feb 14 14:51:32 2022 (furiously fast) with 495.22 GB free disk space
----------------------------------------

Here's the Slurm script:

#!/bin/bash
# The line above this is the "shebang" line.  It must be the first line in the script
#-----------------------------------------------------
#   Run canu on Pacbio non-hifi reads: M1.pacbio.non_hifi.fastq.gz
#   Submit to deepthought2 
#   Author: Zexuan Zhao
#   Email: zzhao127@umd.edu
#-----------------------------------------------------
#
# Slurm sbatch parameters section:
#   Request a single task using 5 CPU cores
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5
#   Request 15 minutes of walltime
#SBATCH -t 0-0:15:0
#   Request 1024 MB of memory for the job
#SBATCH --mem=1024
#   Allow other jobs to run on same node
#SBATCH --oversubscribe
#   Run on debug partition for rapid turnaround.  You will need
#   to change this (remove the line) if walltime > 15 minutes
#   Deleted: SBATCH --partition=debug
#   Do not inherit the environment of the process running the
#   sbatch command.  This requires you to explicitly set up the
#   environment for the job in this script, improving reproducibility
#SBATCH --export=NONE
#

# Section to ensure we have the "module" command defined
unalias tap >& /dev/null
if [ -f ~/.bash_profile ]; then
    source ~/.bash_profile
elif [ -f ~/.profile ]; then
    source ~/.profile
fi

# Set SLURM_EXPORT_ENV to ALL.  This prevents the --export=NONE flag
# from being passed to mpirun/srun/etc, which can cause issues.
# We want the environment of the job script to be passed to all 
# tasks/processes of the job
export SLURM_EXPORT_ENV=ALL

# Module load section
# First clear our module list 
module purge
# and reload the standard modules
module load hpcc/deepthought2

# Section to make a scratch directory for this job
# For sequential jobs, local /tmp filesystem is a good choice
# We include the SLURM job ID in the directory name to avoid interference if
# multiple jobs are running at the same time.
TMPWORKDIR="/tmp/ood-job.${SLURM_JOBID}"
mkdir $TMPWORKDIR
cd $TMPWORKDIR

# Section to output information identifying the job, etc.
echo "Slurm job ${SLURM_JOBID} running on"
hostname
echo "To run in ${SLURM_CPUS_PER_TASK} tasks on a single nodes"
echo "All nodes: ${SLURM_JOB_NODELIST}"
date
pwd
echo "Loaded modules are:"
module list

# Run Canu
/homes/zzhao127/packages/canu-2.2/bin/canu \
 -p all_pacbio \
 -d /homes/zzhao127/results/canu/all_pacbio \
 genomeSize=530M \
 -pacbio /lustre/zzhao127/M1.pacbio.non_hifi.fastq.gz

# Save the exit code from the previous command
ECODE=$?

# Copy results back to submit dir
cp -r * ${SLURM_SUBMIT_DIR}

echo "Job finished with exit code $ECODE"
date

# Exit with the cached exit code
exit $ECODE

Thank you.

skoren commented 2 years ago

The canu job just submits itself to the grid and exits: https://canu.readthedocs.io/en/latest/faq.html#how-do-i-run-canu-on-my-slurm-sge-pbs-lsf-torque-system. So you don't need to assign any resources to it; we usually launch it on the head node. You can see this in your logs, and it's why you still see jobs running after 15 minutes: they're not your original job but subsequent submissions.
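
For example, a minimal sketch of launching it without an sbatch wrapper, using the same paths and parameters as your script (this assumes your cluster allows short-lived processes on the login/head node):

    # Run directly on the head node: Canu detects Slurm, submits its own
    # jobs with the resources shown in its configuration table, and exits.
    /homes/zzhao127/packages/canu-2.2/bin/canu \
      -p all_pacbio \
      -d /homes/zzhao127/results/canu/all_pacbio \
      genomeSize=530M \
      -pacbio /lustre/zzhao127/M1.pacbio.non_hifi.fastq.gz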

You can keep everything on a single node by specifying useGrid=false (not recommended for a ~500 Mbp genome), or you can pass through any Slurm options you want (like a time limit) to the jobs Canu submits using gridOptions="--time 0-0:15:0".
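
For instance (the time limit below is only a placeholder; any partition or account flags your site requires can be added the same way):

    # Option 1: keep the whole assembly inside one Slurm allocation
    # (only sensible on a large node; not recommended at this genome size).
    /homes/zzhao127/packages/canu-2.2/bin/canu \
      -p all_pacbio -d /homes/zzhao127/results/canu/all_pacbio \
      genomeSize=530M useGrid=false \
      -pacbio /lustre/zzhao127/M1.pacbio.non_hifi.fastq.gz

    # Option 2: let Canu manage the grid itself, but append your own Slurm
    # options to the jobs it submits.
    /homes/zzhao127/packages/canu-2.2/bin/canu \
      -p all_pacbio -d /homes/zzhao127/results/canu/all_pacbio \
      genomeSize=530M gridOptions="--time 1-0:0:0" \
      -pacbio /lustre/zzhao127/M1.pacbio.non_hifi.fastq.gz

Whatever time limit you pass via gridOptions should be long enough for the largest individual compute jobs, not just for the wrapper script.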