marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

grid options, meryl-count and disk quota #2309

Closed: brunocontrerasmoreira closed this issue 7 months ago

brunocontrerasmoreira commented 7 months ago

Hi, I am testing canu on a Slurm Linux cluster for the first time with 2.5 TB of compressed HiFi reads. This is the bash script I submitted with sbatch:

#SBATCH --mem=4G
#SBATCH --time=6-24:00:00
module load java
$HOME/soft/canu-2.2/build/bin/canu -p Tt -d Tt genomeSize=17g useGrid=true gridOptions='--mem-per-cpu=24G' -pacbio-hifi $HOME/fastq/*

The stderr of this job contains:

    ... 
    -- Slurm support detected.  Resources available:
    --    126 hosts with  52 cores and  182 GB memory.
    --     24 hosts with 128 cores and 1854 GB memory.
    --      4 hosts with 128 cores and  878 GB memory.
    --    155 hosts with 128 cores and  438 GB memory.
    --     34 hosts with 256 cores and  683 GB memory.
    --
    --                         (tag)Threads
    --                (tag)Memory         |
    --        (tag)             |         |  algorithm
    --        -------  ----------  --------  -----------------------------
    -- Grid:  meryl     24.000 GB    8 CPUs  (k-mer counting)
    -- Grid:  hap       16.000 GB   16 CPUs  (read-to-haplotype assignment)
    -- Grid:  cormhap   42.000 GB   16 CPUs  (overlap detection with mhap)
    -- Grid:  obtovl    24.000 GB   16 CPUs  (overlap detection)
    -- Grid:  utgovl    24.000 GB   16 CPUs  (overlap detection)
    -- Grid:  cor        -.--- GB    4 CPUs  (read correction)
    -- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
    -- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
    -- Grid:  red       32.000 GB   10 CPUs  (read error detection)
    -- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
    -- Grid:  bat      1024.000 GB   64 CPUs  (contig construction with bogart)
    -- Grid:  cns        -.--- GB    8 CPUs  (consensus)
    --
    -- Found PacBio HiFi reads in 'Apin.seqStore':
    --   Libraries:
    --     PacBio HiFi:           20
    --   Reads:
    --     Corrected:             3400000015033
    --     Corrected and Trimmed: 3400000015033
    ...
    -- BEGIN ASSEMBLY
    --
    -- Running jobs.  First attempt out of 2.
    --
    -- 'meryl-count.jobSubmit-01.sh' -> job 7050352 tasks 1-96.

However, the meryl-count jobs failed; here is the last line of meryl-count.7051213_65.out:

Failed to open './Apin.65.meryl.WORKING/0x001110[066].merylData' for writing: Disk quota exceeded

When I checked the folder where this job was running, I saw a large number of files:

ls Apin.65.meryl.WORKING | wc -l
8334  

How can I change the Slurm settings to avoid this?

Thanks for your help

davidcb98 commented 7 months ago

Good morning,

I am a member of the supercomputing center where Bruno is running this program, and I would like to add one more question to this thread:

Is there a way to have these temporary files generated on the local scratch disks of the compute nodes?

Regards, David

brunocontrerasmoreira commented 7 months ago

I found out that meryl was being invoked with at most 21 GB of RAM, even though I allowed more RAM via gridOptions:

/path/to/canu-2.2/build/bin/meryl k=22 threads=8 memory=21 \
    count \
    segment=$jobid/96 ../../Apin.seqStore \
    output ./Apin.$jobid.meryl.WORKING \

When I edited this script and increased the limit to memory=128G, the number of temp files for each array job dropped below 100.

skoren commented 7 months ago

gridOptions is just passed through to the scheduler; it shouldn't be used to request resources, since canu requests those automatically on a per-job basis. See https://canu.readthedocs.io/en/latest/parameter-reference.html# for more details. You can instead specify the meryl memory and threads, which would both update the above script and request approximately that much memory from the grid. You can also limit the number of concurrent jobs by modifying the grid array parameters (on Slurm, gridEngineArrayOption="-a ARRAY_JOBS%4" would limit canu to at most 4 concurrent jobs).
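
For example, the original submission could be rewritten with canu's own resource options instead of gridOptions. This is only a sketch, reusing the paths from the first post; the merylMemory/merylThreads values are illustrative, not recommendations:

# Sketch only: merylMemory (in GB) and merylThreads are illustrative values
# to be tuned for your data and nodes.
$HOME/soft/canu-2.2/build/bin/canu -p Tt -d Tt genomeSize=17g \
    useGrid=true \
    merylMemory=128 merylThreads=8 \
    gridEngineArrayOption="-a ARRAY_JOBS%4" \
    -pacbio-hifi $HOME/fastq/*

With useGrid=true, canu submits each job array with its own per-job resource requests, so gridOptions can be dropped or kept only for site-specific flags such as a partition or account.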

As for local disk, there is an option for staging (https://canu.readthedocs.io/en/latest/parameter-reference.html#file-staging), but it isn't used for this step, since k-mer counting is usually not an I/O issue compared to later steps. If you're already running out of space here, you'll likely need significantly more disk. A human genome with HiFi reads at 40x coverage takes about 200 GB to compute; given that your genome is much larger and likely more repetitive, I'd count on having at least 2 TB of space available for the run.
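
On David's question about node-local scratch: the staging mechanism linked above is controlled by the stageDirectory option. A minimal sketch follows, assuming the nodes expose local scratch under /local/scratch and that one staging directory per Slurm job is wanted; both the path and the use of $SLURM_JOB_ID are assumptions, and the file-staging docs describe which steps actually honor it:

# Sketch only (assumptions): /local/scratch is a site-specific path, and the
# single quotes stop the submitting shell from expanding $SLURM_JOB_ID here;
# the intent is for each grid job to expand it on its own node.
$HOME/soft/canu-2.2/build/bin/canu -p Tt -d Tt genomeSize=17g \
    useGrid=true \
    stageDirectory='/local/scratch/$SLURM_JOB_ID' \
    -pacbio-hifi $HOME/fastq/*

As noted above, staging does not apply to meryl-count, so this only helps with the later, more I/O-heavy steps.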

brunocontrerasmoreira commented 7 months ago

Hi @skoren, we managed to get the meryl-count jobs done by increasing the disk quota and giving them more RAM. The resulting 0-mercounts/ folder takes 3.4 TB of disk; can this help estimate how much disk space we will need for the remaining jobs? Thanks