marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Repeated OOM crashes with up to 1t memory #2279

Closed: barke042 closed this issue 11 months ago

barke042 commented 1 year ago

Hi, I'm running canu 2.2 on a partitioned Slurm grid. I have gotten to what I believe is the consensus step, but it consistently fails there. The script calls 'utgcns', which starts to work on creating the ctgStore but then crashes with an OOM error:

ERROR:
ERROR:  Failed with exit code 137.  (rc=35072)
ERROR:

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to partition the reads.
ABORT:
ABORT: Disk space available:  232398.504 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/DumCar.ctgStore/partitioning.log):
ABORT:
ABORT:   -- Opening seqStore '../DumCar.seqStore'.
ABORT:   -- Using corrected-trimmed reads.
ABORT:   -- Opening tigStore './DumCar.ctgStore' version 1.
ABORT:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=160360817.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

I am submitting this stage to a partition with 8 processors and have tried up to 125g of memory per processor, with the same result. The data are 23g of PacBio HiFi reads representing ~25X coverage of an ~1.2 Gb genome. I know that this is at the low end for coverage, but I'm giving it a try before generating more data.

I noticed that the utgcns call did not include a '-threads' argument, so I also tried running the step directly with that added, but got the same result. I am guessing something may have gone awry at an earlier stage, but I'm not sure how to troubleshoot.

Please let me know what additional outputs (if any) you'd like to see.

skoren commented 1 year ago

What are the stats for the failed job (160360817)? How much memory did it request, and how much memory had it used when it was killed? What does the canu report look like up to this point?

barke042 commented 1 year ago

Here is the sacct information:

JobID        JobIDRaw        JobName  Partition  MaxVMSize  MaxVMSizeNode  MaxVMSizeTask  AveVMSize     MaxRSS MaxRSSNode MaxRSSTask     AveRSS MaxPages MaxPagesNode   MaxPagesTask   AvePages     MinCPU MinCPUNode MinCPUTask     AveCPU   NTasks  AllocCPUS    Elapsed      State ExitCode AveCPUFreq ReqCPUFreqMin ReqCPUFreqMax ReqCPUFreqGov     ReqMem ConsumedEnergy  MaxDiskRead MaxDiskReadNode MaxDiskReadTask    AveDiskRead MaxDiskWrite MaxDiskWriteNode MaxDiskWriteTask   AveDiskWrite    ReqTRES  AllocTRES TRESUsageInAve TRESUsageInMax TRESUsageInMaxNode TRESUsageInMaxTask TRESUsageInMin TRESUsageInMinNode TRESUsageInMinTask TRESUsageInTot TRESUsageOutMax TRESUsageOutMaxNode TRESUsageOutMaxTask TRESUsageOutAve TRESUsageOutTot 
------------ ------------ ---------- ---------- ---------- -------------- -------------- ---------- ---------- ---------- ---------- ---------- -------- ------------ -------------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ---------- ---------- -------- ---------- ------------- ------------- ------------- ---------- -------------- ------------ --------------- --------------- -------------- ------------ ---------------- ---------------- -------------- ---------- ---------- -------------- -------------- ------------------ ------------------ -------------- ------------------ ------------------ -------------- --------------- ------------------- ------------------- --------------- --------------- 
160360817    160360817      canu3.sc      ram1t                                                                                                                                                                                                               8   00:04:09 OUT_OF_ME+    0:125                  Unknown       Unknown       Unknown      1000M              0                                                                                                                           cpu=8,mem+ cpu=8,mem+                                                                                                                                                                                                                                 
160360817.b+ 160360817.b+      batch               507000K         cn2015              0    507000K    428896K     cn2015          0    428896K        1       cn2015              0          1   00:03:53     cn2015          0   00:03:53        1          8   00:04:09 OUT_OF_ME+    0:125     16.18M             0             0             0                         0      516.96M          cn2015               0        516.96M        0.07M           cn2015                0          0.07M            cpu=8,mem+ cpu=00:03:53,+ cpu=00:03:53,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:03:53,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:03:53,+ energy=0,fs/di+ energy=cn2015,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 
160360817.e+ 160360817.e+     extern               146820K         cn2015              0    146820K      1168K     cn2015          0      1168K        0       cn2015              0          0   00:00:00     cn2015          0   00:00:00        1          8   00:04:09 OUT_OF_ME+    0:125      2.76G             0             0             0                         0        0.00M          cn2015               0          0.00M        0.00M           cn2015                0          0.00M            cpu=8,mem+ cpu=00:00:00,+ cpu=00:00:00,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ cpu=cn2015,energy+ cpu=0,fs/disk=0,m+ cpu=00:00:00,+ energy=0,fs/di+ energy=cn2015,fs/d+           fs/disk=0 energy=0,fs/di+ energy=0,fs/di+ 

Here is the entire outfile:

-- canu 2.2
--
-- CITATIONS
--
-- For assemblies of PacBio HiFi reads:
--   Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S.
--   HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads.
--   bioRxiv. 2020.
--   https://doi.org/10.1101/2020.03.14.992248
-- 
-- Read and contig alignments during correction and consensus use:
--   Šošic M, Šikic M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_101' (from 'java') with -d64 support.
-- Detected gnuplot version '4.6 patchlevel 2   ' (from 'gnuplot') and image format 'png'.
--
-- Detected 8 CPUs and 1000 gigabytes of memory on the local machine.
--
-- Detected Slurm with 'sinfo' binary in /usr/bin/sinfo.
-- Detected Slurm with task IDs up to 100000 allowed.
-- 
-- Slurm support detected.  Resources available:
--     16 hosts with  32 cores and 1007 GB memory.
--     35 hosts with 128 cores and  503 GB memory.
--     40 hosts with  24 cores and  125 GB memory.
--      1 host  with  24 cores and  754 GB memory.
--     90 hosts with 128 cores and  251 GB memory.
--     10 hosts with  24 cores and   61 GB memory.
--     11 hosts with  24 cores and  376 GB memory.
--      9 hosts with 128 cores and 2003 GB memory.
--     31 hosts with  24 cores and  251 GB memory.
--    583 hosts with  24 cores and   62 GB memory.
--
--                         (tag)Threads
--                (tag)Memory         |
--        (tag)             |         |  algorithm
--        -------  ----------  --------  -----------------------------
-- Grid:  meryl     25.000 GB    8 CPUs  (k-mer counting)
-- Grid:  hap       16.000 GB   24 CPUs  (read-to-haplotype assignment)
-- Grid:  cormhap   16.000 GB    8 CPUs  (overlap detection with mhap)
-- Grid:  obtovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  utgovl    16.000 GB    8 CPUs  (overlap detection)
-- Grid:  cor        -.--- GB    4 CPUs  (read correction)
-- Grid:  ovb        4.000 GB    1 CPU   (overlap store bucketizer)
-- Grid:  ovs       32.000 GB    1 CPU   (overlap store sorting)
-- Grid:  red       16.000 GB    8 CPUs  (read error detection)
-- Grid:  oea        8.000 GB    1 CPU   (overlap error adjustment)
-- Grid:  bat      256.000 GB   16 CPUs  (contig construction with bogart)
-- Grid:  cns        -.--- GB    8 CPUs  (consensus)
--
-- Found PacBio HiFi reads in 'DumCar.seqStore':
--   Libraries:
--     PacBio HiFi:           1
--   Reads:
--     Corrected:             30074477818
--     Corrected and Trimmed: 30074477818
--
--
-- Generating assembly 'DumCar' in '/scratch.global/barke042/DumCar_v3':
--   genomeSize:
--     1200000000
--
--   Overlap Generation Limits:
--     corOvlErrorRate 0.0000 (  0.00%)
--     obtOvlErrorRate 0.0250 (  2.50%)
--     utgOvlErrorRate 0.0100 (  1.00%)
--
--   Overlap Processing Limits:
--     corErrorRate    0.0000 (  0.00%)
--     obtErrorRate    0.0250 (  2.50%)
--     utgErrorRate    0.0003 (  0.03%)
--     cnsErrorRate    0.0500 (  5.00%)
--
--   Stages to run:
--     assemble HiFi reads.
--
--
-- Correction skipped; not enabled.
--
-- Trimming skipped; not enabled.
--
-- BEGIN ASSEMBLY
----------------------------------------
-- Starting command on Mon Nov 27 10:14:44 2023 with 232430.446 GB free disk space

    cd unitigging
    /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/utgcns \
      -S ../DumCar.seqStore \
      -T  ./DumCar.ctgStore 1 \
      -partition 0.8 1.5 0.1 \
    > ./DumCar.ctgStore/partitioning.log 2>&1
sh: line 4: 2947357 Killed                  /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/utgcns -S ../DumCar.seqStore -T ./DumCar.ctgStore 1 -partition 0.8 1.5 0.1 > ./DumCar.ctgStore/partitioning.log 2>&1

-- Finished on Mon Nov 27 10:18:45 2023 (241 seconds) with 232398.504 GB free disk space
----------------------------------------

ERROR:
ERROR:  Failed with exit code 137.  (rc=35072)
ERROR:

ABORT:
ABORT: canu 2.2
ABORT: Don't panic, but a mostly harmless error occurred and Canu stopped.
ABORT: Try restarting.  If that doesn't work, ask for help.
ABORT:
ABORT:   failed to partition the reads.
ABORT:
ABORT: Disk space available:  232398.504 GB
ABORT:
ABORT: Last 50 lines of the relevant log file (unitigging/DumCar.ctgStore/partitioning.log):
ABORT:
ABORT:   -- Opening seqStore '../DumCar.seqStore'.
ABORT:   -- Using corrected-trimmed reads.
ABORT:   -- Opening tigStore './DumCar.ctgStore' version 1.
ABORT:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=160360817.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

The job was resubmitted after the first crash as: sbatch -p ram1t --cpus-per-task=8 --mem-per-cpu=125 canu3.sc

Where canu3.sc looks like:

#!/bin/bash -l        
#SBATCH --time=96:00:00
#SBATCH --mail-type=ALL  
#SBATCH --mail-user=my_email@blah.edu 
cd /scratch.global/my_directory
/home/barkerfk/shared/canu-2.2/build/bin/canu -p DumCar -d DumCar_v3 genomesize=1.2g \
gridOptions="--time=96:00:00" \
gridOptionsMERYL="-p small --cpus-per-task=8 --mem-per-cpu=5g" \
gridOptionsHAP="-p ram1t --cpus-per-task=24 --mem-per-cpu=2g" \
gridOptionsCORMHAP="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsOBTOVL="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsUTGOVL="-p ram1t --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsCOR="-p ram1t --cpus-per-task=8 --mem-per-cpu=16g" \
gridOptionsOVB="-p small --cpus-per-task=1 --mem-per-cpu=4g" \
gridOptionsOVS="-p ram1t --cpus-per-task=1 --mem-per-cpu=32g" \
gridOptionsRED="-p small --cpus-per-task=8 --mem-per-cpu=2g" \
gridOptionsOEA="-p small --cpus-per-task=1 --mem-per-cpu=8g" \
gridOptionsBAT="-p ram1t --cpus-per-task=16 --mem-per-cpu=16g" \
gridOptionsCNS="-p ram1t --cpus-per-task=8 --mem-per-cpu=125g" \
-pacbio-hifi hifi_reads.fastq.gz

skoren commented 1 year ago

The sacct logs indicate it's being killed after using only about 500 MB. I think that because you're missing the 'g' on 125, it's requesting 8*125 MB, not 8*125 GB. Can you try re-launching with sbatch -p ram1t --cpus-per-task=8 --mem-per-cpu=125g canu3.sc?
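
For reference, a quick way to compare what a job requested with what it actually used (a small sketch using standard sacct fields, the same ones that appear in the table above; substitute the job ID of interest):

    sacct -j 160360817 --format=JobID,ReqMem,MaxRSS,Elapsed,State

After relaunching with the corrected flag, ReqMem should reflect 125 GB per CPU (1000 GB total for 8 CPUs) rather than the 1000M shown above.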

barke042 commented 1 year ago

OMG I contacted the developer over a typo.

brianwalenz commented 1 year ago

Not the first time a typo has wreaked havoc....

We don't encourage setting cpus or memory explicitly in the gridOptions since the algorithms don't know about these limits and can configure themselves to exceed them. Instead of gridOptionsMERYL="-p small --cpus-per-task=8 --mem-per-cpu=5g" use merylMemory=40g merylThreads=8 to tell the algorithms their limits; canu will supply the correct --cpus-per-task and --mem-per-cpu when jobs of that type are submitted.

For the canu process itself, use executiveMemory (default is 4GB) and executiveThreads. Canu resubmits itself as it runs, forgetting about the limits set in the initial sbatch command.

Finally, if you stick all this boring stuff in ~/.canu the options will be set on any future runs.
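
To make that concrete, the submission above could be reduced to something like the sketch below. The memory and thread values are illustrative, not recommendations, and per-stage partitions can still be selected with the gridOptionsXXX options as long as --mem and --cpus are left out of them. The same key=value pairs can also be placed, one per line, in ~/.canu.

    /home/barkerfk/shared/canu-2.2/build/bin/canu -p DumCar -d DumCar_v3 genomesize=1.2g \
      gridOptions="--time=96:00:00" \
      gridOptionsCNS="-p ram1t" \
      executiveMemory=8g executiveThreads=2 \
      merylMemory=40g merylThreads=8 \
      batMemory=256g batThreads=16 \
      cnsMemory=64g cnsThreads=8 \
      -pacbio-hifi hifi_reads.fastq.gz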

barke042 commented 1 year ago

Thanks @brianwalenz for that information; it is very useful. I can then restrict the gridOptions entries to the appropriate partitions that have the required resources.

On other fronts, we seem to have finished up generating the ctgStore (a total of 30494 files), but now we are crashing at the next step.

-- BEGIN ASSEMBLY
-- Using slow alignment for consensus (iteration '0').
-- Configured 30494 consensus jobs.
--
-- Grid:  cns        0.500 GB    8 CPUs  (consensus)
--
--
-- Running jobs.  First attempt out of 2.
--
-- Failed to submit compute jobs.  Delay 10 seconds and try again.

CRASH:
CRASH: canu 2.2
CRASH: Please panic, this is abnormal.
CRASH:
CRASH:   Failed to submit compute jobs.
CRASH:
CRASH: Failed at /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/../lib/site_perl/canu/Execution.pm line 1259.
CRASH:  canu::Execution::submitOrRunParallelJob('DumCar', 'cns', 'unitigging/5-consensus', 'consensus', 1, 2, 3, 4, 5, ...) called at /panfs/jay/groups/25/barkerfk/shared/canu-2.2/build/bin/../lib/site_perl/canu/Consensus.pm line 352
CRASH:  canu::Consensus::consensusCheck('DumCar') called at /home/barkerfk/shared/canu-2.2/build/bin/canu line 1135
CRASH: 
CRASH: Last 50 lines of the relevant log file (unitigging/5-consensus/consensus.jobSubmit-01.out):
CRASH:
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Slurm temporarily unable to accept job, sleeping and retrying
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Setting account: barkerfk
CRASH: sbatch: error: Batch job submission failed: Resource temporarily unavailable
CRASH:

This is because the job submission script is submitting one job per partition file in ctgStore, but our slurm setup here only allows up to 2000 jobs to be submitted at a time.

Should I rewrite the job submission script "consensus.jobSubmit-01.sh" to split up the sbatch commands into groups of 2000, then submit a slurm job using that script? I assume I would then need a final call to canu to finish things up?

skoren commented 1 year ago

Yes, in those cases you can use useGrid=remote to submit the jobs manually. You can also cap the size of each array job with gridEngineArrayMaxJobs; this is normally auto-detected from the Slurm config, though. If there is no array limit but there is a job limit, you might run into an issue like this.
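
As a sketch of the simpler of those two routes (gridEngineArrayMaxJobs is a documented canu option; the value 1000 is just an example under the 2000-job cap mentioned above, and whether it helps depends on whether your site limit counts individual array elements), restarting canu in the same directory would look roughly like:

    /home/barkerfk/shared/canu-2.2/build/bin/canu -p DumCar -d DumCar_v3 genomesize=1.2g \
      gridEngineArrayMaxJobs=1000 \
      -pacbio-hifi hifi_reads.fastq.gz

    # alternatively, add useGrid=remote and submit the generated *.jobSubmit-*.sh scripts by hand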

That said, it looks like you have a pretty bad assembly; over 30k contigs is unusual. I'm guessing this is because of your low coverage. The *.report file in the assembly folder should give more details on what the data and assembly look like. Unless you really want this low-coverage assembly, I wouldn't worry about completing it; you're unlikely to end up with over 2000 consensus jobs in a real run.
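
For reference, given the run directory shown earlier in the log, that report should be at a path like the following (assembled from the -p/-d options used above):

    less /scratch.global/barke042/DumCar_v3/DumCar.report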

barke042 commented 1 year ago

Uffda. I wonder if it is because of the heterozygosity of the individual sequenced. Based on another assembly, we are estimating ~0.003 heterozygosity. Our first assembly with Flye came out almost completely as haplotigs (twice the estimated genome size with half the expected coverage). Maybe it would be better to start over with a higher expected error rate?

skoren commented 1 year ago

The 0.003 heterozygosity seems low; have you looked at the k-mer spectrum of the HiFi data using GenomeScope?
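
For reference, one way to produce that spectrum (a minimal sketch assuming meryl, which ships with canu, and GenomeScope 2.0 are on the PATH; k=21 and the file names are illustrative):

    # count 21-mers in the HiFi reads and dump the histogram
    meryl k=21 count output reads.meryl hifi_reads.fastq.gz
    meryl histogram reads.meryl > reads.hist

    # fit the spectrum to estimate genome size, heterozygosity, and repeat content
    genomescope2 -i reads.hist -k 21 -o genomescope_out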

There's really no way to increase the error rate for structural differences; they will still lead the assembly to split the haplotypes. Increasing the error rate would also hurt your repeat resolution, since you can't tell the difference between near-identical repeats if you're ignoring those kinds of differences between haplotypes.

skoren commented 11 months ago

Idle