marbl / canu

A single molecule sequence assembler for genomes large and small.
http://canu.readthedocs.io/

Unsure if this indicates memory leak in `splitHaplotype` #1607

Closed: SHuang-Broad closed this issue 2 months ago

SHuang-Broad commented 4 years ago

I've been trying to build our cloud-native pipelines using trioCanu, and I think this might be useful to the canu developers.

Shown below is the resource usage during splitHaplotype.

[plot: splitHaplotype resource usage over time]

Note that the memory usage "drifted" higher from the first hump to the last. Is this expected? If not, I'm not sure how high a priority this is for your team (it's not a high priority for us), but I thought it might be useful in case you're interested in looking into it.

I've attached the shell script, the out file and the log below. Archive.zip

brianwalenz commented 4 years ago

Nice! I think this is expected.

The kmer database is split into 64 pieces, which can be loaded in parallel, one per thread. There are two databases to load - 128 total pieces - so four humps are expected when 36 threads are available. The first two humps should be the first database, the second two the second. The growth is hopefully just the kmers being loaded. Reasonably expected.
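To spell out the arithmetic with the numbers from this run (a tiny sketch, not canu code):

    // The arithmetic behind the four humps: each database has 64 pieces,
    // loaded one piece per thread.  Counts are taken from this run.
    #include <cstdio>

    int main() {
      const int piecesPerDb = 64;   // pieces per kmer database
      const int databases   = 2;    // Father + Mother
      const int threads     = 36;   // CPUs available on this machine

      const int wavesPerDb = (piecesPerDb + threads - 1) / threads;  // ceil(64/36) = 2
      printf("expected humps: %d\n", wavesPerDb * databases);        // 2 * 2 = 4
      return 0;
    }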

The humps themselves are probably overhead in unpacking the compressed kmer data. I'm a little surprised at how big that is...but it seems to be released when done, so no damage.

Do you know what the first 16 minutes was doing? Just downloading data? 36 cores seems overkill here. I'd expect using 16 cores, or maybe even just 8, would be better. It looks like 20 minutes of loading data for a one minute compute. For full impact, I'd have to support computing with 8 threads, but loading all 64 pieces in parallel. Easy to do, just need another command line option (that I might not expose to canu though).
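Roughly this shape, as a hypothetical sketch (OpenMP here is only for illustration, not how splitHaplotype is actually threaded; the piece and thread counts are the ones discussed above):

    // Sketch of the idea only, not canu's actual code: unpack every piece
    // with all available threads, then compute with a smaller thread count.
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    struct Piece { int id; /* decoded kmer data for one piece would go here */ };

    int main() {
      const int nPieces        = 64;  // pieces per kmer database
      const int loadThreads    = 36;  // all CPUs for loading/unpacking
      const int computeThreads = 8;   // plenty for the ~1 minute of compute

      std::vector<Piece> pieces(nPieces);

      #pragma omp parallel for num_threads(loadThreads)
      for (int p = 0; p < nPieces; p++)
        pieces[p].id = p;             // stand-in for "load and unpack piece p"

      #pragma omp parallel for num_threads(computeThreads)
      for (int p = 0; p < nPieces; p++)
        (void)pieces[p];              // stand-in for the read-assignment compute

      printf("loaded and processed %d pieces\n", nPieces);
      return 0;
    }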

SHuang-Broad commented 4 years ago

@brianwalenz Sorry it took me so long to get back to you! (multiple projects...)

It's a relief to know this is expected!


Do you know what the first 16 minutes was doing? Just downloading data?

No, all data has been localized and put in the right place before canu is called.

Below is the relevant part of the log file. (Please bear in mind that this is my personal fork of canu 1.9, with only the Perl workflow-control code changed.) You can see that the first minute is spent configuring the child long-read splitting job; after that, not much is logged.

Wed Jan 22 16:13:12 UTC 2020
Limiting hap memory to: 29 G
-- Canu branch sh_fix_k_31 +25 changes (r9633 7e2fd9ae685e1c67848a973ef3ea7858d2ae1a0b)
--
-- CITATIONS
--
-- Koren S, Walenz BP, Berlin K, Miller JR, Phillippy AM.
-- Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.
-- Genome Res. 2017 May;27(5):722-736.
-- http://doi.org/10.1101/gr.215087.116
-- 
-- Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM.
-- De novo assembly of haplotype-resolved genomes with trio binning.
-- Nat Biotechnol. 2018
-- https://doi.org/10.1038/nbt.4277
-- 
-- Read and contig alignments during correction, consensus and GFA building use:
--   Šošić M, Šikić M.
--   Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
--   Bioinformatics. 2017 May 1;33(9):1394-1395.
--   http://doi.org/10.1093/bioinformatics/btw753
-- 
-- Overlaps are generated using:
--   Berlin K, et al.
--   Assembling large genomes with single-molecule sequencing and locality-sensitive hashing.
--   Nat Biotechnol. 2015 Jun;33(6):623-30.
--   http://doi.org/10.1038/nbt.3238
-- 
--   Myers EW, et al.
--   A Whole-Genome Assembly of Drosophila.
--   Science. 2000 Mar 24;287(5461):2196-204.
--   http://doi.org/10.1126/science.287.5461.2196
-- 
-- Corrected read consensus sequences are generated using an algorithm derived from FALCON-sense:
--   Chin CS, et al.
--   Phased diploid genome assembly with single-molecule real-time sequencing.
--   Nat Methods. 2016 Dec;13(12):1050-1054.
--   http://doi.org/10.1038/nmeth.4035
-- 
-- Contig consensus sequences are generated using an algorithm derived from pbdagcon:
--   Chin CS, et al.
--   Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data.
--   Nat Methods. 2013 Jun;10(6):563-9
--   http://doi.org/10.1038/nmeth.2474
-- 
-- CONFIGURE CANU
--
-- Detected Java(TM) Runtime Environment '1.8.0_232' (from '/usr/local/openjdk-8/bin/java') with -d64 support.
--
-- WARNING:
-- WARNING:  Failed to run gnuplot using command 'gnuplot'.
-- WARNING:  Plots will be disabled.
-- WARNING:
--
-- Detected 36 CPUs and 32 gigabytes of memory.
-- No grid engine detected, grid disabled.
--
--                            (tag)Concurrency
--                     (tag)Threads          |
--            (tag)Memory         |          |
--        (tag)         |         |          |     total usage     algorithm
--        -------  ------  --------   --------  -----------------  -----------------------------
-- Local: hap       29 GB   36 CPUs x   1 job     29 GB   36 CPUs  (read-to-haplotype assignment)
--
-- Found PacBio uncorrected reads in the input files.
--
-- Generating assembly 'giab' in '/cromwell_root/workdir'
--
-- Parameters:
--
--  genomeSize        3100000000
--
--  Overlap Generation Limits:
--    corOvlErrorRate 0.2400 ( 24.00%)
--    obtOvlErrorRate 0.0450 (  4.50%)
--    utgOvlErrorRate 0.0450 (  4.50%)
--
--  Overlap Processing Limits:
--    corErrorRate    0.3000 ( 30.00%)
--    obtErrorRate    0.0450 (  4.50%)
--    utgErrorRate    0.0450 (  4.50%)
--    cnsErrorRate    0.0750 (  7.50%)
--
-- USING k = 21
--
--
--
-- BEGIN ASSIGNING CHILD LONG READS
--
--
-- Finished stage 'haplotypeReadsConfigure', reset canuIteration.
-- New report created.
--
-- Running jobs.  First attempt out of 2.
----------------------------------------
-- Starting 'hap' concurrent execution on Wed Jan 22 16:14:30 2020 with 144.726 GB free disk space (1 processes; 1 concurrently)

    cd haplotype
    ./splitHaplotype.sh 1 > ./splitHaplotype.000001.out 2>&1

-- Finished on Wed Jan 22 20:43:42 2020 (16152 seconds) with 118.17 GB free disk space
----------------------------------------
-- Found 1 haplotyping outputs.
-- No change in report.
-- Finished stage 'haplotype-reads', reset canuIteration.
Stop requested after 'haplotype'.

The out file generated by running the shell script is below, but it has no timestamps, so I really don't know what's happening during the first 16 minutes.

Found perl:
   /usr/bin/perl
   This is perl 5, version 24, subversion 1 (v5.24.1) built for x86_64-linux-gnu-thread-multi

Found java:
   /usr/local/openjdk-8/bin/java
   Picked up _JAVA_OPTIONS: -Djava.io.tmpdir=/cromwell_root/tmp.e0308cfc

Found canu:
   /opt/canu/Linux-amd64/bin/canu
   Canu branch sh_fix_k_31 +25 changes (r9633 7e2fd9ae685e1c67848a973ef3ea7858d2ae1a0b)

Running job 1 based on command line options.
--
-- Loading haplotype data, using up to 14 GB memory for each.
--
--  Haplotype './0-kmers/haplotype-Father.meryl':
--   use kmers with frequency at least 11.

 p       prefixes             bits gigabytes (allowed: 14 GB)
-- -------------- ---------------- ---------
17         131072       5217764780     0.607
18         262144       5158499152     0.601
19         524288       5107622132     0.595
20        1048576       5073522328     0.591
21        2097152       5072976956     0.591 (smallest)
22        4194304       5139540448     0.598
23        8388608       5340321668     0.622
24       16777216       5809538344     0.676
25       33554432       6815625932     0.793
26       67108864       8895455344     1.036
27      134217728      13122768404     1.528
28      268435456      21645048760     2.520
29      536870912      38757263708     4.512
30     1073741824      73049347840     8.504 (used)
31     2147483648     141701170340    16.496
32     4294967296     279072469576    32.488
33     8589934592     553882722284    64.480
34    17179869184    1103570881936   128.473
-- -------------- ---------------- ---------

For 67654236 distinct 31-mers (with 30 bits used for indexing and 32 bits for tags):
    8.504 GB memory
    8.000 GB memory for index (1073741824 elements 64 bits wide)
    0.252 GB memory for tags  (67654236 elements 32 bits wide)
    0.252 GB memory for data  (67654236 elements 32 bits wide)

Will load 67654236 kmers.  Skipping 12511147671 (too low) and 0 (too high) kmers.
Allocating space for 67654236 suffixes of 32 bits each -> 2164935552 bits (0.252 GB) in blocks of 32.000 MB
                     67654236 values   of 32 bits each -> 2164935552 bits (0.252 GB) in blocks of 32.000 MB
Loaded 67654236 kmers.  Skipped 12511147671 (too low) and 0 (too high) kmers.
--   loaded 67654236 kmers.
--  Haplotype './0-kmers/haplotype-Mother.meryl':
--   use kmers with frequency at least 13.

 p       prefixes             bits gigabytes (allowed: 14 GB)
-- -------------- ---------------- ---------
16          65536       4434263272     0.516
17         131072       4381661820     0.510
18         262144       4333254672     0.504
19         524288       4293236132     0.500
20        1048576       4269994808     0.497 (smallest)
21        2097152       4280307916     0.498
22        4194304       4357729888     0.507
23        8388608       4569369588     0.532
24       16777216       5049444744     0.588
25       33554432       6066390812     0.706
26       67108864       8157078704     0.950
27      134217728      12395250244     1.443
28      268435456      20928389080     2.436
29      536870912      38051462508     4.430 (used)
30     1073741824      72354405120     8.423
31     2147483648     141017086100    16.417
32     4294967296     278399243816    32.410
33     8589934592     553220355004    64.403
-- -------------- ---------------- ---------

For 56795756 distinct 31-mers (with 29 bits used for indexing and 33 bits for tags):
    4.430 GB memory
    4.000 GB memory for index (536870912 elements 64 bits wide)
    0.218 GB memory for tags  (56795756 elements 33 bits wide)
    0.212 GB memory for data  (56795756 elements 32 bits wide)

Will load 56795756 kmers.  Skipping 12911257945 (too low) and 0 (too high) kmers.
Allocating space for 56795756 suffixes of 33 bits each -> 1874259948 bits (0.218 GB) in blocks of 32.000 MB
                     56795756 values   of 32 bits each -> 1817464192 bits (0.212 GB) in blocks of 32.000 MB
Loaded 56795756 kmers.  Skipped 12911257945 (too low) and 0 (too high) kmers.
--   loaded 56795756 kmers.
-- Data loaded.
--
-- Processing reads in batches of 100 reads each.
--
-- Bye.