mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

How long to give 'sorting kmer index' with 32 threads #280

Closed plattsad closed 4 years ago

plattsad commented 4 years ago

Hi. This may be normal, as I'm assembling a fairly large genome (2.2 Gbase expected, possibly as high as 4.4 Gbase if the assembly is fully diploid) from Canu-corrected PacBio reads, but I wondered how long to give the k-mer index sorting before giving up? flye-modules is currently the only significant user process on a dedicated 380 GB/56-core assembly machine. It's been running with 32 threads for about 9 hours in the step "Sorting k-mer index". Looking at other logs in the Issues, I see this step taking about 20 minutes.

Log:

[2020-06-24 16:40:32] root: INFO: Starting Flye 2.7.1-b1590
[2020-06-24 16:40:32] root: DEBUG: Cmd: /data/aplatts/miniconda2/envs/flye/bin/flye --pacbio-corr ../****.correctedReads.fasta --out-dir /storage/aplatts/Hex_blue/flye --threads 32 --iterations 2 -g 2.2g
[2020-06-24 16:40:32] root: DEBUG: Python version: 3.7.6 | packaged by conda-forge | (default, Jun 1 2020, 18:57:50) [GCC 7.5.0]
[2020-06-24 16:40:32] root: INFO: >>>STAGE: configure
[2020-06-24 16:40:32] root: INFO: Configuring run
[2020-06-24 16:56:05] root: INFO: Total read length: 166116567500
[2020-06-24 16:56:05] root: INFO: Input genome size: 2200000000
[2020-06-24 16:56:05] root: INFO: Estimated coverage: 75
[2020-06-24 16:56:05] root: INFO: Reads N50/N90: 21797 / 8056

[2020-06-25 16:14:50] DEBUG: Ovlp index size: 19795733
[2020-06-25 16:14:50] DEBUG: Inner: 3360412 covered: 3939560 total: 3942290
[2020-06-25 16:24:21] INFO: Assembled 11679 disjointigs
[2020-06-25 16:24:56] INFO: Generating sequence
[2020-06-25 19:15:42] DEBUG: Writing FASTA
[2020-06-25 19:16:20] DEBUG: Peak RAM usage: 217 Gb
-----------End assembly log------------
[2020-06-25 19:17:02] root: DEBUG: Disjointigs length: 3414329047, N50: 401717
[2020-06-25 19:17:03] root: INFO: >>>STAGE: consensus
[2020-06-25 19:17:53] root: INFO: Running Minimap2
[2020-06-26 02:49:54] root: INFO: Computing consensus
[2020-06-26 05:59:56] root: INFO: Alignment error rate: 0.033798
[2020-06-26 06:00:53] root: INFO: >>>STAGE: repeat
[2020-06-26 06:00:53] root: INFO: Building and resolving repeat graph
[2020-06-26 06:00:53] root: DEBUG: -----Begin repeat analyser log------
[2020-06-26 06:00:53] root: DEBUG: Running: flye-modules repeat --disjointigs /storage/aplatts/Hex_blue/flye/10-consensus/consensus.fasta --reads ../*******.correctedReads.fasta --out-dir /storage/aplatts/Hex_blue/flye/20-repeat --config /data/aplatts/miniconda2/envs/flye/lib/python3.7/site-packages/flye/config/bin_cfg/asm_corrected_reads.cfg --log /storage/aplatts/Hex_blue/flye/flye.log --threads 32 --min-ovlp 5000 --kmer 17
[2020-06-26 06:00:54] DEBUG: Build date: May 8 2020 18:45:50
[2020-06-26 06:00:54] DEBUG: Total RAM: 377 Gb
[2020-06-26 06:00:54] DEBUG: Available RAM: 358 Gb
[2020-06-26 06:00:54] DEBUG: Total CPUs: 56
[2020-06-26 06:00:54] DEBUG: Loading /data/aplatts/miniconda2/envs/flye/lib/python3.7/site-packages/flye/config/bin_cfg/asm_corrected_reads.cfg
[2020-06-26 06:00:54] DEBUG: Loading /data/aplatts/miniconda2/envs/flye/lib/python3.7/site-packages/flye/config/bin_cfg/asm_defaults.cfg
[2020-06-26 06:00:54] DEBUG: big_genome_threshold=29000000
[2020-06-26 06:00:54] DEBUG: max_coverage_drop_rate=5
[2020-06-26 06:00:54] DEBUG: chimera_window=100
[2020-06-26 06:00:54] DEBUG: min_reads_in_disjointig=4
[2020-06-26 06:00:54] DEBUG: max_inner_reads=10
[2020-06-26 06:00:54] DEBUG: max_inner_fraction=0.25
[2020-06-26 06:00:54] DEBUG: max_separation=500
[2020-06-26 06:00:54] DEBUG: unique_edge_length=50000
[2020-06-26 06:00:54] DEBUG: min_repeat_res_support=0.51
[2020-06-26 06:00:54] DEBUG: out_paths_ratio=5
[2020-06-26 06:00:54] DEBUG: graph_cov_drop_rate=5
[2020-06-26 06:00:54] DEBUG: coverage_estimate_window=100
[2020-06-26 06:00:54] DEBUG: max_bubble_length=50000
[2020-06-26 06:00:54] DEBUG: loop_coverage_rate=1.5
[2020-06-26 06:00:54] DEBUG: repeat_edge_cov_mult=1.75
[2020-06-26 06:00:54] DEBUG: weak_detach_rate=5
[2020-06-26 06:00:54] DEBUG: tip_coverage_rate=2
[2020-06-26 06:00:54] DEBUG: tip_length_rate=2
[2020-06-26 06:00:54] DEBUG: low_cutoff_warning=0
[2020-06-26 06:00:54] DEBUG: hard_min_coverage_rate=50
[2020-06-26 06:00:54] DEBUG: assemble_kmer_sample=2
[2020-06-26 06:00:54] DEBUG: repeat_graph_kmer_sample=2
[2020-06-26 06:00:54] DEBUG: read_align_kmer_sample=2
[2020-06-26 06:00:54] DEBUG: meta_read_top_kmer_rate=0.75
[2020-06-26 06:00:54] DEBUG: meta_read_filter_kmer_freq=50
[2020-06-26 06:00:54] DEBUG: maximum_jump=1500
[2020-06-26 06:00:54] DEBUG: maximum_overhang=500
[2020-06-26 06:00:54] DEBUG: repeat_kmer_rate=100
[2020-06-26 06:00:54] DEBUG: assemble_ovlp_relative_divergence=0.03
[2020-06-26 06:00:54] DEBUG: repeat_graph_ovlp_divergence=0.03
[2020-06-26 06:00:54] DEBUG: read_align_ovlp_divergence=0.03
[2020-06-26 06:00:54] DEBUG: add_unassembled_reads=0
[2020-06-26 06:00:54] DEBUG: extend_contigs_with_repeats=1
[2020-06-26 06:00:54] DEBUG: min_read_cov_cutoff=3
[2020-06-26 06:00:54] DEBUG: short_tip_length=10000
[2020-06-26 06:00:54] DEBUG: long_tip_length=100000
[2020-06-26 06:00:54] DEBUG: Running with k-mer size: 17
[2020-06-26 06:00:54] DEBUG: Selected minimum overlap 5000
[2020-06-26 06:00:54] DEBUG: Metagenome mode: N
[2020-06-26 06:00:54] INFO: Parsing disjointigs
[2020-06-26 06:01:16] DEBUG: Building positional index
[2020-06-26 06:01:16] DEBUG: Total sequence: 3421613339 bp
[2020-06-26 06:01:16] INFO: Building repeat graph
[2020-06-26 06:01:16] DEBUG: Hard threshold set to 1
[2020-06-26 06:01:16] DEBUG: Started k-mer counting
[2020-06-26 06:02:39] DEBUG: Repetitive k-mer frequency: 348
[2020-06-26 06:02:39] DEBUG: Filtered 109330 repetitive k-mers (0.000222484)
[2020-06-26 06:02:44] DEBUG: Sampling rate: 2
[2020-06-26 06:02:44] DEBUG: Solid k-mers: 491296629
[2020-06-26 06:02:44] DEBUG: K-mer index size: 1615922644
[2020-06-26 06:02:44] DEBUG: Mean k-mer frequency: 3.2891
[2020-06-26 06:05:37] DEBUG: Sorting k-mer index

It's now 3:32pm ... not sure whether I should restart with different params, try to roll back to version 2.7.0, or just keep waiting?

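For anyone wanting to put numbers on "how long has this step been running", here is a minimal sketch that reads the bracketed timestamps from a few of the log lines above and prints the time between consecutive entries. The helper names (`parse_entry`, `step_durations`, `elapsed_since_last`) are made up for this illustration and are not part of Flye. The date attached to "3:32pm" is an assumption (the same day as the last log entry).

```python
from datetime import datetime

# A handful of entries copied verbatim from the log above.
LOG_LINES = [
    "[2020-06-25 19:17:53] root: INFO: Running Minimap2",
    "[2020-06-26 02:49:54] root: INFO: Computing consensus",
    "[2020-06-26 06:00:53] root: INFO: >>>STAGE: repeat",
    "[2020-06-26 06:05:37] DEBUG: Sorting k-mer index",
]


def parse_entry(line):
    """Split '[YYYY-MM-DD HH:MM:SS] message' into (datetime, message)."""
    stamp, _, message = line.partition("] ")
    when = datetime.strptime(stamp.lstrip("["), "%Y-%m-%d %H:%M:%S")
    return when, message


def step_durations(lines):
    """Return (message, timedelta) pairs: time from each entry to the next."""
    entries = [parse_entry(line) for line in lines]
    return [
        (message, following - when)
        for (when, message), (following, _) in zip(entries, entries[1:])
    ]


def elapsed_since_last(lines, now):
    """Time between the last logged entry and the given moment."""
    last_time, _ = parse_entry(lines[-1])
    return now - last_time


for message, delta in step_durations(LOG_LINES):
    print(f"{delta}  {message}")

# Assumption: "3:32pm" means 2020-06-26, the day of the last log entry.
print(elapsed_since_last(LOG_LINES, datetime(2020, 6, 26, 15, 32)))
```

On these lines, the Minimap2 step accounts for roughly seven and a half hours and the consensus step for a little over three, while the sorting step had been running for about nine and a half hours at the time of posting.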
plattsad commented 4 years ago

It completed a few hours later ...
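
As a sanity check on the figures in the log, the sketch below redoes some of the arithmetic from the logged values. The function name `coverage` is invented for this illustration. Whether Flye derives its "Estimated coverage" and "Mean k-mer frequency" lines exactly this way is an assumption; the ratios simply appear to agree with what the log reports.

```python
# Values copied from the log above.
total_read_length = 166_116_567_500   # "Total read length"
genome_size = 2_200_000_000           # "Input genome size" (-g 2.2g)
disjointigs_length = 3_414_329_047    # "Disjointigs length"
solid_kmers = 491_296_629             # "Solid k-mers"
kmer_index_size = 1_615_922_644       # "K-mer index size"


def coverage(read_bases, genome_bases):
    """Average read depth: total sequenced bases per genome base."""
    return read_bases / genome_bases


# Depth at the expected haploid size, and at the 4.4 Gbase upper bound
# mentioned in the original post.
print(f"coverage at 2.2 Gbase: {coverage(total_read_length, genome_size):.1f}x")
print(f"coverage at 4.4 Gbase: {coverage(total_read_length, 2 * genome_size):.1f}x")

# How much larger the disjointig set is than the requested genome size.
print(f"disjointigs / genome size: {disjointigs_length / genome_size:.2f}")

# Index entries per solid k-mer; compare with the logged mean k-mer frequency.
print(f"index entries per solid k-mer: {kmer_index_size / solid_kmers:.4f}")
```

The first ratio truncates to the 75 reported as estimated coverage, and the last rounds to the logged 3.2891. The disjointigs total being about one and a half times the requested genome size would be consistent with a partly diploid assembly, though that is only a reading of the numbers.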