mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

Genome assembly shorter than expected #567

Closed amycjack closed 1 year ago

amycjack commented 1 year ago

Hello, I'm trying to assemble a 500 Mb genome with PacBio reads. Our reads are relatively short, averaging 7 kbp, with poor coverage of 10x. Flye produces an assembly of around 170 Mb. I presume this is due to our poor coverage? Is there anything in the log output below that looks problematic, or do you have any suggestions for improvement? Many thanks for your help.

[2023-01-28 14:57:26] root: INFO: >>>STAGE: configure
[2023-01-28 14:57:26] root: INFO: Configuring run
[2023-01-28 14:57:47] root: INFO: Total read length: 5425409572
[2023-01-28 14:57:47] root: INFO: Input genome size: 509000000
[2023-01-28 14:57:47] root: INFO: Estimated coverage: 10
[2023-01-28 14:57:47] root: INFO: Reads N50/N90: 10492 / 3449
[2023-01-28 14:57:47] root: INFO: Minimum overlap set to 3000
[2023-01-28 14:57:47] root: INFO: >>>STAGE: assembly
[2023-01-28 14:57:47] root: INFO: Assembling disjointigs
[2023-01-28 14:57:47] root: DEBUG: -----Begin assembly log------
[2023-01-28 14:57:47] root: DEBUG: Running: flye-modules assemble --reads /mainfs/scratch/acj1n18/echium_genome/working_data/all_pacbio_echium.fasta --out-asm /scratch/acj1n18/echium_genome/flye/assembly_alldata/00-assembly/draft_assembly.fasta --config /mainfs/scratch/acj1n18/echium_genome/flye/Flye/flye/config/bin_cfg/asm_raw_reads.cfg --log /scratch/acj1n18/echium_genome/flye/assembly_alldata/flye.log --threads 20 --genome-size 509000000 --min-ovlp 3000
[2023-01-28 14:57:47] DEBUG: Build date: Jan 26 2023 11:20:18
[2023-01-28 14:57:47] DEBUG: Total RAM: 1511 Gb
[2023-01-28 14:57:47] DEBUG: Available RAM: 1410 Gb
[2023-01-28 14:57:47] DEBUG: Total CPUs: 64
[2023-01-28 14:57:47] DEBUG: Loading /mainfs/scratch/acj1n18/echium_genome/flye/Flye/flye/config/bin_cfg/asm_raw_reads.cfg
[2023-01-28 14:57:47] DEBUG: Loading /mainfs/scratch/acj1n18/echium_genome/flye/Flye/flye/config/bin_cfg/asm_defaults.cfg
[2023-01-28 14:57:47] DEBUG: big_genome_threshold=29000000
[2023-01-28 14:57:47] DEBUG: meta_read_filter_kmer_freq=100
[2023-01-28 14:57:47] DEBUG: chain_large_gap_penalty=2
[2023-01-28 14:57:47] DEBUG: chain_small_gap_penalty=0.5
[2023-01-28 14:57:47] DEBUG: chain_gap_jump_threshold=100
[2023-01-28 14:57:47] DEBUG: max_coverage_drop_rate=5
[2023-01-28 14:57:47] DEBUG: max_extensions_drop_rate=5
[2023-01-28 14:57:47] DEBUG: chimera_window=100
[2023-01-28 14:57:47] DEBUG: chimera_overhang=1000
[2023-01-28 14:57:47] DEBUG: min_reads_in_disjointig=4
[2023-01-28 14:57:47] DEBUG: max_inner_reads=10
[2023-01-28 14:57:47] DEBUG: max_inner_fraction=0.25
[2023-01-28 14:57:47] DEBUG: max_separation=500
[2023-01-28 14:57:47] DEBUG: unique_edge_length=50000
[2023-01-28 14:57:47] DEBUG: min_repeat_res_support=0.51
[2023-01-28 14:57:47] DEBUG: out_paths_ratio=5
[2023-01-28 14:57:47] DEBUG: graph_cov_drop_rate=5
[2023-01-28 14:57:47] DEBUG: coverage_estimate_window=100
[2023-01-28 14:57:47] DEBUG: max_bubble_length=50000
[2023-01-28 14:57:47] DEBUG: loop_coverage_rate=1.5
[2023-01-28 14:57:47] DEBUG: repeat_edge_cov_mult=1.75
[2023-01-28 14:57:47] DEBUG: weak_detach_rate=5
[2023-01-28 14:57:47] DEBUG: tip_coverage_rate=2
[2023-01-28 14:57:47] DEBUG: tip_length_rate=2
[2023-01-28 14:57:47] DEBUG: output_gfa_before_rr=0
[2023-01-28 14:57:47] DEBUG: remove_alt_edges=0
[2023-01-28 14:57:47] DEBUG: low_cutoff_warning=1
[2023-01-28 14:57:47] DEBUG: kmer_size=17
[2023-01-28 14:57:47] DEBUG: use_minimizers=0
[2023-01-28 14:57:47] DEBUG: reads_base_alignment=0
[2023-01-28 14:57:47] DEBUG: meta_read_top_kmer_rate=0.40
[2023-01-28 14:57:47] DEBUG: maximum_jump=1500
[2023-01-28 14:57:47] DEBUG: maximum_overhang=1500
[2023-01-28 14:57:47] DEBUG: repeat_kmer_rate=100
[2023-01-28 14:57:47] DEBUG: assemble_ovlp_divergence=0.10
[2023-01-28 14:57:47] DEBUG: assemble_divergence_relative=1
[2023-01-28 14:57:47] DEBUG: repeat_graph_ovlp_divergence=0.08
[2023-01-28 14:57:47] DEBUG: read_align_ovlp_divergence=0.25
[2023-01-28 14:57:47] DEBUG: hpc_scoring_on=0
[2023-01-28 14:57:47] DEBUG: add_unassembled_reads=0
[2023-01-28 14:57:47] DEBUG: extend_contigs_with_repeats=0
[2023-01-28 14:57:47] DEBUG: min_read_cov_cutoff=3
[2023-01-28 14:57:47] DEBUG: short_tip_length=20000
[2023-01-28 14:57:47] DEBUG: long_tip_length=100000
[2023-01-28 14:57:47] DEBUG: Running with k-mer size: 17
[2023-01-28 14:57:47] DEBUG: Running with minimum overlap 3000
[2023-01-28 14:57:47] DEBUG: Metagenome mode: N
[2023-01-28 14:57:47] DEBUG: Short mode: N
[2023-01-28 14:57:47] INFO: Reading sequences
[2023-01-28 14:58:40] DEBUG: Building positional index
[2023-01-28 14:58:40] DEBUG: Total sequence: 4996915438 bp
[2023-01-28 14:58:43] INFO: Counting k-mers:
[2023-01-28 15:01:23] DEBUG: Updating k-mer histogram
[2023-01-28 15:03:52] DEBUG: Hash size: 19925530
[2023-01-28 15:03:52] DEBUG: Total k-mers 1723494274
[2023-01-28 15:03:54] INFO: Filling index table (1/2)
[2023-01-28 15:06:31] DEBUG: Mean k-mer frequency: 8.74894
[2023-01-28 15:06:31] DEBUG: Repetitive k-mer frequency: 874
[2023-01-28 15:06:31] DEBUG: Filtered 466131580 repetitive k-mers (0.2558)
[2023-01-28 15:06:43] INFO: Filling index table (2/2)
[2023-01-28 15:08:56] DEBUG: Sorting k-mer index
[2023-01-28 15:09:23] DEBUG: Selected k-mers: 339346939
[2023-01-28 15:09:23] DEBUG: Index size: 1452625486
[2023-01-28 15:09:23] DEBUG: Mean k-mer index frequency: 4.28065
[2023-01-28 15:09:23] DEBUG: Peak RAM usage: 31 Gb
[2023-01-28 15:09:23] DEBUG: Estimating k-mer identity bias
[2023-01-28 15:09:29] DEBUG: Initial divergence estimate : 0.192152
[2023-01-28 15:09:29] DEBUG: Relative threshold: Y
[2023-01-28 15:09:29] DEBUG: Max divergence threshold set to 0.292152
[2023-01-28 15:09:29] INFO: Extending reads
[2023-01-28 15:09:29] DEBUG: Estimating overlap coverage
[2023-01-28 15:10:44] INFO: Overlap-based coverage: 5
[2023-01-28 15:10:44] INFO: Median overlap divergence: 0.193695
[2023-01-28 15:10:44] DEBUG: Sequence divergence distribution ...........
[2023-01-28 20:03:23] root: INFO: Assembly statistics:

Total length:   170326738
Fragments:  4534
Fragments N50:  156363
Largest frg:    1768564
Scaffolds:  18
Mean coverage:  17
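
For readers following along, a few of the numbers in the log above can be reproduced with simple arithmetic. The sketch below is only a sanity check on the values posted in this issue. The formulas (coverage as total read length divided by genome size, the repetitive k-mer cutoff as mean k-mer frequency times `repeat_kmer_rate`, and the maximum divergence as the initial estimate plus `assemble_ovlp_divergence`) are assumptions inferred from the numbers lining up, not taken from Flye's source code.

```python
# Values copied from the log output posted in this issue.
total_read_length = 5_425_409_572   # "Total read length"
genome_size = 509_000_000           # "Input genome size"
assembly_length = 170_326_738       # "Total length" in the assembly statistics

# Assumed formula: coverage = total bases / genome size (log reports 10).
coverage = total_read_length / genome_size

# Share of the expected genome size that ended up in the assembly.
assembled_fraction = assembly_length / genome_size

# Assumed formula: repetitive cutoff = mean k-mer frequency * repeat_kmer_rate
# (log reports 874).
mean_kmer_freq = 8.74894
repeat_kmer_rate = 100
repetitive_cutoff = int(mean_kmer_freq * repeat_kmer_rate)

# Assumed formula: with the relative threshold enabled, the maximum divergence
# is the initial estimate plus assemble_ovlp_divergence (log reports 0.292152).
initial_divergence = 0.192152
assemble_ovlp_divergence = 0.10
max_divergence = initial_divergence + assemble_ovlp_divergence

print(f"Estimated coverage:      {coverage:.2f}x")
print(f"Assembled fraction:      {assembled_fraction:.1%}")
print(f"Repetitive k-mer cutoff: {repetitive_cutoff}")
print(f"Max divergence:          {max_divergence:.6f}")
```

Under these assumptions the run has roughly 10.7x raw coverage, and the 170 Mb assembly covers about a third of the 509 Mb expected genome size.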
mikolmogorov commented 1 year ago

@amycjack could you send the full flye.log file?

mikolmogorov commented 1 year ago

Assuming this is resolved now, feel free to follow up if not!