mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

No disjointigs were assembled #211

Closed Chandrima-04 closed 4 years ago

Chandrima-04 commented 4 years ago

Hi, I looked through the suggestions and tried the --meta and --asm-coverage parameters, but no assembly is produced. I have raw metagenomic data from Oxford Nanopore. I am getting an overlap-based coverage of 1. In past issues, the runs that failed assembly had a coverage of 0. I am attaching the log file too: flye.log

mikolmogorov commented 4 years ago

Hi,

From what I can tell, you are assembling a very short sequence (e.g. 100kb) - is that so? Flye was not designed for that, unfortunately (e.g. for amplicons / viral sequences).

Chandrima-04 commented 4 years ago

No it is metagenome, mostly bacterial!

mikolmogorov commented 4 years ago

Ok. I don't see anything else wrong in the log file. Most likely, there is simply not enough coverage to assemble any chromosomes. There are 38 Mb of reads, which would not be sufficient to assemble even an isolate, and a metagenome could be much larger - we have experience in assembling gigabases.

ptrebert commented 4 years ago

I have the same error with Flye 2.6 (installed via Conda) with a PacBio human dataset (uncorrected reads, ~80x total coverage; assembled with preset --pacbio-raw and --asm-coverage 50, expected genome size was set to 3.1g).

frihaka commented 4 years ago

Hi, thanks a lot for this software, it's really great. I am using it for bacterial genome assembly. Flye 2.6 has been working very well with other datasets so far (16-plexed ones).

But with higher-depth datasets, I cannot make it work anymore. I am running the default command:

flye --pacbio-raw bbmap_fasta/dataset.fasta --genome-size 1.1m --out-dir flye_default_param/dataset --threads 12

The run fails with the same error message as for other users above:

[2020-02-03 06:58:49] root: INFO: Starting Flye 2.6-release
[2020-02-03 06:58:49] root: DEBUG: Cmd: /home/user/miniconda2/bin/flye --pacbio-raw bbmap_fasta/dataset.fasta --genome-size 1.1m --out-dir flye_default_param/dataset --threads 12
[2020-02-03 06:58:49] root: DEBUG: Python version: 2.7.17 |Anaconda, Inc.| (default, Oct 21 2019, 19:04:46) 
[GCC 7.3.0]
[2020-02-03 06:58:49] root: INFO: >>>STAGE: configure
[2020-02-03 06:58:49] root: INFO: Configuring run
[2020-02-03 07:00:10] root: INFO: Total read length: 3291173741
[2020-02-03 07:00:10] root: INFO: Input genome size: 1100000
[2020-02-03 07:00:10] root: INFO: Estimated coverage: 2991
[2020-02-03 07:00:10] root: WARNING: Expected read coverage is 2991, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2020-02-03 07:00:10] root: INFO: Reads N50/N90: 5434 / 1988
[2020-02-03 07:00:10] root: INFO: Minimum overlap set to 2000
[2020-02-03 07:00:10] root: INFO: Selected k-mer size: 15
[2020-02-03 07:00:10] root: INFO: >>>STAGE: assembly
[2020-02-03 07:00:10] root: INFO: Assembling disjointigs
[2020-02-03 07:00:10] root: DEBUG: -----Begin assembly log------
[2020-02-03 07:00:10] root: DEBUG: Running: flye-assemble --reads bbmap_fasta/dataset.fasta --out-asm flye_default_param/dataset/00-assembly/draft_assembly.fasta --genome-size 1100000 --config /home/user/miniconda2/lib/python2.7/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg --log flye_default_param/dataset/flye.log --threads 12 --min-ovlp 2000 --kmer 15
[2020-02-03 07:00:10] DEBUG: Build date: Sep 19 2019 20:15:45
[2020-02-03 07:00:10] DEBUG: Total RAM: 376 Gb
[2020-02-03 07:00:10] DEBUG: Available RAM: 368 Gb
[2020-02-03 07:00:10] DEBUG: Total CPUs: 40
[2020-02-03 07:00:10] DEBUG: Parameters:
[2020-02-03 07:00:10] DEBUG:    big_genome_threshold=29000000
[2020-02-03 07:00:10] DEBUG:    low_cutoff_warning=1
[2020-02-03 07:00:10] DEBUG:    hard_min_coverage_rate=10
[2020-02-03 07:00:10] DEBUG:    assemble_kmer_sample=1
[2020-02-03 07:00:10] DEBUG:    repeat_graph_kmer_sample=1
[2020-02-03 07:00:10] DEBUG:    read_align_kmer_sample=1
[2020-02-03 07:00:10] DEBUG:    maximum_jump=1500
[2020-02-03 07:00:10] DEBUG:    maximum_overhang=1500
[2020-02-03 07:00:10] DEBUG:    repeat_kmer_rate=100
[2020-02-03 07:00:10] DEBUG:    assemble_ovlp_relative_divergence=0.10
[2020-02-03 07:00:10] DEBUG:    repeat_graph_ovlp_divergence=0.15
[2020-02-03 07:00:10] DEBUG:    read_align_ovlp_divergence=0.25
[2020-02-03 07:00:10] DEBUG:    max_coverage_drop_rate=5
[2020-02-03 07:00:10] DEBUG:    chimera_window=100
[2020-02-03 07:00:10] DEBUG:    min_reads_in_disjointig=4
[2020-02-03 07:00:10] DEBUG:    max_inner_reads=10
[2020-02-03 07:00:10] DEBUG:    max_inner_fraction=0.25
[2020-02-03 07:00:10] DEBUG:    add_unassembled_reads=0
[2020-02-03 07:00:10] DEBUG:    max_separation=500
[2020-02-03 07:00:10] DEBUG:    unique_edge_length=50000
[2020-02-03 07:00:10] DEBUG:    min_repeat_res_support=0.51
[2020-02-03 07:00:10] DEBUG:    out_paths_ratio=5
[2020-02-03 07:00:10] DEBUG:    graph_cov_drop_rate=5
[2020-02-03 07:00:10] DEBUG:    coverage_estimate_window=100
[2020-02-03 07:00:10] DEBUG:    extend_contigs_with_repeats=1
[2020-02-03 07:00:10] DEBUG:    min_read_cov_cutoff=3
[2020-02-03 07:00:10] DEBUG:    short_tip_length=10000
[2020-02-03 07:00:10] DEBUG:    long_tip_length=100000
[2020-02-03 07:00:10] DEBUG:    max_bubble_length=50000
[2020-02-03 07:00:10] DEBUG: Running with k-mer size: 15
[2020-02-03 07:00:10] DEBUG: Running with minimum overlap 2000
[2020-02-03 07:00:10] DEBUG: Metagenome mode: N
[2020-02-03 07:00:10] INFO: Reading sequences
[2020-02-03 07:01:06] DEBUG: Building positional index
[2020-02-03 07:01:06] DEBUG: Total sequence: 3291173741 bp
[2020-02-03 07:01:06] DEBUG: Expected read coverage: 2991
[2020-02-03 07:01:06] INFO: Generating solid k-mer index
[2020-02-03 07:01:06] DEBUG: Hard threshold set to 5
[2020-02-03 07:01:06] DEBUG: Started k-mer counting
[2020-02-03 07:01:20] INFO: Counting k-mers (1/2):
[2020-02-03 07:01:50] INFO: Counting k-mers (2/2):
[2020-02-03 07:03:00] DEBUG: Estimated minimum kmer coverage: 507
[2020-02-03 07:03:00] DEBUG: Filtered 88309991 erroneous k-mers
[2020-02-03 07:03:00] DEBUG: Repetitive k-mer frequency: 95540
[2020-02-03 07:03:00] DEBUG: Filtered 14 repetitive k-mers (1.27291e-05)
[2020-02-03 07:03:00] INFO: Filling index table
[2020-02-03 07:03:01] DEBUG: Sampling rate: 1
[2020-02-03 07:03:01] DEBUG: Solid k-mers: 1099828
[2020-02-03 07:03:01] DEBUG: K-mer index size: 1045597867
[2020-02-03 07:03:01] DEBUG: Mean k-mer frequency: 950.692
[2020-02-03 07:03:52] DEBUG: Sorting k-mer index
[2020-02-03 07:04:09] DEBUG: Peak RAM usage: 6 Gb
[2020-02-03 07:04:09] DEBUG: Estimating k-mer identity bias
[2020-02-03 07:04:53] DEBUG: Median overlap divergence: 0.170039
[2020-02-03 07:04:53] DEBUG: K-mer estimate bias: -0.00528269
[2020-02-03 07:04:53] DEBUG: Max divergence threshold set to 0.270039
[2020-02-03 07:04:53] INFO: Extending reads
[2020-02-03 07:04:53] DEBUG: Estimating overlap coverage
[2020-02-03 07:09:37] INFO: Overlap-based coverage: 1685
[2020-02-03 07:09:37] INFO: Median overlap divergence: 0.174149
[2020-02-03 07:09:37] DEBUG: Sequence divergence distribution: 

    |                               **                     |                                             
    |                               ***                    |                                             
    |                               ****                   |                                             
    |                              *****                   |                                             
    |                              *****                   |                                             
    |                              *****                   |                                             
    |                              ********                |                                             
    |                              ********                |                                             
    |                             **********               |                                             
    |                             **********               |                                             
    |                             **********               |                                             
    |                            *************             |                                             
    |                            *************             |                                             
    |                            *************             |                                             
    |                            **************            |                                             
    |                            ***************  *        |                                             
    |                            **************** * **     |                                             
    |                           ***********************    |                                             
    |                          *************************   |        *                                    
    |                        ********************************   * * **     * **  *   * **      *         
    ----------------------------------------------------------------------------------------------------
    0%        5%        10%       15%       20%       25%       30%       35%       40%       45%       

    Q25 = 0.16, Q50 = 0.17, Q75 = 0.2

[2020-02-03 13:37:09] INFO: Assembled 0 disjointigs
[2020-02-03 13:37:09] INFO: Generating sequence
[2020-02-03 13:37:09] DEBUG: Writing FASTA
[2020-02-03 13:37:09] DEBUG: Peak RAM usage: 26 Gb
-----------End assembly log------------
[2020-02-03 13:37:10] root: ERROR: No disjointigs were assembled - please check if the read type and genome size parameters are correct
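
As a rough check on the warning in the log above, the "Estimated coverage" figure can be reproduced from the two logged inputs. This is a minimal sketch that assumes the estimate is simply total read length divided by the input genome size, rounded down; the function name `estimated_coverage` is hypothetical and not part of Flye.

```python
# Figures copied from the log above.
TOTAL_READ_LENGTH = 3291173741  # "Total read length"
GENOME_SIZE = 1100000           # "Input genome size" (--genome-size 1.1m)


def estimated_coverage(total_bases: int, genome_size: int) -> int:
    """Return total bases over genome size, rounded down.

    Assumption: integer division is used here only because it matches
    the value printed in the log; Flye's own rounding is not confirmed.
    """
    return total_bases // genome_size


print(estimated_coverage(TOTAL_READ_LENGTH, GENOME_SIZE))  # 2991
```

A value this far above typical sequencing depth is what triggers the "Are you sure that the genome size was entered correctly?" warning.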

mikolmogorov commented 4 years ago

@ptrebert @frihaka please follow the suggestions from #128.

I am marking this issue as a duplicate. Please continue the discussion in #128 if those solutions did not help.

frihaka commented 4 years ago

Sorry, I had missed #128.

Indeed, playing with --asm-coverage values and --meta options solved the issue. Thanks!
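
For anyone landing here, a rerun along these lines can be sketched as below. The snippet only builds the command line; it does not run Flye. The flags --asm-coverage and --meta are the ones named in this thread, but the value 50 is illustrative (borrowed from the earlier comment about the human dataset), the output directory name is made up, and the helper `build_flye_command` is hypothetical. The values that actually worked were not posted.

```python
from typing import List, Optional


def build_flye_command(
    reads: str,
    genome_size: str,
    out_dir: str,
    threads: int = 12,
    asm_coverage: Optional[int] = None,
    meta: bool = False,
) -> List[str]:
    """Build a flye argument list mirroring the command used above."""
    cmd = [
        "flye",
        "--pacbio-raw", reads,
        "--genome-size", genome_size,
        "--out-dir", out_dir,
        "--threads", str(threads),
    ]
    if asm_coverage is not None:
        # Use a reduced read coverage for the initial disjointig assembly.
        cmd += ["--asm-coverage", str(asm_coverage)]
    if meta:
        # Metagenome / uneven-coverage mode.
        cmd.append("--meta")
    return cmd


cmd = build_flye_command(
    "bbmap_fasta/dataset.fasta",
    "1.1m",
    "flye_asm_cov50/dataset",  # hypothetical output directory
    asm_coverage=50,           # illustrative value, not a recommendation
)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where Flye is installed.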

yige-luo commented 3 years ago

> Hi,
>
> From what I can tell, you are assembling a very short sequence (e.g. 100kb) - is that so? Flye was not designed for that, unfortunately (e.g. for amplicons / viral sequences).

Hi,

I have a quick question - can the latest Flye version (2.8.2) handle very short assemblies (amplicon/viral)?

mikolmogorov commented 3 years ago

@drosophila92 There were no significant changes with that. Flye might assemble some, but full support is not guaranteed.