mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs

really short genomes assembly (virus) #277

Closed BioRB closed 4 years ago

BioRB commented 4 years ago

Hello Developer, I'm trying to use your tool to assemble a MinION run of a single 7441 bp amplicon. When I launch the program it says: "Expected read coverage is 496301, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?" Is there a specific setup that would work for my experimental conditions? I launched Flye 3 days ago and it is still running. Could you please suggest correct settings for such a short sequence? I don't need a long assembly, just the contig representing the 7441 bp sequence that we specifically sequenced with the MinION. The log file follows below. Thanks. R.B.

[2020-06-12 10:08:36] root: INFO: Starting Flye 2.7.1-b1590
[2020-06-12 10:08:36] root: DEBUG: Cmd: /home/brancaccior/miniconda3/bin/flye --nano-raw /data/icb/minion/work/minion_fin/icb2_fin_1/guppy_out/barcode01/run1_only_bc01.fastq --genome-size 7441 --out-dir ./flyeout --threads 10
[2020-06-12 10:08:36] root: DEBUG: Python version: 3.6.7 | packaged by conda-forge | (default, Nov 21 2018, 02:32:25) [GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]
[2020-06-12 10:08:36] root: INFO: >>>STAGE: configure
[2020-06-12 10:08:36] root: INFO: Configuring run
[2020-06-12 10:11:56] root: INFO: Total read length: 5388009936
[2020-06-12 10:11:56] root: INFO: Input genome size: 7441
[2020-06-12 10:11:56] root: INFO: Estimated coverage: 724097
[2020-06-12 10:11:56] root: WARNING: Expected read coverage is 724097, the assembly is not guaranteed to be optimal in this setting. Are you sure that the genome size was entered correctly?
[2020-06-12 10:11:56] root: INFO: Reads N50/N90: 7329 / 1652
[2020-06-12 10:11:56] root: INFO: Minimum overlap set to 2000
[2020-06-12 10:11:56] root: INFO: Selected k-mer size: 15
[2020-06-12 10:11:56] root: INFO: >>>STAGE: assembly
[2020-06-12 10:11:56] root: INFO: Assembling disjointigs
[2020-06-12 10:11:56] root: DEBUG: -----Begin assembly log------
[2020-06-12 10:11:56] root: DEBUG: Running: flye-modules assemble --reads /data/icb/minion/work/minion_fin/icb2_fin_1/guppy_out/barcode01/run1_only_bc01.fastq --out-asm /home/brancaccior/flyeout/00-assembly/draft_assembly.fasta --genome-size 7441 --config /home/brancaccior/miniconda3/lib/python3.6/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg --log /home/brancaccior/flyeout/flye.log --threads 10 --min-ovlp 2000 --kmer 15
[2020-06-12 10:11:56] DEBUG: Build date: May 8 2020 18:52:44
[2020-06-12 10:11:56] DEBUG: Total RAM: 377 Gb
[2020-06-12 10:11:56] DEBUG: Available RAM: 373 Gb
[2020-06-12 10:11:56] DEBUG: Total CPUs: 80
[2020-06-12 10:11:56] DEBUG: Loading /home/brancaccior/miniconda3/lib/python3.6/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg
[2020-06-12 10:11:56] DEBUG: Loading /home/brancaccior/miniconda3/lib/python3.6/site-packages/flye/config/bin_cfg/asm_defaults.cfg
[2020-06-12 10:11:56] DEBUG: big_genome_threshold=29000000
[2020-06-12 10:11:56] DEBUG: max_coverage_drop_rate=5
[2020-06-12 10:11:56] DEBUG: chimera_window=100
[2020-06-12 10:11:56] DEBUG: min_reads_in_disjointig=4
[2020-06-12 10:11:56] DEBUG: max_inner_reads=10
[2020-06-12 10:11:56] DEBUG: max_inner_fraction=0.25
[2020-06-12 10:11:56] DEBUG: max_separation=500
[2020-06-12 10:11:56] DEBUG: unique_edge_length=50000
[2020-06-12 10:11:56] DEBUG: min_repeat_res_support=0.51
[2020-06-12 10:11:56] DEBUG: out_paths_ratio=5
[2020-06-12 10:11:56] DEBUG: graph_cov_drop_rate=5
[2020-06-12 10:11:56] DEBUG: coverage_estimate_window=100
[2020-06-12 10:11:56] DEBUG: max_bubble_length=50000
[2020-06-12 10:11:56] DEBUG: loop_coverage_rate=1.5
[2020-06-12 10:11:56] DEBUG: repeat_edge_cov_mult=1.75
[2020-06-12 10:11:56] DEBUG: weak_detach_rate=5
[2020-06-12 10:11:56] DEBUG: tip_coverage_rate=2
[2020-06-12 10:11:56] DEBUG: tip_length_rate=2
[2020-06-12 10:11:56] DEBUG: low_cutoff_warning=1
[2020-06-12 10:11:56] DEBUG: hard_min_coverage_rate=10
[2020-06-12 10:11:56] DEBUG: assemble_kmer_sample=1
[2020-06-12 10:11:56] DEBUG: repeat_graph_kmer_sample=1
[2020-06-12 10:11:56] DEBUG: read_align_kmer_sample=1
[2020-06-12 10:11:56] DEBUG: meta_read_top_kmer_rate=0.25
[2020-06-12 10:11:56] DEBUG: meta_read_filter_kmer_freq=10
[2020-06-12 10:11:56] DEBUG: maximum_jump=1500
[2020-06-12 10:11:56] DEBUG: maximum_overhang=1500
[2020-06-12 10:11:56] DEBUG: repeat_kmer_rate=100
[2020-06-12 10:11:56] DEBUG: assemble_ovlp_relative_divergence=0.10
[2020-06-12 10:11:56] DEBUG: repeat_graph_ovlp_divergence=0.10
[2020-06-12 10:11:56] DEBUG: read_align_ovlp_divergence=0.25
[2020-06-12 10:11:56] DEBUG: add_unassembled_reads=0
[2020-06-12 10:11:56] DEBUG: extend_contigs_with_repeats=1
[2020-06-12 10:11:56] DEBUG: min_read_cov_cutoff=3
[2020-06-12 10:11:56] DEBUG: short_tip_length=20000
[2020-06-12 10:11:56] DEBUG: long_tip_length=100000
[2020-06-12 10:11:56] DEBUG: Running with k-mer size: 15
[2020-06-12 10:11:56] DEBUG: Running with minimum overlap 2000
[2020-06-12 10:11:56] DEBUG: Metagenome mode: N
[2020-06-12 10:11:56] INFO: Reading sequences
[2020-06-12 10:12:34] DEBUG: Building positional index
[2020-06-12 10:12:34] DEBUG: Total sequence: 5388009936 bp
[2020-06-12 10:12:34] DEBUG: Expected read coverage: 724097
[2020-06-12 10:12:34] INFO: Generating solid k-mer index
[2020-06-12 10:12:34] DEBUG: Hard threshold set to 5
[2020-06-12 10:12:34] DEBUG: Started k-mer counting
[2020-06-12 10:12:49] INFO: Counting k-mers (1/2):
[2020-06-12 10:13:17] INFO: Counting k-mers (2/2):
[2020-06-12 10:14:53] DEBUG: Estimated minimum kmer coverage: 175885
[2020-06-12 10:14:53] DEBUG: Filtered 37198216 erroneous k-mers
[2020-06-12 10:14:53] DEBUG: Repetitive k-mer frequency: 40227536
[2020-06-12 10:14:53] DEBUG: Filtered 0 repetitive k-mers (0)
[2020-06-12 10:14:53] INFO: Filling index table
[2020-06-12 10:14:53] DEBUG: Sampling rate: 1
[2020-06-12 10:14:53] DEBUG: Solid k-mers: 7441
[2020-06-12 10:14:53] DEBUG: K-mer index size: 2993733424
[2020-06-12 10:14:53] DEBUG: Mean k-mer frequency: 402329
[2020-06-12 10:15:57] DEBUG: Sorting k-mer index
[2020-06-12 10:17:14] DEBUG: Peak RAM usage: 17 Gb
[2020-06-12 10:17:14] DEBUG: Estimating k-mer identity bias
[2020-06-12 14:49:40] DEBUG: Median overlap divergence: 0.135055
[2020-06-12 14:49:40] DEBUG: K-mer estimate bias (true - est): 0.0549342
[2020-06-12 14:49:40] DEBUG: Max divergence threshold set to 0.235055
[2020-06-12 14:49:40] INFO: Extending reads
[2020-06-12 14:49:40] DEBUG: Estimating overlap coverage
[2020-06-14 08:45:33] INFO: Overlap-based coverage: 606404
[2020-06-14 08:45:33] INFO: Median overlap divergence: 0.121465
[2020-06-14 08:45:33] DEBUG: Sequence divergence distribution:

|                        *                      |                                                    
|                        *                      |                                                    
|                        *                      |                                                    
|                      * *                      |                                                    
|                     ****                      |                                                    
|                     ****                      |                                                    
|                     ****                      |                                                    
|                     **** *                    |                                                    
|                     ******                    |                                                    
|                     ******                    |                                                    
|                     ******                    |                                                    
|                     ******                    |                                                    
|                    *******                    |                                                    
|                    *******                    |                                                    
|                    *******                    |                                                    
|                    *********                  |                                                    
|                    ********** *               |                                                    
|                    ************ *             |                                                    
|                    ************ *             |                                                    
|                   *************** * * *   *** |              *                                     
----------------------------------------------------------------------------------------------------
0%        5%        10%       15%       20%       25%       30%       35%       40%       45%       

Q25 = 0.11, Q50 = 0.12, Q75 = 0.13

[2020-06-15 11:41:11] root: ERROR: Looks like the system ran out of memory
[2020-06-15 11:41:11] root: ERROR: Command '['flye-modules', 'assemble', '--reads', '/data/icb/minion/work/minion_fin/icb2_fin_1/guppy_out/barcode01/run1_only_bc01.fastq', '--out-asm', '/home/brancaccior/flyeout/00-assembly/draft_assembly.fasta', '--genome-size', '7441', '--config', '/home/brancaccior/miniconda3/lib/python3.6/site-packages/flye/config/bin_cfg/asm_raw_reads.cfg', '--log', '/home/brancaccior/flyeout/flye.log', '--threads', '10', '--min-ovlp', '2000', '--kmer', '15']' died with <Signals.SIGKILL: 9>.
[2020-06-15 11:41:11] root: ERROR: Pipeline aborted
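
For readers following the log, the coverage figure in the warning can be reproduced from two numbers it reports: total read length and input genome size. The sketch below is only illustrative arithmetic. The function names are hypothetical, and the use of integer division is an assumption that happens to match the logged value, not a statement about Flye's internals.

```python
def expected_coverage(total_read_bases: int, genome_size: int) -> int:
    """Total sequenced bases divided by genome size, truncated to an integer.

    Truncation is an assumption; it reproduces the value in the log above.
    """
    return total_read_bases // genome_size


def bases_for_coverage(target_coverage: int, genome_size: int) -> int:
    """Number of bases needed to cover the genome `target_coverage` times."""
    return target_coverage * genome_size


total_bases = 5388009936  # "Total read length" from the log
genome_size = 7441        # "Input genome size" from the log

coverage = expected_coverage(total_bases, genome_size)
needed = bases_for_coverage(100, genome_size)

print(coverage)              # 724097, as in the "Estimated coverage" line
print(needed)                # 744100 bases would give 100x coverage
print(needed / total_bases)  # fraction of the data that 100x represents
```

On these numbers, 100x coverage corresponds to roughly 0.014% of the sequenced bases, which gives a sense of how far this dataset sits from the coverage range the warning refers to.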

nahanoo commented 4 years ago

The developers state in their documentation that short genomes are not suitable for Flye. However, you can try your luck by setting --asm-coverage 100 and adding the --plasmids flag. It may also help to play around with the --min-overlap flag, which Flye set to 2000 bp in your run. Cheers
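
A minimal sketch of how the suggestion above could be turned into a command line, reusing the options from the original `Cmd:` log entry. The helper `build_flye_command` is hypothetical, the option is spelled `--plasmids` here on the assumption that this matches the Flye 2.7 usage text, and the `--min-overlap` value of 1000 is purely illustrative, not a tested recommendation. The snippet only builds and prints the command; it does not run Flye.

```python
import shlex


def build_flye_command(reads, genome_size, out_dir, threads=10,
                       asm_coverage=100, min_overlap=None, plasmids=True):
    """Assemble a flye argument list (hypothetical helper, for illustration)."""
    cmd = [
        "flye",
        "--nano-raw", reads,
        "--genome-size", str(genome_size),
        "--out-dir", out_dir,
        "--threads", str(threads),
        # Use only a reduced-coverage subset for the initial disjointig assembly.
        "--asm-coverage", str(asm_coverage),
    ]
    if plasmids:
        cmd.append("--plasmids")
    if min_overlap is not None:
        cmd += ["--min-overlap", str(min_overlap)]
    return cmd


cmd = build_flye_command(
    reads="run1_only_bc01.fastq",  # file name taken from the log above
    genome_size=7441,
    out_dir="./flyeout",
    min_overlap=1000,              # illustrative value only
)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list passed to `subprocess.run(cmd, check=True)` once the values have been checked against `flye --help` for the installed version.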

mikolmogorov commented 4 years ago

@BioRB

As @nahanoo noted, Flye currently does not do well on very short sequences (e.g. short viruses or amplicons). In your case assembly does not seem to be required, as the target genome is fully covered by single reads. I'd try running any polishing method on one of your longest reads.

Best, Mikhail
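
One way to act on this advice is to pull a single full-length read out of the FASTQ file and use it as the draft for polishing. The sketch below is a rough illustration under stated assumptions: it expects plain four-line FASTQ records with Phred+33 qualities, it keeps reads within 5% of the expected amplicon length rather than strictly the longest read (on the assumption that much longer reads may be artifacts), and it ranks candidates by arithmetic mean quality, which is only a crude proxy. The names `read_fastq` and `pick_full_length_read` are hypothetical, and the demo records are made up.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # separator line starting with '+'
        qual = handle.readline().rstrip()
        yield header[1:].split()[0], seq, qual


def pick_full_length_read(handle, target_len, tolerance=0.05):
    """Return (mean_quality, name, sequence) of the best near-full-length read.

    Returns None when no read falls within the length window.
    """
    best = None
    for name, seq, qual in read_fastq(handle):
        if abs(len(seq) - target_len) > tolerance * target_len:
            continue
        mean_q = sum(ord(c) - 33 for c in qual) / len(qual)
        if best is None or mean_q > best[0]:
            best = (mean_q, name, seq)
    return best


# Tiny in-memory demo with a 10 bp "amplicon"; real use would pass
# open("run1_only_bc01.fastq") and target_len=7441.
demo = (
    "@r1 extra\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nACGTACGTAC\n+\n5555555555\n"
)
best = pick_full_length_read(io.StringIO(demo), target_len=10)
print(best)  # (40.0, 'r1', 'ACGTACGTAC')
```

The selected sequence can then be written to a FASTA file and handed to whichever polishing tool is preferred, with the full read set as input.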