mikolmogorov / Flye

De novo assembler for single molecule sequencing reads using repeat graphs
Other
781 stars 168 forks source link

subassemblies clarification question #56

Closed devonorourke closed 5 years ago

devonorourke commented 6 years ago

Hi Mikhail, Flye is working great for my mammalian genome assemblies (using raw Nanopore data). One of the species I'm working on has an existing reference completed over a decade ago with Sanger data and I was wondering if the subassemblies argument in Flye could work by combining my Nanopore data assembled with Flye with these Sanger-derived contigs? Where would that process start within Flye? Thanks very much, Devon

mikolmogorov commented 6 years ago

Hi Devon,

Glad that it worked well for you! Yes, you can definitely try --subassemblies option for merging these two assemblies. You basically need to make a separate run and provide the two assemblies as input. One thing to keep in mind is that if there are structural differences between the two sets of contigs, they will be resolved arbitrary in the combined assembly (since there is no obvious way to check which of the two variants is correct). Also, the versions prior to 2.3.5 required to manually set minimum overlap parameter (-m) to 1000, but the latest one does this automatically.

Let me know if you have any extra questions, Mikhail

devonorourke commented 6 years ago

Cool. So let's say the original argument (just Nanopore data) was the following:

FASTX=path/to/nanopore_reads.fq.gz
OUTDIR=path/to/assembly/flye

flye --nano-raw $FASTX --out-dir $OUTDIR --genome-size 2g --threads 24 --iterations 3

This generated, among other things, a scaffolds.fasta file. Let's say that lives here:

ONTASSEMBLY=/path/to/assembly/flye/scaffolds.fasta

The flye --help menu suggests entering the path to a single fasta file for the --subassemblies argument, but your response suggests I'll be inputting multiple assemblies. I'm unclear on how to include the two assemblies as input.

If you don't mind going along with my syntax as above, let's also say that the additional assembly from old Sanger data lives here:

OLDASSEMBLY=/path/to/assembly/sanger/scaffolds.fa

What I'd like to know is how to incorporate all these into a single argument to run the --subassemblies command.

Thanks again for your clarification.

mikolmogorov commented 6 years ago

You may provide multiple input files under the same argument: flye --subassemblies $ONTASSEMBLY $OLDASSEMBLY -o $OUTDIR -g 2g -t 24 -i 3

Hope this helps!

devonorourke commented 6 years ago

Hi again Mikhail, The --subassemblies command seems to be hanging after calculating mean edge coverage. The .log file contains a lot of info, but I've tried to parse out what I think is representative:

[2018-08-21 12:33:47] root: INFO: Running Flye 2.3.5-release
[2018-08-21 12:33:47] root: DEBUG: Cmd: /mnt/lustre/macmaneslab/devon/.conda/envs/mcscan-env/bin/flye --resume --subassemblies /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta /mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa --out-dir /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad --genome-size 2g --threads 24
[2018-08-21 12:33:47] root: INFO: Resuming previous run
[2018-08-21 12:33:47] root: INFO: Performing repeat analysis
[2018-08-21 12:33:47] root: DEBUG: -----Begin repeat analyser log------
[2018-08-21 12:33:47] root: DEBUG: Running: flye-repeat -l /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/flye.log -t 24 /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/0-assembly/draft_assembly.fasta /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta,/mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/2-repeat 2000000000 /mnt/lustre/macmaneslab/devon/.conda/envs/mcscan-env/lib/python2.7/site-packages/flye/resource/asm_subasm.cfg
[2018-08-21 12:33:47] DEBUG: Build date: Aug  8 2018 19:51:55
[2018-08-21 12:33:47] DEBUG: Parameters:
[2018-08-21 12:33:47] DEBUG:    kmer_size=31
[2018-08-21 12:33:47] DEBUG:    kmer_size_big=31
[2018-08-21 12:33:47] DEBUG:    big_genome_threshold=50000000
[2018-08-21 12:33:47] DEBUG:    low_cutoff_warning=0
[2018-08-21 12:33:47] DEBUG:    hard_min_coverage_rate=50
[2018-08-21 12:33:47] DEBUG:    low_minimum_overlap=1000
[2018-08-21 12:33:47] DEBUG:    high_minimum_overlap=1000
[2018-08-21 12:33:47] DEBUG:    assemble_kmer_sample=2
[2018-08-21 12:33:47] DEBUG:    repeat_graph_kmer_sample=2
[2018-08-21 12:33:47] DEBUG:    read_align_kmer_sample=2
[2018-08-21 12:33:47] DEBUG:    maximum_jump=500
[2018-08-21 12:33:47] DEBUG:    maximum_overhang=100
[2018-08-21 12:33:47] DEBUG:    repeat_kmer_rate=100
[2018-08-21 12:33:47] DEBUG:    assemble_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG:    repeat_graph_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG:    read_align_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG:    max_coverage_drop_rate=10
[2018-08-21 12:33:47] DEBUG:    chimera_window=100
[2018-08-21 12:33:47] DEBUG:    min_reads_in_contig=4
[2018-08-21 12:33:47] DEBUG:    max_inner_reads=10
[2018-08-21 12:33:47] DEBUG:    max_inner_fraction=0.25
[2018-08-21 12:33:47] DEBUG:    add_unassembled_reads=1
[2018-08-21 12:33:47] DEBUG:    max_separation=500
[2018-08-21 12:33:47] DEBUG:    tip_length_threshold=1000
[2018-08-21 12:33:47] DEBUG:    unique_edge_length=50000
[2018-08-21 12:33:47] DEBUG:    min_repeat_res_support=0.51
[2018-08-21 12:33:47] DEBUG:    out_paths_ratio=5
[2018-08-21 12:33:47] DEBUG:    graph_cov_drop_rate=10
[2018-08-21 12:33:47] DEBUG:    coverage_estimate_window=100
[2018-08-21 12:33:47] DEBUG:    extend_contigs_with_repeats=0
[2018-08-21 12:33:47] DEBUG: Running with k-mer size: 31
[2018-08-21 12:33:47] INFO: Reading sequences
[2018-08-21 12:34:50] DEBUG: Selected minimum overlap 1000
[2018-08-21 12:34:50] INFO: Building repeat graph
[2018-08-21 12:34:50] DEBUG: Hard threshold set to 1
[2018-08-21 12:34:50] DEBUG: Started kmer counting
[2018-08-21 12:36:52] DEBUG: Repetetive k-mer frequency: 120
[2018-08-21 12:36:52] DEBUG: Filtered 134240 repetitive kmers (8.57721e-05)
[2018-08-21 12:37:06] DEBUG: Sampling rate: 2
[2018-08-21 12:37:06] DEBUG: Solid kmers: 175092058
[2018-08-21 12:37:06] DEBUG: Kmer index size: 432548113
[2018-08-21 12:37:06] DEBUG: Mean k-mer frequency: 2.4704
[2018-08-21 12:38:27] DEBUG: Sorting k-mer index
[2018-08-21 12:44:19] DEBUG: Computing transitive closure for overlaps
[2018-08-21 12:44:23] DEBUG: Found 3229010 overlaps
[2018-08-21 12:44:24] DEBUG: Left 1477636 overlaps after filtering
[2018-08-21 12:44:25] INFO: Median overlap divergence: 0.0160111
[2018-08-21 12:44:25] DEBUG: Sequence divergence distribution: 

    |  *                                                                                                 
    |  *                                                                                                 
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    |  **                                                                                                
    | ****                                                                                               
    | ****                                                                                               
    | ****                                                                                               
    | *****                                                                                              
    | *****                                                                                              
    | *********                                                                                          
    |************************* *                                                                         
    ----------------------------------------------------------------------------------------------------
    0%        5%        10%       15%       20%       25%       30%       35%       40%       45%       

    Q25 = 0.013, Q50 = 0.016, Q75 = 0.022

[2018-08-21 12:44:25] DEBUG: Computing gluepoints
[2018-08-21 12:44:31] DEBUG: Created 913486 gluepoints
[2018-08-21 12:44:34] DEBUG: Tandems removed: 947 left, 947 right, 22828 both
[2018-08-21 12:44:34] DEBUG: Initializing edges
[2018-08-21 12:49:27] DEBUG: Repetetive k-mer frequency: 116
[2018-08-21 12:49:27] DEBUG: Filtered 121037 repetitive kmers (0.000105844)
[2018-08-21 12:49:41] DEBUG: Sampling rate: 2
[2018-08-21 12:49:41] DEBUG: Solid kmers: 1143420396
[2018-08-21 12:49:41] DEBUG: Kmer index size: 1281198685
[2018-08-21 12:49:41] DEBUG: Mean k-mer frequency: 1.1205
[2018-08-21 12:53:26] DEBUG: Sorting k-mer index
[2018-08-21 12:57:23] DEBUG: Total reads : 86779
[2018-08-21 12:57:23] DEBUG: Read with aligned parts : 86263
[2018-08-21 12:57:23] DEBUG: Aligned in one piece : 76102
[2018-08-21 12:57:23] INFO: Aligned read sequence: 3919805131 / 3951871696 (0.991886)
[2018-08-21 12:57:24] INFO: Median overlap divergence: 0.0111136
[2018-08-21 12:57:24] DEBUG: Sequence divergence distribution: 

    |  *                                                                                                 
    |  *                                                                                                 
    |  *                                                                                                 
    |  *                                                                                                 
    |  *                                                                                                 
    |  *                                                                                                 
    |  *                                                                                                 
    | **                                                                                                 
    | **                                                                                                 
    | **                                                                                                 
    | **                                                                                                 
    | ***                                                                                                
    | ***                                                                                                
    | ***                                                                                                
    | ***                                                                                                
    | ****                                                                                               
    | ****  *****                                                                                        
    | ************                                                                                       
    |**************                                                                                      
    |**************************                                                                          
    ----------------------------------------------------------------------------------------------------
    0%        5%        10%       15%       20%       25%       30%       35%       40%       45%       

    Q25 = -0.00033, Q50 = 0.011, Q75 = 0.024

[2018-08-21 12:57:39] INFO: Mean edge coverage: 1
[2018-08-21 12:57:39] DEBUG: -234885    len:150666  cov:2   mult:2
[2018-08-21 12:57:39] DEBUG: 234885 len:150666  cov:2   mult:2
[2018-08-21 12:57:39] DEBUG: -234884    len:1289    cov:2   mult:2
[2018-08-21 12:57:39] DEBUG: 234884 len:1289    cov:2   mult:2
...
many more lines
...
[2018-08-21 13:37:55] DEBUG: Contig: 58726: 58726
[2018-08-21 13:37:55] DEBUG: Contig: 58758: 58758
[2018-08-21 13:37:55] DEBUG: Contig: 58759: 58759
[2018-08-21 13:37:55] DEBUG: Contig: 58761: 58761
[2018-08-21 13:37:55] DEBUG: Contig: 58762: 58762
[2018-08-21 13:37:55] DEBUG: Contig: 58768: 58768
[2018-08-21 13:37:55] DEBUG: Contig: 58798: 58798
[2018-08-21 13:37:55] DEBUG: Contig: 58804: 58804
[2018-08-21 13:37:55] DEBUG: Contig: 58811: 58811
[2018-08-21 13:37:55] DEBUG: Contig: 58819: 58819
[2018-08-21 13:37:55] DEBUG: Contig: 58826: 58826
[2018-08-21 13:37:55] DEBUG: Contig: 58841: 58841
[2018-08-21 13:37:55] DEBUG: Contig: 58846: 58846
[2018-08-21 13:37:55] DEBUG: Contig: 58882: 58882
[2018-08-21 13:37:55] DEBUG: Contig: 58890: 58890
[2018-08-21 13:37:55] DEBUG: Contig: 58922: 58922
[2018-08-21 13:37:55] DEBUG: Contig: 58927: 58927
[2018-08-21 13:37:55] DEBUG: Contig: 58931: 58931
[2018-08-21 13:37:55] DEBUG: Contig: 58934: 58934
[2018-08-21 13:37:55] DEBUG: Contig: 58971: 58971

This .log file represents a restart. The older log file produced the exact same stop point (same values for Contig: .... However the previous run was stuck here for about 18 hours.

The stderr log file shows this:

[2018-08-21 12:33:47] INFO: Running Flye 2.3.5-release
[2018-08-21 12:33:47] INFO: Resuming previous run
[2018-08-21 12:33:47] INFO: Performing repeat analysis
[2018-08-21 12:33:47] INFO: Reading sequences
[2018-08-21 12:34:50] INFO: Building repeat graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2018-08-21 12:44:25] INFO: Median overlap divergence: 0.0160111
[2018-08-21 12:47:34] INFO: Aligning reads to the graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 
[2018-08-21 12:57:23] INFO: Aligned read sequence: 3919805131 / 3951871696 (0.991886)
[2018-08-21 12:57:24] INFO: Median overlap divergence: 0.0111136
[2018-08-21 12:57:39] INFO: Mean edge coverage: 1
[2018-08-21 13:05:17] INFO: Resolving repeats
[2018-08-21 13:34:27] INFO: Generating contigs
[2018-08-21 13:37:21] INFO: Generated 47845 contigs

Any thoughts? Perhaps I'm missing an important argument in the command?

Cheers

mikolmogorov commented 6 years ago

I don't think it hang, it rather being extremely slow.. Looks like some functions in the very end were not optimized for assemblies with large number of contigs. We will try to fix it in the future. In the mean time, do you have a file "2-repeat/graph_paths.fasta" generated? If so, it is basically your final assembly (no need to wait until the rest of pipeline finishes).

devonorourke commented 6 years ago

Ok, good to know. I'll leave the job for a while and see if anything changes. There is indeed a completed graph_paths.fasta.

Could you please point me to any documentation or scripts that outline exactly what is going on when I run --subassemblies? The main python script seems to look for the subasm argument, but I couldn't figure out where the script goes from there. I'm just trying to understand more about the nature of this assembly option.

Thanks very much

mikolmogorov commented 6 years ago

This is basically the same assembly algorithm - input contigs are treated as "reads". The difference is that some parameters are tuned for high quality sequence with almost no errors (different configuration file from "flye/resource" is loaded).

devonorourke commented 6 years ago

Quick update: the program did indeed finish running. It appears that the step in which the DOT file is written was the major hangup - it took about 30 hours to get that step done. Here's a little snippet of the log file where you can see that lag time in the process:

[2018-08-21 17:27:22] DEBUG: Contig: 58981: 58981
[2018-08-21 17:27:22] DEBUG: Contig: 58989: 58989
[2018-08-22 22:33:48] DEBUG: Writing Dot
[2018-08-22 22:35:59] DEBUG: Writing FASTA
[2018-08-22 22:37:26] DEBUG: Writing Gfa
-----------End assembly log------------
[2018-08-22 22:44:33] root: INFO: Polishing genome (1/1)
[2018-08-22 22:44:33] root: INFO: Running Minimap2
[2018-08-22 22:44:33] root: DEBUG: Running: flye-minimap2 /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/2-repeat/graph_paths.fasta /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta /mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa -a -Q -w5 -m100 -g10000 --max-chain-skip 25 -t 24 -Hk19
[2018-08-22 22:50:12] root: DEBUG: Sorting alignment file
[2018-08-22 22:50:23] root: INFO: Separating alignment into bubbles
[2018-08-22 23:04:14] root: INFO: Alignment error rate: 0.0391785863718
[2018-08-22 23:04:14] root: DEBUG: Generated 134656935 bubbles
[2018-08-22 23:04:14] root: DEBUG: Split 1268902 long bubbles
[2018-08-22 23:04:14] root: DEBUG: Skipped 1285290 empty bubbles
[2018-08-22 23:04:14] root: DEBUG: Skipped 653 bubbles with long branches
[2018-08-22 23:04:14] root: INFO: Correcting bubbles
[2018-08-23 00:13:51] root: INFO: Assembly statistics:

    Total length:   2409303436
    Contigs:    37069
    Scaffolds:  37045
    Scaffolds N50:  173192
    Largest scf:    1566349
    Mean coverage:  0
mikolmogorov commented 6 years ago

Thanks! The assembly looks fragmented (that what probably caused the slowdown) - what were the original N50s of the input contigs?

devonorourke commented 6 years ago

Just to clarify, the input contigs are the original Sanger-derived contigs, yes?

The Sanger-only contigs.fasta:

n   n:500   L50 min N80 N50 N20 E-size  max sum name
72784   72784   8574    601 24193   64330   133670  86581   673249  1.966e9 LUsanger-contigs.fa

The metrics for the Nanopore-only flye contigs.fasta:

n   n:500   L50 min N80 N50 N20 E-size  max sum name
15290   15277   1890    525 117947  300812  621345  389958  2406317 1.987e9 LUont-contigs.fasta

The Nanopore+Sanger combined contigs.fasta using flye --subassemblies:

n   n:500   L50 min N80 N50 N20 E-size  max sum name
37069   37064   3609    523 62947   173176  414185  256831  1566349 2.409e9 contigs.fasta

One of the curious things to me is that the genome sizes of the Nanopore-only and Sanger-only contigs are really, really similar. Once they are combined with --subassemblies the genome size appears to get a bit larger. I'm running Mummer's dnadiff program at the moment to get a better sense of whether these two independent assemblies contain a lot of different information.

I'm open to any directions you think might be useful for improving the assembly. Thanks!

mikolmogorov commented 5 years ago

The original issue seems to be resolved in the latest Flye versions. Please open a new one or write me an email if you have further questions.