Closed devonorourke closed 5 years ago
Hi Devon,
Glad that it worked well for you! Yes, you can definitely try --subassemblies
option for merging these two assemblies. You basically need to make a separate run and provide the two assemblies as input. One thing to keep in mind is that if there are structural differences between the two sets of contigs, they will be resolved arbitrary in the combined assembly (since there is no obvious way to check which of the two variants is correct). Also, the versions prior to 2.3.5 required to manually set minimum overlap parameter (-m) to 1000, but the latest one does this automatically.
Let me know if you have any extra questions, Mikhail
Cool. So let's say the original argument (just Nanopore data) was the following:
FASTX=path/to/nanopore_reads.fq.gz
OUTDIR=path/to/assembly/flye
flye --nano-raw $FASTX --out-dir $OUTDIR --genome-size 2g --threads 24 --iterations 3
This generated, among other things, a scaffolds.fasta
file. Let's say that lives here:
ONTASSEMBLY=/path/to/assembly/flye/scaffolds.fasta
The flye --help
menu suggests entering the path to a single fasta file for the --subassemblies
argument, but your response suggests I'll be inputting multiple assemblies. I'm unclear on how to include the two assemblies as input.
If you don't mind going along with my syntax as above, let's also say that the additional assembly from old Sanger data lives here:
OLDASSEMBLY=/path/to/assembly/sanger/scaffolds.fa
What I'd like to know is how to incorporate all these into a single argument to run the --subassemblies command.
Thanks again for your clarification.
You may provide multiple input files under the same argument:
flye --subassemblies $ONTASSEMBLY $OLDASSEMBLY -o $OUTDIR -g 2g -t 24 -i 3
Hope this helps!
Hi again Mikhail,
The --subassemblies
command seems to be hanging after calculating mean edge coverage. The .log
file contains a lot of info, but I've tried to parse out what I think is representative:
[2018-08-21 12:33:47] root: INFO: Running Flye 2.3.5-release
[2018-08-21 12:33:47] root: DEBUG: Cmd: /mnt/lustre/macmaneslab/devon/.conda/envs/mcscan-env/bin/flye --resume --subassemblies /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta /mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa --out-dir /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad --genome-size 2g --threads 24
[2018-08-21 12:33:47] root: INFO: Resuming previous run
[2018-08-21 12:33:47] root: INFO: Performing repeat analysis
[2018-08-21 12:33:47] root: DEBUG: -----Begin repeat analyser log------
[2018-08-21 12:33:47] root: DEBUG: Running: flye-repeat -l /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/flye.log -t 24 /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/0-assembly/draft_assembly.fasta /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta,/mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/2-repeat 2000000000 /mnt/lustre/macmaneslab/devon/.conda/envs/mcscan-env/lib/python2.7/site-packages/flye/resource/asm_subasm.cfg
[2018-08-21 12:33:47] DEBUG: Build date: Aug 8 2018 19:51:55
[2018-08-21 12:33:47] DEBUG: Parameters:
[2018-08-21 12:33:47] DEBUG: kmer_size=31
[2018-08-21 12:33:47] DEBUG: kmer_size_big=31
[2018-08-21 12:33:47] DEBUG: big_genome_threshold=50000000
[2018-08-21 12:33:47] DEBUG: low_cutoff_warning=0
[2018-08-21 12:33:47] DEBUG: hard_min_coverage_rate=50
[2018-08-21 12:33:47] DEBUG: low_minimum_overlap=1000
[2018-08-21 12:33:47] DEBUG: high_minimum_overlap=1000
[2018-08-21 12:33:47] DEBUG: assemble_kmer_sample=2
[2018-08-21 12:33:47] DEBUG: repeat_graph_kmer_sample=2
[2018-08-21 12:33:47] DEBUG: read_align_kmer_sample=2
[2018-08-21 12:33:47] DEBUG: maximum_jump=500
[2018-08-21 12:33:47] DEBUG: maximum_overhang=100
[2018-08-21 12:33:47] DEBUG: repeat_kmer_rate=100
[2018-08-21 12:33:47] DEBUG: assemble_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG: repeat_graph_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG: read_align_ovlp_ident=0.02
[2018-08-21 12:33:47] DEBUG: max_coverage_drop_rate=10
[2018-08-21 12:33:47] DEBUG: chimera_window=100
[2018-08-21 12:33:47] DEBUG: min_reads_in_contig=4
[2018-08-21 12:33:47] DEBUG: max_inner_reads=10
[2018-08-21 12:33:47] DEBUG: max_inner_fraction=0.25
[2018-08-21 12:33:47] DEBUG: add_unassembled_reads=1
[2018-08-21 12:33:47] DEBUG: max_separation=500
[2018-08-21 12:33:47] DEBUG: tip_length_threshold=1000
[2018-08-21 12:33:47] DEBUG: unique_edge_length=50000
[2018-08-21 12:33:47] DEBUG: min_repeat_res_support=0.51
[2018-08-21 12:33:47] DEBUG: out_paths_ratio=5
[2018-08-21 12:33:47] DEBUG: graph_cov_drop_rate=10
[2018-08-21 12:33:47] DEBUG: coverage_estimate_window=100
[2018-08-21 12:33:47] DEBUG: extend_contigs_with_repeats=0
[2018-08-21 12:33:47] DEBUG: Running with k-mer size: 31
[2018-08-21 12:33:47] INFO: Reading sequences
[2018-08-21 12:34:50] DEBUG: Selected minimum overlap 1000
[2018-08-21 12:34:50] INFO: Building repeat graph
[2018-08-21 12:34:50] DEBUG: Hard threshold set to 1
[2018-08-21 12:34:50] DEBUG: Started kmer counting
[2018-08-21 12:36:52] DEBUG: Repetetive k-mer frequency: 120
[2018-08-21 12:36:52] DEBUG: Filtered 134240 repetitive kmers (8.57721e-05)
[2018-08-21 12:37:06] DEBUG: Sampling rate: 2
[2018-08-21 12:37:06] DEBUG: Solid kmers: 175092058
[2018-08-21 12:37:06] DEBUG: Kmer index size: 432548113
[2018-08-21 12:37:06] DEBUG: Mean k-mer frequency: 2.4704
[2018-08-21 12:38:27] DEBUG: Sorting k-mer index
[2018-08-21 12:44:19] DEBUG: Computing transitive closure for overlaps
[2018-08-21 12:44:23] DEBUG: Found 3229010 overlaps
[2018-08-21 12:44:24] DEBUG: Left 1477636 overlaps after filtering
[2018-08-21 12:44:25] INFO: Median overlap divergence: 0.0160111
[2018-08-21 12:44:25] DEBUG: Sequence divergence distribution:
| *
| *
| **
| **
| **
| **
| **
| **
| **
| **
| **
| **
| **
| ****
| ****
| ****
| *****
| *****
| *********
|************************* *
----------------------------------------------------------------------------------------------------
0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
Q25 = 0.013, Q50 = 0.016, Q75 = 0.022
[2018-08-21 12:44:25] DEBUG: Computing gluepoints
[2018-08-21 12:44:31] DEBUG: Created 913486 gluepoints
[2018-08-21 12:44:34] DEBUG: Tandems removed: 947 left, 947 right, 22828 both
[2018-08-21 12:44:34] DEBUG: Initializing edges
[2018-08-21 12:49:27] DEBUG: Repetetive k-mer frequency: 116
[2018-08-21 12:49:27] DEBUG: Filtered 121037 repetitive kmers (0.000105844)
[2018-08-21 12:49:41] DEBUG: Sampling rate: 2
[2018-08-21 12:49:41] DEBUG: Solid kmers: 1143420396
[2018-08-21 12:49:41] DEBUG: Kmer index size: 1281198685
[2018-08-21 12:49:41] DEBUG: Mean k-mer frequency: 1.1205
[2018-08-21 12:53:26] DEBUG: Sorting k-mer index
[2018-08-21 12:57:23] DEBUG: Total reads : 86779
[2018-08-21 12:57:23] DEBUG: Read with aligned parts : 86263
[2018-08-21 12:57:23] DEBUG: Aligned in one piece : 76102
[2018-08-21 12:57:23] INFO: Aligned read sequence: 3919805131 / 3951871696 (0.991886)
[2018-08-21 12:57:24] INFO: Median overlap divergence: 0.0111136
[2018-08-21 12:57:24] DEBUG: Sequence divergence distribution:
| *
| *
| *
| *
| *
| *
| *
| **
| **
| **
| **
| ***
| ***
| ***
| ***
| ****
| **** *****
| ************
|**************
|**************************
----------------------------------------------------------------------------------------------------
0% 5% 10% 15% 20% 25% 30% 35% 40% 45%
Q25 = -0.00033, Q50 = 0.011, Q75 = 0.024
[2018-08-21 12:57:39] INFO: Mean edge coverage: 1
[2018-08-21 12:57:39] DEBUG: -234885 len:150666 cov:2 mult:2
[2018-08-21 12:57:39] DEBUG: 234885 len:150666 cov:2 mult:2
[2018-08-21 12:57:39] DEBUG: -234884 len:1289 cov:2 mult:2
[2018-08-21 12:57:39] DEBUG: 234884 len:1289 cov:2 mult:2
...
many more lines
...
[2018-08-21 13:37:55] DEBUG: Contig: 58726: 58726
[2018-08-21 13:37:55] DEBUG: Contig: 58758: 58758
[2018-08-21 13:37:55] DEBUG: Contig: 58759: 58759
[2018-08-21 13:37:55] DEBUG: Contig: 58761: 58761
[2018-08-21 13:37:55] DEBUG: Contig: 58762: 58762
[2018-08-21 13:37:55] DEBUG: Contig: 58768: 58768
[2018-08-21 13:37:55] DEBUG: Contig: 58798: 58798
[2018-08-21 13:37:55] DEBUG: Contig: 58804: 58804
[2018-08-21 13:37:55] DEBUG: Contig: 58811: 58811
[2018-08-21 13:37:55] DEBUG: Contig: 58819: 58819
[2018-08-21 13:37:55] DEBUG: Contig: 58826: 58826
[2018-08-21 13:37:55] DEBUG: Contig: 58841: 58841
[2018-08-21 13:37:55] DEBUG: Contig: 58846: 58846
[2018-08-21 13:37:55] DEBUG: Contig: 58882: 58882
[2018-08-21 13:37:55] DEBUG: Contig: 58890: 58890
[2018-08-21 13:37:55] DEBUG: Contig: 58922: 58922
[2018-08-21 13:37:55] DEBUG: Contig: 58927: 58927
[2018-08-21 13:37:55] DEBUG: Contig: 58931: 58931
[2018-08-21 13:37:55] DEBUG: Contig: 58934: 58934
[2018-08-21 13:37:55] DEBUG: Contig: 58971: 58971
This .log
file represents a restart. The older log file produced the exact same stop point (same values for Contig: ...
. However the previous run was stuck here for about 18 hours.
The stderr log file shows this:
[2018-08-21 12:33:47] INFO: Running Flye 2.3.5-release
[2018-08-21 12:33:47] INFO: Resuming previous run
[2018-08-21 12:33:47] INFO: Performing repeat analysis
[2018-08-21 12:33:47] INFO: Reading sequences
[2018-08-21 12:34:50] INFO: Building repeat graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2018-08-21 12:44:25] INFO: Median overlap divergence: 0.0160111
[2018-08-21 12:47:34] INFO: Aligning reads to the graph
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
[2018-08-21 12:57:23] INFO: Aligned read sequence: 3919805131 / 3951871696 (0.991886)
[2018-08-21 12:57:24] INFO: Median overlap divergence: 0.0111136
[2018-08-21 12:57:39] INFO: Mean edge coverage: 1
[2018-08-21 13:05:17] INFO: Resolving repeats
[2018-08-21 13:34:27] INFO: Generating contigs
[2018-08-21 13:37:21] INFO: Generated 47845 contigs
Any thoughts? Perhaps I'm missing an important argument in the command?
Cheers
I don't think it hang, it rather being extremely slow.. Looks like some functions in the very end were not optimized for assemblies with large number of contigs. We will try to fix it in the future. In the mean time, do you have a file "2-repeat/graph_paths.fasta" generated? If so, it is basically your final assembly (no need to wait until the rest of pipeline finishes).
Ok, good to know. I'll leave the job for a while and see if anything changes. There is indeed a completed graph_paths.fasta
.
Could you please point me to any documentation or scripts that outline exactly what is going on when I run --subassemblies
? The main python script seems to look for the subasm
argument, but I couldn't figure out where the script goes from there. I'm just trying to understand more about the nature of this assembly option.
Thanks very much
This is basically the same assembly algorithm - input contigs are treated as "reads". The difference is that some parameters are tuned for high quality sequence with almost no errors (different configuration file from "flye/resource" is loaded).
Quick update: the program did indeed finish running. It appears that the step in which the DOT file is written was the major hangup - it took about 30 hours to get that step done. Here's a little snippet of the log file where you can see that lag time in the process:
[2018-08-21 17:27:22] DEBUG: Contig: 58981: 58981
[2018-08-21 17:27:22] DEBUG: Contig: 58989: 58989
[2018-08-22 22:33:48] DEBUG: Writing Dot
[2018-08-22 22:35:59] DEBUG: Writing FASTA
[2018-08-22 22:37:26] DEBUG: Writing Gfa
-----------End assembly log------------
[2018-08-22 22:44:33] root: INFO: Polishing genome (1/1)
[2018-08-22 22:44:33] root: INFO: Running Minimap2
[2018-08-22 22:44:33] root: DEBUG: Running: flye-minimap2 /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/wBroad/2-repeat/graph_paths.fasta /mnt/lustre/macmaneslab/devon/batgenomes/assembly/flye/mylu/rough/scaffolds.fasta /mnt/lustre/macmaneslab/devon/batgenomes/alignments/mylu/mylu_bwa/mylu_contigs.fa -a -Q -w5 -m100 -g10000 --max-chain-skip 25 -t 24 -Hk19
[2018-08-22 22:50:12] root: DEBUG: Sorting alignment file
[2018-08-22 22:50:23] root: INFO: Separating alignment into bubbles
[2018-08-22 23:04:14] root: INFO: Alignment error rate: 0.0391785863718
[2018-08-22 23:04:14] root: DEBUG: Generated 134656935 bubbles
[2018-08-22 23:04:14] root: DEBUG: Split 1268902 long bubbles
[2018-08-22 23:04:14] root: DEBUG: Skipped 1285290 empty bubbles
[2018-08-22 23:04:14] root: DEBUG: Skipped 653 bubbles with long branches
[2018-08-22 23:04:14] root: INFO: Correcting bubbles
[2018-08-23 00:13:51] root: INFO: Assembly statistics:
Total length: 2409303436
Contigs: 37069
Scaffolds: 37045
Scaffolds N50: 173192
Largest scf: 1566349
Mean coverage: 0
Thanks! The assembly looks fragmented (that what probably caused the slowdown) - what were the original N50s of the input contigs?
Just to clarify, the input contigs are the original Sanger-derived contigs, yes?
The Sanger-only contigs.fasta:
n n:500 L50 min N80 N50 N20 E-size max sum name
72784 72784 8574 601 24193 64330 133670 86581 673249 1.966e9 LUsanger-contigs.fa
The metrics for the Nanopore-only flye
contigs.fasta:
n n:500 L50 min N80 N50 N20 E-size max sum name
15290 15277 1890 525 117947 300812 621345 389958 2406317 1.987e9 LUont-contigs.fasta
The Nanopore+Sanger combined contigs.fasta using flye --subassemblies
:
n n:500 L50 min N80 N50 N20 E-size max sum name
37069 37064 3609 523 62947 173176 414185 256831 1566349 2.409e9 contigs.fasta
One of the curious things to me is that the genome sizes of the Nanopore-only and Sanger-only contigs are really, really similar. Once they are combined with --subassemblies
the genome size appears to get a bit larger. I'm running Mummer's dnadiff
program at the moment to get a better sense of whether these two independent assemblies contain a lot of different information.
I'm open to any directions you think might be useful for improving the assembly. Thanks!
The original issue seems to be resolved in the latest Flye versions. Please open a new one or write me an email if you have further questions.
Hi Mikhail, Flye is working great for my mammalian genome assemblies (using raw Nanopore data). One of the species I'm working on has an existing reference completed over a decade ago with Sanger data and I was wondering if the subassemblies argument in Flye could work by combining my Nanopore data assembled with Flye with these Sanger-derived contigs? Where would that process start within Flye? Thanks very much, Devon