rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

Error: miniasm assembly failed #38

Closed ghost closed 6 years ago

ghost commented 6 years ago
Assembling contigs and long reads with miniasm (2017-07-24 18:16:16)

Saving to /miniasm_assembly/01_assembly_reads.fastq:
  61,957 long reads

Finding overlaps with minimap... 
success
  520,428,235 overlaps

Assembling reads with miniasm... 
empty result
Error: miniasm assembly failed

Is there any more info on why this happened? Because that's all it says. Thx.

rrwick commented 6 years ago

Sorry, that's not a very informative error message! It's something of a catch-all for any case where miniasm failed to generate an assembly graph.

My first suspicion would be low depth. How many bp of long reads do you have and how big is your genome? Miniasm won't have much luck with low depth long reads, so in that case a failed miniasm assembly wouldn't surprise me. And if this is a hybrid assembly, then a failed miniasm assembly isn't necessarily a problem - Unicycler can use other means to scaffold the Illumina assembly graph.

If, however, your long reads are decently high depth (20x or more) and it still failed, then I'm not so sure and we'll have to dig deeper. Let me know!

Ryan
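For anyone wanting to answer the "how many bp of long reads, how big is the genome" question quickly, here is a minimal sketch that estimates mean depth as total read bases divided by genome size. It assumes an uncompressed FASTQ with four lines per record; the function names, the file name, and the genome size in the usage comment are made up for illustration and are not part of Unicycler.

```python
from typing import Iterable


def total_bases(fastq_lines: Iterable[str]) -> int:
    """Sum sequence lengths in a FASTQ with exactly four lines per record."""
    total = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += len(line.strip())
    return total


def estimated_depth(fastq_lines: Iterable[str], genome_size_bp: int) -> float:
    """Mean depth = total read bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases(fastq_lines) / genome_size_bp


# Usage (hypothetical file name and genome size):
# with open("long_reads.fastq") as handle:
#     print(f"{estimated_depth(handle, 5_000_000):.1f}x")
```

If the result is well below the roughly 20x mentioned above, low depth is the first thing to suspect.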

ghost commented 6 years ago

Actually it's with 20,000x :sweat_smile: I wanted to see if the assembly result with all reads will be the same as with subsampled sets.

It's an artificial construct of around 50 kb. I ran it with 100x before and it gave me two circular unitigs of different lengths: 35 kb and 28 kb. They have different coverage reported at the end (1.00x and 0.01x), though I am not sure what is actually happening in the "rotating circular replicons" section or where the coverage reported there comes from.

rrwick commented 6 years ago

Oh wow! I'm not sure how miniasm will handle either such high depths or such a small genome. When I've tried Unicycler on small datasets (like just plasmid sequences) I've noticed that the miniasm step usually fails.

Do you have Illumina reads for your sample? I.e. are you doing a hybrid assembly? If so, it's not too tragic if miniasm fails, because the miniasm assembly is just one way that Unicycler scaffolds a hybrid assembly together. But if you're doing a long-read-only assembly, then a failed miniasm assembly is the end of the line.

I'm not exactly sure why it might be failing because I don't understand all of miniasm's inner workings. And since Unicycler is intended as an assembler only for bacterial genomes, a failure on a 50 kbp construct doesn't worry me too much. I might suggest Canu instead?

I'm going to close this issue now, but I will add this to my list of future improvements for Unicycler: a better understanding and reporting of miniasm's reasons for failure. Thanks!

rrwick commented 6 years ago

One more thought: your very uneven depth of coverage values (1.00 and 0.01) make me wonder if your subsampled 100x reads might not be an even representation of your construct? Maybe try different subsamples? Or maybe try subsampling for reads of a certain size, like 10 kbp?

Chimeric Nanopore reads (reads that are two separate sequences artificially glued together) definitely happen, so perhaps that's causing a problem. Again, give Canu a try: it's pretty good about splitting up chimeras.
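The "subsample for reads of a certain size" suggestion could look something like the sketch below: keep only reads inside a length window, shuffle them with a fixed seed, and stop once a target depth is reached. This is just one way to do it, assuming an uncompressed four-line FASTQ with no blank lines inside records; all function names and parameters are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, ...]  # (header, sequence, plus line, qualities)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield four-line FASTQ records, ignoring blank lines and a trailing partial record."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        yield tuple(rows[i:i + 4])


def subsample_by_length(records: Iterable[Record], min_len: int, max_len: int,
                        genome_size_bp: int, target_depth: float,
                        seed: int = 0) -> List[Record]:
    """Randomly pick reads with min_len <= length <= max_len up to the target depth."""
    in_window = [r for r in records if min_len <= len(r[1]) <= max_len]
    random.Random(seed).shuffle(in_window)  # fixed seed keeps the subsample reproducible
    budget = genome_size_bp * target_depth
    picked, bases = [], 0
    for rec in in_window:
        if bases >= budget:
            break
        picked.append(rec)
        bases += len(rec[1])
    return picked
```

Changing `seed` gives a different subsample, which is a cheap way to check whether an odd result is specific to one particular draw of reads.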

ghost commented 6 years ago

No, this is ONT-only, no ILMN available. Because it is such a small product it's not hard to get enough coverage so kinda pointless to invest into additional ILMN because one nanopore read should be able to cover the complete plasmid. Don't see why the assembly is being weird or why the assemblers are having a hard time.

The overall read distribution is sort of "bimodal". There's a small tail at 0-32 kb and the majority of reads are 32-35 kb -- I sampled from there. I also did a Canu assembly on the same subset and it produces a single circular unitig, which is really just the two unitigs I mentioned above concatenated together. This is why it's puzzling to me that a different coverage is being reported for these two assemblies.

PS. I use Porechop as a preprocessing step and discard chimeric reads, or at least those where an adapter was identified somewhere other than the beginning or end of the read.

rrwick commented 6 years ago

Porechop removes chimeras where there's an adapter in the middle, but that doesn't account for all chimeras. There can be non-middle-adapter chimeras too. But I have no idea if that's the source of your problem - that was just speculation.

Does your construct have some large repeat in it? Unicycler assigns read depth to miniasm assemblies when it does Racon polishing. If the two pieces of your assembly have tons of sequence in common, it may be that minimap is aligning reads to them unevenly...
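To illustrate how uneven alignment turns into uneven depth, here is a rough sketch that computes depth per contig from read-to-contig alignments in PAF format (aligned target bases divided by target length). This is not Unicycler's actual code, just an assumption-laden illustration; it relies only on the standard PAF column order (target name, length, start, and end in columns 6 to 9).

```python
from collections import defaultdict
from typing import Dict, Iterable


def depth_per_target(paf_lines: Iterable[str]) -> Dict[str, float]:
    """Depth per target sequence = summed aligned target span / target length."""
    aligned = defaultdict(int)
    lengths = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip lines that are not complete PAF records
            continue
        name, length = fields[5], int(fields[6])
        start, end = int(fields[7]), int(fields[8])
        lengths[name] = length
        aligned[name] += end - start
    return {name: aligned[name] / lengths[name] for name in lengths}
```

If two contigs share a long stretch of sequence and the aligner places most reads on only one of them, the other ends up with a depth close to zero even though the reads cover both equally well.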

I don't mean to deflect, but I suspect Unicycler just isn't the right tool for your case. It's a bacterial whole genome assembler and was only designed and tested with those datasets in mind. Canu is a more general long read assembler and is probably more appropriate here.

Some final thoughts, which could apply to either Canu or Unicycler:

Ryan

ghost commented 6 years ago

Yeah, I agree it definitely isn't 100% chimera-free, but at least the most obvious ones were removed.

There should be repeats but no repeat is larger than the reads I used in this case (30 kb), so I didn't expect that there would be any issues on that side.

I've already subsampled only reads with >Q9 but I guess I could also try higher quality reads. Well, the idea was to use as little sequencing time as possible, and realistically there won't be too many high quality reads in a dataset.

Anyway, thanks Ryan for taking time to give suggestions!

zachcp commented 6 years ago

Hi @rrwick and @DNAsaurus,

I've run into the same issue and I think it has to do with coverage/pruning during miniasm. For example, I can get output from a small genome if I specify a low coverage whereas the default fails.

It would be nice to be able to pass in args to miniasm through Unicycler. While I agree that Canu may be better for this, it's hard to argue with the miniasm speedup.

# fails (output is empty)
miniasm -f smallgneome.fasta reads.paf.gz > reads.gfa

# passes
miniasm -c 2 -e 2 -f 201710211_full_1d2_vectorsubtracted.fasta reads.paf.gz > reads.gfa

rrwick commented 6 years ago

Thanks for the example. I like the idea, and I'll add "pass miniasm options" to my list of future Unicycler enhancements!
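Until something like that exists in Unicycler, the comparison from zachcp's example can be scripted outside of it. The sketch below builds the same two miniasm command lines and checks whether the resulting GFA has any segment (`S`) lines. The helper names are hypothetical. As far as I can tell from miniasm's usage text, `-c` is the minimum coverage and `-e` is the small unitig threshold, but treat that as an assumption and check the help output of your miniasm version.

```python
import subprocess
from typing import List, Optional


def build_miniasm_cmd(reads: str, paf: str,
                      min_coverage: Optional[int] = None,
                      small_unitig: Optional[int] = None) -> List[str]:
    """Build a miniasm command line; options left as None use miniasm's defaults."""
    cmd = ["miniasm"]
    if min_coverage is not None:
        cmd += ["-c", str(min_coverage)]  # assumed meaning: minimum coverage
    if small_unitig is not None:
        cmd += ["-e", str(small_unitig)]  # assumed meaning: small unitig threshold
    return cmd + ["-f", reads, paf]


def gfa_has_segments(gfa_text: str) -> bool:
    """True if the GFA text contains at least one segment (S) line."""
    return any(line.startswith("S\t") for line in gfa_text.splitlines())


def assemble(reads: str, paf: str, **options: Optional[int]) -> str:
    """Run miniasm (must be on PATH) and return the GFA it writes to stdout."""
    result = subprocess.run(build_miniasm_cmd(reads, paf, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout


# Usage (hypothetical file names):
# gfa = assemble("reads.fasta", "reads.paf.gz", min_coverage=2, small_unitig=2)
# print("assembly produced" if gfa_has_segments(gfa) else "empty result")
```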

ExplodingCabbage commented 4 years ago

Just experienced the same failure mode when trying Unicycler for the first time. In case it helps, here's my miniasm.log:

$ cat miniasm.out 
===> Step 1: reading read mappings <===
[M::read_hits_file::1.256*45.56] read 2819185 hits; stored 154231 hits and 1196 sequences (2627115 bp)

===> Step 2: 1-pass (crude) read selection <===
[M::filter_reads_using_depth::1.308*43.78] 1069 query sequences remain after sub
[M::filter_hits_using_span::1.310*43.71] 153629 hits remain after cut
[M::filter_hits_using_overhang::1.312*43.63] 153629 hits remain after filtering; crude coverage after filtering: 73.5302

===> Step 3: 2-pass (fine) read selection <===
[M::filter_reads_using_depth::1.349*42.46] 1038 query sequences remain after sub
[M::filter_hits_using_span::1.352*42.38] 150010 hits remain after cut
[M::remove_contained_reads::1.381*41.50] 5 sequences and 0 hits remain after containment removal

===> Step 4: graph cleaning <===
[M::make_string_graph] read 0 arcs

===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs

===> Step 4.2: initial tip cutting and bubble popping <===
[M::cut_tips] cut 5 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::pop_bubbles] popped 0 bubbles and trimmed 0 tips

===> Step 4.3: cutting short overlaps (%d rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps

===> Step 4.4: removing short internal sequences and bi-loops <===
[M::cut_short_internal] cut 0 internal sequences
[M::cut_biloops] cut 0 small bi-loops
[M::cut_tips] cut 0 tips
[M::pop_bubbles] popped 0 bubbles and trimmed 0 tips

===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps

Real time: 1.67508 sec; CPU: 57.6117 sec
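Reading the log above, the telling lines seem to be "5 sequences and 0 hits remain after containment removal" followed by "read 0 arcs": almost every read was discarded as contained in another read, so there were no overlaps left to build a graph from. A small sketch for pulling those numbers out of a miniasm log automatically follows; the regular expressions are based only on the log lines shown here, so other miniasm versions may word them differently.

```python
import re
from typing import Dict, Iterable, Optional

CONTAINMENT = re.compile(
    r"\[M::remove_contained_reads\S*\] (\d+) sequences and (\d+) hits remain")
ARCS = re.compile(r"\[M::make_string_graph\] read (\d+) arcs")


def summarise_miniasm_log(lines: Iterable[str]) -> Dict[str, Optional[int]]:
    """Extract sequences/hits left after containment removal and the arc count."""
    summary: Dict[str, Optional[int]] = {"sequences": None, "hits": None, "arcs": None}
    for line in lines:
        match = CONTAINMENT.search(line)
        if match:
            summary["sequences"] = int(match.group(1))
            summary["hits"] = int(match.group(2))
        match = ARCS.search(line)
        if match:
            summary["arcs"] = int(match.group(1))
    return summary


def graph_is_empty(summary: Dict[str, Optional[int]]) -> bool:
    """True when the log reports a string graph with zero arcs."""
    return summary["arcs"] == 0


# Usage (file name as in the comment above):
# with open("miniasm.out") as handle:
#     print(summarise_miniasm_log(handle))
```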

zhaoxvwahaha commented 1 year ago

@rrwick Hi, did you add "pass miniasm options" to Unicycler? I downloaded the newest version but can't find any related parameters.